What if that regression-discontinuity paper had only reported local linear model results, and with no graph?

We had an interesting discussion the other day regarding a regression discontinuity disaster.

In my post I shone a light on this fitted model:

Most of the commenters seemed to understand the concern with these graphs, that the upward slopes in the curves directly contribute to the estimated negative value at the discontinuity leading to a model that doesn’t seem to make sense, but I did get an interesting push-back that is worth discussing further. Commenter Sam wrote:

You criticize the authors for using polynomials. Here is something you yourself wrote with Guido Imbens on the topic of using polynomials in RD designs:

“We argue that estimators for causal effects based on such methods can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials or other smooth functions.”

From p.15 of the paper:

“We implement the RDD using two approaches: the global polynomial regression and the local linear regression”

They show that their results are similar in either specification.

The commenter made the seemingly reasonable point that, since the authors actually did use the model that Guido and I recommended, and it gave the same results as what they found under the controversial model, what was my problem?

What if?

To put it another way, what if the authors had done the exact same analyses but reported them differently, as follows:

– Instead of presenting the piecewise quadratic model as the main result and the local linear model as a side study, they could’ve reversed the order and presented the local linear model as their main result.

– Instead of graphing the fitted discontinuity curve, which looks so bad (see graphs above), they could’ve just presented their fitted model in tabular form. After all, if the method is solid, who needs the graph?

Here’s my reply.

First, I do think the local linear model is a better choice in this example than the global piecewise quadratic. There are cases where a global model makes a lot of sense (for example in pre/post-test situations such as predicting election outcomes given previous election outcomes), but not in this case, when there’s no clear connection at all between percentage vote for a union and some complicated measures of stock prices. So, yeah, I’d say ditch the global piecewise quadratic model, don’t even include it in a robustness check unless the damn referees make you do it and you don’t feel like struggling with the journal review process.

Second, had the researchers simply fit the local linear model without the graph, I wouldn’t have trusted their results.

Not showing the graph doesn’t make the problem go away, it just hides the problem. It would be like turning off the oil light on your car so that there’s one less thing for you to be concerned about.

This is a point that the commenter didn’t seem to realize: The graph is not just a pleasant illustration of the fitted model, not just some sort of convention in displaying regression discontinuities. The graph is central to the modeling process.

One challenge with regression discontinuity modeling (indeed, applied statistical modeling more generally) as it is commonly practiced is that it is unregularized (with coefficients estimated using some variant of least squares) and uncontrolled (lots of researcher degrees of freedom in fitting the model). In a setting where there’s no compelling theoretical or empirical reason to trust the model, it’s absolutely essential to plot the fitted model against the data and see if it makes sense.

I have no idea what the data and fitted local linear model would look like, and that’s part of the problem here. (The research article in question has other problems, notably regarding data coding and exclusion, choice of outcome to study, and a lack of clarity regarding the theoretical model and its connection to the statistical model, but here we’re focusing on the particular issue of the regression being fit. These concerns do go together, though: if the data were cleaner and the theoretical structure were stronger, this can inspire more trust in a fitted statistical model.)

Taking the blame

Examples in statistics and econometrics textbooks (my own included) are too clean. The data come in, already tidy, and then the model is fit, and it works as expected, and some strong and clear conclusion comes out. You learn research methods in this way, and you can expect this to happen in real life, with some estimate or hypothesis test lining up with some substantive question, and all the statistical modeling just being a way to make that connection. And you can acquire the attitude that the methods just simply work. In the above example, you can have the impression that if you do a local linear regression and a bunch of robustness tests, that you’ll get the right answer.

Does following the statistical rules assure you (probabilistically) that you will get the right answer? Yes, in some very simple settings such as clean random sampling and clean randomized experiments, where effects are large and the things being measured are exactly what you want to know. More generally, no. More generally, there are lots of steps connecting data, measurement, substantive theory, and statistical model, and no statistical procedure blindly applied (even with robustness checks!) will be enough on its own. It’s necessary to directly engage with data, measurement, and substantive theory. Graphing the data and fitted model is one part of this engagement, often a necessary part.

It’s a lot of pressure to write a book!

Regression and Other Stories is almost done, and I was spending a couple hours going through it starting from page 1, cleaning up imprecise phrasings and confusing points. . . .

One thing that’s hard about writing a book is that there are so many places you can go wrong. A 500-page book contains something like 1000 different “things”: points, examples, questions, etc.

Just for example, we have two pages on reliability and validity in chapter 2 (measurement is important, remember?). A couple of the things I wrote didn’t feel quite right, so I changed them.

And this got me thinking: any expert who reads our book will naturally want to zoom in on the part that he or she knows the most about, to check that we got things right. But with 1000 things, we’ll be making a few mistakes: some out-and-out errors and other places where we don’t explain things clearly and leave a misleading impression. It’s a lot of pressure to not want to get anything wrong.

We have three authors (me, Jennifer, and Aki), so that helps. And we’ve sent the manuscript to various people who’ve found typos, confusing points, and the occasional mistake. So I think we’re ok. But still it’s a concern.

I’ve reviewed a zillion books but only written a few. When I review a book, I notice its problems right away (see for example here and here). I’m talking about factual and conceptual errors, here, not typos. It’s not fun to think about being on the other side, to imagine a well-intentioned reviewer reading our book, going to a topic of interest, and being disappointed that we screwed up.

All I need is time, a moment that is mine, while I’m in between

You’re an ordinary boy and that’s the way I like it – Magic Dirt

Look. I’ll say something now, so it’s off my chest. I hate order statistics. I loathe them. I detest them. I wish them nothing but ill and strife. They are just awful. And I’ve spent the last god only knows how long buried up to my neck in them, like Jennifer Connelly forced into the fetid pool at the end of Phenomena.

It would be reasonable to ask why I suddenly have opinions about order statistics. And the answer is weird. It’s because of Pareto Smoothed Importance Sampling (aka PSIS aka the technical layer that makes the loo package work).

The original PSIS paper was written by Aki, Andrew, and Jonah. However there is a brand new sparkly version by Aki, Me, Andrew, Yuling, and Jonah that has added a pile of theory and restructured everything (the arXiv version will be updated soon). Feel free to read it. The rest of the blog post will walk you through some of the details.

What is importance sampling?

Just a quick reminder for those who don’t spend their life thinking about algorithms. The problem at hand is estimating the expectation I_h=\mathbb{E}(h(\theta)) for some function h when \theta\sim p(\theta).  If we could sample directly from p(\theta) then the Monte Carlo estimate of the expectation would be

\frac{1}{S}\sum_{s=1}^Sh(\theta_s),  where \theta_s\stackrel{\text{iid}}{\sim}p(\theta).

But in a lot of real life situations we have two problems with doing this directly: firstly it is usually very hard to sample from p(\theta). If there is a different distribution that we can sample from, say g, then we can use the following modification of the Monte Carlo estimator

I_h^S= \frac{1}{S}\sum_{s=1}^S\frac{p(\theta_s)}{g(\theta_s)}h(\theta_s),

 where \theta_s are iid draws from g(\theta). This is called an importance sampling estimator. The good news is that it always converges in probability to the true expectation. The bad news is that it is a random variable and it can have infinite variance.
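Here is a minimal sketch of the estimator above, under assumptions that are mine rather than the post's: the target p is a standard normal, the proposal g is a normal with standard deviation 2, and h(\theta)=\theta^2, so the true answer is 1. The helper names are made up for the illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def importance_sampling(h, p_pdf, g_pdf, g_draw, n_draws, rng):
    """Average of (p/g) * h over iid draws from the proposal g."""
    total = 0.0
    for _ in range(n_draws):
        theta = g_draw(rng)
        total += p_pdf(theta) / g_pdf(theta) * h(theta)
    return total / n_draws


rng = random.Random(1)  # fixed seed so the run is repeatable
is_estimate = importance_sampling(
    h=lambda t: t * t,                        # assumed test function
    p_pdf=lambda t: normal_pdf(t, 0.0, 1.0),  # assumed target p
    g_pdf=lambda t: normal_pdf(t, 0.0, 2.0),  # assumed proposal g
    g_draw=lambda r: r.gauss(0.0, 2.0),
    n_draws=50_000,
    rng=rng,
)
print(is_estimate)  # the true value of E[theta^2] under N(0, 1) is 1
```

The proposal here is wider than the target, so the ratios p/g are bounded and this is an easy case. Swapping the two standard deviations gives a proposal that is too narrow, which is where the trouble discussed in the rest of the post starts.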

The second problem is that often enough we only know the density p(\theta) up to a normalizing constant, so if f(\theta)\propto p(\theta), then the following self-normalized importance sampler is useful

I_h^S= \frac{\sum_{s=1}^Sr(\theta_s)h(\theta_s)}{\sum_{s=1}^Sr(\theta_s)},

where the importance ratios are defined as

r_s=r(\theta_s) = \frac{f(\theta_s)}{g(\theta_s)},

where again \theta_s\sim g. This will converge to the correct answer as long as \mathbb{E}(r_s)<\infty.  For the rest of this post I am going to completely ignore self-normalized importance samplers, but everything I’m talking about still holds for them.
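The self-normalized version can be sketched the same way. The setup is again an assumption for illustration only: f(\theta)=\exp(-\theta^2/2) is a standard normal density with its normalizing constant dropped, g is a normal with standard deviation 2, and h(\theta)=\theta^2.

```python
import math
import random


def self_normalized_is(h, f_unnorm, g_pdf, g_draw, n_draws, rng):
    """Ratio estimator: sum(r * h) / sum(r), with r = f / g."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_draws):
        theta = g_draw(rng)
        r = f_unnorm(theta) / g_pdf(theta)  # importance ratio r_s
        numerator += r * h(theta)
        denominator += r
    return numerator / denominator


def proposal_pdf(t, sd=2.0):
    """Assumed proposal g: normal with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (t / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


rng = random.Random(1)
snis_estimate = self_normalized_is(
    h=lambda t: t * t,
    f_unnorm=lambda t: math.exp(-0.5 * t * t),  # target known only up to a constant
    g_pdf=proposal_pdf,
    g_draw=lambda r: r.gauss(0.0, 2.0),
    n_draws=50_000,
    rng=rng,
)
print(snis_estimate)  # should again land near 1
```

The unknown constant cancels between numerator and denominator, which is the whole point of the construction.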

So does importance sampling actually work?

Well god I do hope so because it is used a lot. But there’s a lot of stuff to unpack before you can declare something “works”. (That is a lie, of course, all kinds of people are willing to pick a single criterion and, based on that occurring, declaring that it works. And eventually that is what we will do.)

First things first, an importance sampling estimator is a sum of independent random variables. We may well be tempted to say that, by the central limit theorem, it will be asymptotically normal. And sometimes that is true, but only if the importance weights have finite variance. This will happen, for example, if the proposal distribution g has heavier tails than the target distribution p.

And there is a temptation to stop there. To declare that if the importance ratios have finite variance then importance sampling works. That. Is. A. Mistake.

Firstly, this is demonstrably untrue in moderate-to-high dimensions. It is pretty easy to construct examples where the importance ratios are bounded (and hence have finite variance) but there is no feasible number of samples that would give small variance. This is a problem as old as time: just because the central limit theorem says the error will be around \sigma/\sqrt{S}, that doesn’t mean that \sigma won’t be an enormous number.
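One concrete case of bounded ratios with a hopeless variance, chosen for this illustration and not taken from the paper: let p be a standard normal in d dimensions and g a normal with independent coordinates of standard deviation \sigma>1. The ratio p/g is bounded by \sigma^d, and a direct Gaussian integral gives \mathbb{E}_g(r^2)=(\sigma^2/\sqrt{2\sigma^2-1})^d. The sketch below turns that into a rough count of draws needed to estimate \mathbb{E}_g(r)=1 to a given standard error.

```python
import math


def ratio_second_moment(dim, proposal_sd):
    """E_g[r^2] for p = N(0, I) and g = N(0, proposal_sd^2 I), proposal_sd > sqrt(1/2)."""
    s2 = proposal_sd ** 2
    return (s2 / math.sqrt(2.0 * s2 - 1.0)) ** dim


def draws_for_standard_error(dim, proposal_sd, standard_error):
    """Draws S such that sqrt(Var(r) / S) equals the requested standard error."""
    variance = ratio_second_moment(dim, proposal_sd) - 1.0
    return variance / standard_error ** 2


for dim in (1, 10, 100):
    print(dim, draws_for_standard_error(dim, proposal_sd=1.5, standard_error=0.01))
```

The variance is finite in every dimension, but it grows geometrically with d, so the central limit theorem is true and useless at the same time.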

And here’s the thing: we do not know \sigma and our only way to estimate it is to use the importance sampler. So when the importance sampler doesn’t work well, we may not be able to get a decent estimate of the error. So even if we can guarantee that the importance ratios have finite variance (which is really hard to do in most situations), we may end up being far too optimistic about the error.

Chatterjee and Diaconis recently took a quite different route to asking whether an importance sampler converges. They asked what minimum sample size is required to ensure, with high probability, that |I_h^S - I_h| is small. They showed that you need approximately \exp(\mathbb{E}[r_s \log(r_s)]) samples and this number can be large.  This quantity is also quite hard to compute (and they proposed another heuristic, but that’s not relevant here), but it is going to be important later.
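To get a feel for how large that number can be, note that when r_s is the normalized ratio p/g and the expectation is over draws from g, \mathbb{E}[r_s\log(r_s)] is the Kullback-Leibler divergence from p to g. In a toy case picked for this sketch (not from the paper), p is a standard normal in d dimensions and g is the same normal with every coordinate shifted by \delta, so the divergence is d\delta^2/2.

```python
import math


def kl_shifted_normals(dim, shift):
    """KL divergence between N(0, I) and N(shift * 1, I) in `dim` dimensions."""
    return 0.5 * dim * shift ** 2


def approx_required_draws(dim, shift):
    """Order-of-magnitude sample size exp(E[r log r]) for the toy case above."""
    return math.exp(kl_shifted_normals(dim, shift))


for dim in (1, 10, 100):
    print(dim, approx_required_draws(dim, shift=0.5))
```

A per-coordinate shift of half a standard deviation is harmless in one dimension and needs hundreds of thousands of draws in a hundred.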

Modifying importance ratios

So how do we make importance sampling more robust? A good solution is to somehow modify the importance ratios to ensure they have finite variance. Ionides proposed a method called Truncated Importance Sampling (TIS) where the importance ratios are replaced with truncated weights w_s=\min\{r_s,\tau_S\}, for some sequence of thresholds \tau_S\rightarrow\infty as S\rightarrow\infty.  The resulting TIS estimator is

I_h^S= \frac{1}{S}\sum_{s=1}^Sw_s h(\theta_s).


A lot of real estate in Ionides’ paper is devoted to choosing a good sequence of truncations. There’s theory to suggest that it depends on the tail of the importance ratio distribution. But the suggested choice of truncation sequence is \tau_S=C\sqrt{S}, where C is the normalizing constant of f, which is one when using ordinary rather than self-normalized importance sampling. (For the self-normalized version, Appendix B suggests taking C as the sample mean of the importance ratios, but the theory only works for deterministic truncations.)

This simple truncation guarantees that TIS is asymptotically unbiased, has finite variance that asymptotically goes to zero, and (with some caveats) is asymptotically normal.
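A minimal sketch of TIS with the \tau_S=C\sqrt{S} rule, where truncation means capping each ratio at \tau_S. The function name and the tiny hand-made input are illustrative assumptions; in practice the ratios and h values would come from draws from g.

```python
import math


def truncated_is(ratios, h_values, c=1.0):
    """TIS estimate: cap each ratio at tau_S = c * sqrt(S), then average w * h."""
    n_draws = len(ratios)
    tau = c * math.sqrt(n_draws)
    weights = [min(r, tau) for r in ratios]  # large ratios are pulled down to tau
    estimate = sum(w * h for w, h in zip(weights, h_values)) / n_draws
    return estimate, weights


# Four made-up ratios, one of them wildly large; tau_S = sqrt(4) = 2.
tis_estimate, tis_weights = truncated_is(
    ratios=[0.5, 1.0, 100.0, 0.5],
    h_values=[1.0, 1.0, 1.0, 1.0],
)
print(tis_weights)   # [0.5, 1.0, 2.0, 0.5]
print(tis_estimate)  # 1.0
```

Every ratio above the threshold ends up with the same weight, which is exactly the feature that PSIS tries to improve on.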

But, as we discussed above, none of this actually guarantees that TIS will work for a certain problem. (It does work asymptotically for a vast array of problems and does a lot better than the ordinary importance sampler, but no simple truncation scheme can overcome a poorly chosen proposal distribution. And most proposal distributions in high dimensions are poorly chosen.)

Enter Pareto-Smoothed Importance Sampling

So a few years ago Aki and Andrew worked on an alternative to TIS that would make things even better. (They originally called it the “Very Good Importance Sampling”, but then Jonah joined the project and ruined the acronym.) The algorithm they came up with was called Pareto-Smoothed Importance Sampling (henceforth PSIS, the link is to the three author version of the paper).

They noticed that TIS basically replaces all of the large importance ratios with a single value \tau_S. Consistent with both Aki and Andrew’s penchant for statistical modelling, they thought they could do better than that (Yes. It’s the Anna Kendrick version. Deal.)

PSIS is based on the idea that, while using the same value for each extreme importance ratio works, it would be even better to model the distribution of extreme importance ratios! The study of distributions of extremes of independent random variables has been an extremely important (and mostly complete) part of statistical theory. This means that we know things.

One of the key facts of extreme value theory is that the distribution of ratios larger than some sufficiently large threshold u  approximately has a generalized Pareto distribution (gPd). Aki, Andrew, and Jonah’s idea was to fit a generalized Pareto distribution to the M largest importance ratios and replace the upper weights with appropriately chosen quantiles of the fitted distribution. (Some time later, I was very annoyed they didn’t just pick a deterministic threshold, but this works better even if it makes proving things much harder.)
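The smoothing step can be sketched as follows. This is a rough illustration, not the algorithm as implemented in the paper or in loo: the gPd scale and shape are passed in as if they had already been fitted, the threshold is taken to be the largest ratio outside the tail, and the quantiles are evaluated at the plotting positions (m-1/2)/M. All three choices are assumptions of the sketch. The gPd quantile function itself is standard.

```python
import math


def gpd_quantile(prob, u, sigma, k):
    """Quantile function of a generalized Pareto with location u, scale sigma, shape k."""
    if abs(k) < 1e-12:
        return u - sigma * math.log1p(-prob)  # exponential limit as k -> 0
    return u + sigma / k * ((1.0 - prob) ** (-k) - 1.0)


def smooth_largest_ratios(ratios, n_tail, sigma, k):
    """Replace the n_tail largest ratios by quantiles of an (assumed) fitted gPd."""
    ordered = sorted(ratios)
    n_draws = len(ordered)
    if not 0 < n_tail < n_draws:
        raise ValueError("n_tail must be between 1 and len(ratios) - 1")
    u = ordered[n_draws - n_tail - 1]        # assumed threshold: largest non-tail ratio
    smoothed = ordered[: n_draws - n_tail]   # the bulk of the ratios is left alone
    for m in range(1, n_tail + 1):
        prob = (m - 0.5) / n_tail            # assumed plotting positions
        smoothed.append(gpd_quantile(prob, u, sigma, k))
    return smoothed


raw = [float(x) for x in range(1, 11)]
print(smooth_largest_ratios(raw, n_tail=3, sigma=1.0, k=0.5))
```

Unlike truncation, the largest weights stay distinct and ordered; they are just pulled onto a smooth tail instead of being left at whatever the sample happened to produce.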

They learnt a few things after extensive simulations. Firstly, this almost always does better than TIS (the one example where it doesn’t is example 1 in the revised paper). Secondly, the gPd has two parameters that need to be estimated (the third parameter is an order statistic of the sample. ewwwww). And one of those parameters is extremely useful!

The shape parameter (or tail parameter) of the gPd, which we call k, controls how many moments the distribution has. In particular, a distribution whose upper tail limits to a gPd with shape parameter k has at most k^{-1} finite moments. This means that if k<1/2 then an importance sampler will have finite variance.

But we do not have access to the true shape parameter. We can only estimate it from a finite sample, which gives us \hat{k}, or, as we constantly write, “k-hat”. The k-hat value has proven to be an extremely useful diagnostic in a wide range of situations. (I mean, sometimes it feels that every other paper I write is about k-hat. I love k-hat. If I was willing to deal with voluntary pain, I would have a k-hat tattoo. I once met a guy with a nabla tattooed on his lower back, but that’s not relevant to this story.)

Aki, Andrew, and Jonah’s extensive simulations showed something that may well have been unexpected: the value of k-hat is a good proxy for the quality of PSIS. (Also TIS, but that’s not the topic). In particular, if k-hat was bigger than around 0.7 it became massively expensive to get an accurate estimate. So we can use k-hat to work out if we can trust our PSIS estimate.
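To show the diagnostic in action without reproducing the paper's estimator, the sketch below swaps in the classical Hill estimator of the tail shape. That substitution is an assumption of this example; it is not how k-hat is computed in the paper or in loo. The "ratios" are simulated from an exact Pareto distribution with tail index 2, so the true shape is k=1/2.

```python
import math
import random


def hill_k_hat(ratios, n_tail):
    """Hill estimate of the tail shape from the n_tail largest values."""
    ordered = sorted(ratios)
    threshold = ordered[-n_tail - 1]  # largest value outside the tail
    tail = ordered[-n_tail:]
    return sum(math.log(x / threshold) for x in tail) / n_tail


rng = random.Random(2)
# Stand-in importance ratios: Pareto with tail index 2, i.e. true k = 0.5.
simulated_ratios = [rng.paretovariate(2.0) for _ in range(20_000)]

k_hat = hill_k_hat(simulated_ratios, n_tail=200)
print(k_hat)
print("looks usable" if k_hat < 0.7 else "do not trust this estimate")
```

With the tail index lowered toward 1 (true k approaching 1), the same check starts raising the flag, which is the behaviour the 0.7 rule of thumb relies on.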

PSIS ended up as the engine driving the loo package in R, which last time I checked had around 350k downloads from the RStudio CRAN mirror. It works for high-dimensional problems and can automatically assess the quality of an importance sampler proposal for a given realization of the importance weights.

So PSIS is robust, reliable, useful, has R and Python packages, and the paper was full of detailed computational experiments that showed that it was robust, reliable, and useful even for high dimensional problems. What could possibly go wrong?

What possibly went wrong


It works, but where is the theory?

I wasn’t an author so it would be a bit weird for me to do a postmortem on the reviews of someone else’s paper. But one of the big complaints was that Aki, Andrew, and Jonah had not shown that PSIS was asymptotically unbiased, had finite vanishing variance, or that it was asymptotically normal.

(Various other changes of emphasis or focus in the revised version are possibly also related to reviewer comments from the previous round, but also to just having more time.)

These things turn out to be tricky to show. So Aki, Andrew, and Jonah invited me and Yuling along for the ride.

The aim was to restructure the paper, add theory, and generally take a paper that was very good and complete and add some sparkly bullshit. So sparkly bullshit was added. Very slowly (because theory is hard and I am not good at it).

Justifying k-hat < 0.7

Probably my favourite addition to the paper is due to Yuling, who read the Chatterjee and Diaconis  paper and noticed that we could use their lower bound on sample size to justify k-hat. The idea is that it is the tail of r_s that breaks the importance sampler. So if we make the assumption that the entire distribution of r_s is generalized Pareto with shape parameter k, we can actually compute the minimum sample size for a particular accuracy from ordinary importance sampling. This is not an accurate sample size calculation, but should be ok for an order-of-magnitude calculation.

The first thing we noticed is that, consistent with the already existing experiments, the error in importance sampling (and TIS and PSIS) increases smoothly as k passes 0.5 (in particular the finite-sample behaviour does not fall off a cliff the moment the variance isn’t finite). But the minimum sample size starts to increase very rapidly as soon as k gets bigger than about 0.7. This is consistent with the experiments that originally motivated the 0.7 threshold and suggests (at least to me) that there may be something fundamental going on here.

We can also use this to justify the threshold on k-hat as follows. The method Aki came up with for estimating k-hat is (approximately) Bayesian, so we can interpret the k-hat as a value selected so that the data is consistent with M independent samples from a gPd with shape parameter k-hat. So a k-hat value that is bigger than 0.7 can be interpreted loosely as saying that the extreme importance ratios could have come from a distribution that has a tail that is too heavy for PSIS to work reliably.

This is what actually happens in high dimensions (for an example we have that has bounded ratios and hence finite variance). With a reasonable sample size, the estimator for k-hat simply cannot tell that the distribution of extreme ratios has a large but finite variance rather than an infinite variance. And this is exactly what we want to happen! I have no idea how to formalize this intuition, but nevertheless it works.

So order statistics

It turned out that–even though it is quite possible that other people would not have found proving unbiasedness and finite variance hard–I found it very hard. Which is quite annoying because the proof for TIS was literally 5 lines.

What was the trouble? Aki, Andrew, and Jonah’s decision to choose the threshold as the Mth largest importance ratio. This means that the threshold is an order statistic and hence is not independent of the rest of the sample. So I had to deal with that.

This meant I had to read an absolute tonne of papers about order statistics. These papers are dry and technical and were all written between about 1959 and 1995 and at some later point poorly scanned and uploaded to JSTOR. And they rarely answered the question I wanted them to. So basically I am quite annoyed with order statistics.

But the end point is that, under some conditions, PSIS is asymptotically unbiased and has finite, vanishing variance.

The conditions are a bit weird, but are usually going to be satisfied.  Why are they weird? Well…

PSIS is TIS with an adaptive threshold and bias correction

In order to prove asymptotic properties of PSIS, I used the following representation of the PSIS estimator

I_h^S= \frac{1}{S}\sum_{s=1}^S\min\left\{r(\theta_s),r(\theta_{(S-M+1):S})\right\}h(\theta_s)+\frac{1}{S}\sum_{m=1}^M\tilde{w}_mh(\theta_{S-M+m}),

where the samples \theta_s have been ordered so that r(\theta_1)\leq r(\theta_2)\leq\ldots\leq r(\theta_S) and the weights \tilde{w}_m are deterministic (and given in the paper). They are related to the quantile function for the gPd.

The first term is just TIS with random threshold \tau_S=r(\theta_{(S-M+1):S}), while the second term is an approximation to the bias. So PSIS has higher variance than TIS (because of the random truncation), but lower bias (because of the second term) and this empirically usually leads to lower mean-square error than TIS.

But that random truncation is automatically adapted to the tail behaviour of the importance ratios, which is an extremely useful feature!

This representation also gives hints as to where the ugly conditions come from. Firstly, anything that is adaptive is much harder to prove things about than a non-adaptive method, and the technical conditions that we need to be able to adapt our non-adaptive proof techniques are often quite esoteric. The idea of the proof is to show that, conditional on r(\theta_{(S-M+1):S})=U, all of the relevant quantities go to zero (or are finite) with some explicit dependence on U. The proof of this is very similar to the TIS proof (and would be exactly the same if the second term wasn’t there).

Then we need to let U vary and hope it doesn’t break anything. The technical conditions can be split into the ones needed to ensure r(\theta_{(S-M+1):S})=U behaves itself as S gets big; the ones needed to ensure that h(\theta) doesn’t get too big when the importance ratios are large; and the ones that control the last term.

Going in reverse order, to ensure the last term is well behaved we need that h is square-integrable with respect to the proposal g in addition to the standard assumption that it’s square-integrable with respect to the target p.

We need to put growth conditions on h because we are only modifying the ratios, which does not help if h is also enormous out in the tails. These conditions are actually very easy to satisfy for most problems I can think of, but almost certainly there’s someone out there with an h that grows super-exponentially just waiting to break PSIS.

The final conditions are just annoying. They are impossible to verify in practice, but there is a 70 year long literature that coos reassuring phrases like “this almost always holds” into our ears. These conditions are strongly related to the conditions needed to estimate k correctly (using something like the Hill estimator). My guess is that these conditions are not vacuous, but are relatively unimportant for finite samples, where the value of k-hat should weed out the catastrophic cases.

What’s the headline

With some caveats, PSIS is asymptotically unbiased; has finite, vanishing variance; and a variant of it is asymptotically normal as long as the importance ratios have more than (1+\delta)-finite moments. But it probably won’t be useful unless it has at least 1/0.7 = 1.43 moments.

And now we send it back off into the world and see what happens

The garden of 603,979,752 forking paths

Amy Orben and Andrew Przybylski write:

The widespread use of digital technologies by young people has spurred speculation that their regular use negatively impacts psychological well-being. Current empirical evidence supporting this idea is largely based on secondary analyses of large-scale social datasets. Though these datasets provide a valuable resource for highly powered investigations, their many variables and observations are often explored with an analytical flexibility that marks small effects as statistically significant . . . we address these methodological challenges by applying specification curve analysis (SCA) across three large-scale social datasets . . . to rigorously examine correlational evidence for the effects of digital technology on adolescents. The association we find between digital technology use and adolescent well-being is negative but small, explaining at most 0.4% of the variation in well-being. Taking the broader context of the data into account suggests that these effects are too small to warrant policy change.

They continue:

SCA is a tool for mapping the sum of theory-driven analytical decisions that could justifiably have been taken when analysing quantitative data. Researchers demarcate every possible analytical pathway and then calculate the results of each. Rather than reporting a handful of analyses in their paper, they report all results of all theoretically defensible analyses . . .

Here’s the relevant methods paper on specification curve analysis, by Uri Simonsohn, Joseph Simmons, and Leif Nelson, which seems similar to what Sara Steegen, Francis Tuerlinckx, Wolf Vanpaemel and I called the multiverse analysis.

It makes sense that a good idea will come up in different settings with some differences in details. Forking paths in methodology as well as data coding and analysis, one might say.

Anyway, here’s what Orben and Przybylski report:

Three hundred and seventy-two justifiable specifications for the YRBS, 40,966 plausible specifications for the MTF and a total of 603,979,752 defensible specifications for the MCS were identified. Although more than 600 million specifications might seem high, this number is best understood in relation to the total possible iterations of dependent (six analysis options) and independent variables (2^24 + 2^25 – 2 analysis options) and whether co-variates are included (two analysis options). . . . The number rises even higher, to 2.5 trillion specifications, for the MCS if any combination of co-variates (2^12 analysis options) is included.

Given this, and to reduce computational time, we selected 20,004 specifications for the MCS.

I love it that their multiverse was so huge they needed to drastically prune it by only including 20,000 analyses.
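As a quick check on the arithmetic in the quoted passage, the headline count is just the product of the option counts. The sketch reads the independent-variable count as powers of two, 2^24 + 2^25 - 2; that reading is an assumption, but it reproduces the published total exactly.

```python
# Option counts as quoted from Orben and Przybylski for the MCS dataset.
dependent_options = 6
independent_options = 2 ** 24 + 2 ** 25 - 2  # assumed reading of the quoted count
covariate_options = 2                        # co-variates included or not

total_specifications = dependent_options * independent_options * covariate_options
print(f"{total_specifications:,}")  # 603,979,752
```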

How did they choose this particular subset?

We included specifications of all used measures per se, and any combinations of measures found in the previous literature, and then supplemented these with other randomly selected combinations. . . . After noting all specifications, the result of every possible combination of these specifications was computed for each dataset.

I wonder if they could’ve found even more researcher degrees of freedom by considering rules for data coding and exclusion, which is what we focused on in our multiverse paper. (I’m also thinking of the article discussed the other day that excluded all but 687 out of 5342 observations.)

Ultimately I think the right way to analyze this sort of data is through a multilevel model, not a series of separate estimates and p-values.

But I do appreciate that they went to the trouble to count up 603,979,752 paths. This is important, because I think a lot of people don’t realize the weakness of many published claims based on p-values (an issue we discussed in a recent comment thread here, when Ethan wrote: “I think lots of what’s discussed on this blog and a cause of common lay errors in probability comes down to, ‘It’s tempting to believe that you can’t get all of this just by chance, but you can.’”).

Book Review: Good to Go, by Christine Aschwanden

This is a book review. It is by Phil Price. It is not by Andrew.

The book is Good To Go: What the athlete in all of us can learn from the strange science of recovery. By Christine Aschwanden, published by W.W. Norton and Company. The publisher offered a copy to Andrew to review, and Andrew offered it to me as this blog’s unofficial sports correspondent.

tldr: This book argues persuasively that when it comes to optimizing the recovery portion of the exercise-recover-exercise cycle, nobody knows nuthin’ and most people who claim to know sumthin’ are wrong. It’s easy to read and has some nice anecdotes. Worth reading if you have a special interest in the subject, otherwise not. Full review follows.

The book is about ‘recovery’. In the context of the book, recovery is what you do between bouts of exercise; or, if you prefer, exercise is what you do between periods of recovery. The book has great blurbs. “A tour de force of great science journalism”, writes Nate Silver (!). “…a definitive tour through a bewildering jungle of scientific and pseudoscientific claims…”, writes David Epstein. “…Aschwanden makes the mind-boggling world of sports recovery a hilarious adventure”, says Olympic gold medal skier Jessie Diggins. With blurbs like these I was expecting a lot…although once I realized Aschwanden works at FiveThirtyEight, I downweighted the Silver blurb appropriately. Even so, I expected too much: the book is fine but ultimately rather unsatisfying. It is fairly interesting and sometimes amusing, but there’s only so much any author can do with the subject given the current state of knowledge, which is this: other than getting enough sleep and eating enough calories, nobody knows for sure what helps athletes recover between events or training sessions better than just living a normal life. The book is mostly just 300 pages of elucidating and amplifying that disappointing state of knowledge.

The author, Aschwanden, went to a lot of trouble, conducting hundreds of interviews, reading hundreds of scientific or quasi-scientific or pseudo-scientific papers, and in some cases subjecting herself to treatments in the interest of journalism (a sensory deprivation tank! Tom Brady’s magic pajamas! A cryogenic chamber!…) If the subject of athletic recovery is especially interesting to you then hey, it’s a fine book, plenty of good stuff in there, $30 well spent for two or three hours of information and amusement.

For readers of this blog — and maybe for everybody — the first couple of chapters are the best ones, because they provide some insights that can apply to many areas of science and statistical analysis. The first chapter explains what happened when Aschwanden became interested in whether beer is good, bad, or indifferent as a ‘recovery drink.’ She has a friend who was a researcher at a lab that researches human performance and when she brought the question to him he was enthusiastic about studying this issue, so they did. They designed and performed a study that is typical (all too typical) of studies that address this kind of issue: only 10 participants, with tests spanning a couple of days. Do some hard exercise, then drink regular beer or non-alcoholic beer. The next day “run to exhaustion” (following a standard protocol) and afterwards drink whichever beverage you didn’t drink the previous day. The next day, run to exhaustion again. Quantify the time to run to exhaustion at the specified level of effort. The study found no ‘statistically significant’ difference between real beer and fake beer for the contestants as a whole, or for male participants, but for women there was a statistically significant difference, with performance better after real beer! And for men there was a difference large enough to be substantively important if true, but not statistically significant. Fortunately, Aschwanden is no dummy. She doesn’t mention the ‘garden of forking paths’, but does recognize some other major methodological problems with the study. As she puts it: “There was only one problem: I didn’t believe it. Trust me — I wanted our study to show that beer was great for runners, really, I did. 
Yet my experience as a participant… left me feeling skeptical of our result, and the episode helped me understand and recognize some pitfalls that I’ve found to be common among sports performance studies.” And then she gives a few paragraphs that do a great job of illustrating why it is really hard to get objective measures of human performance for a study like this, and why it matters. The upshot is that in this study the researchers are fitting noise. And the problems that came up in this study are common, indeed nearly ubiquitous, in this sort of research. Disappointingly, even this chapter doesn’t show any data or any hard numbers. There’s not a plot or table in the book.
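To see how easily a study like this can end up fitting noise, here is a minimal simulation sketch. It assumes things the book summary doesn't give us: a 5/5 split between men and women, paired differences that are pure standard-normal noise (no beer effect at all), and a paired t-test at the 5% level run three ways (everyone, men only, women only). The function names are made up for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for df = 4 and df = 9.
T_CRIT = {4: 2.776, 9: 2.262}


def significant(diffs):
    """Paired t-test on within-person differences at the 5% level."""
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return abs(t) > T_CRIT[n - 1]


def any_significant_rate(n_sims=20000, seed=1):
    """Share of pure-noise studies where at least one of the three
    comparisons (all, men, women) comes out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Assumed split: 5 men and 5 women, true effect exactly zero.
        men = [rng.gauss(0, 1) for _ in range(5)]
        women = [rng.gauss(0, 1) for _ in range(5)]
        if significant(men + women) or significant(men) or significant(women):
            hits += 1
    return hits / n_sims


print(any_significant_rate())
```

With only three looks at the data, the chance of at least one "finding" is already well above the nominal 5%, and a real analysis has many more forks available than that.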

The second chapter discusses hydration (and over-hydration), starting off with a discussion of the creation and marketing of Gatorade and going on from there. As with every chapter, Aschwanden mixes anecdotes, history, and results from scientific studies, and pulls everything together with her own evaluation. It’s a good formula and makes for a readable book. The hydration chapter is typical in that it illustrates the extent to which marketing and a smattering of scientific research led to a widespread perception among athletes that later turned out either not to be true or to be more nuanced than was first thought. In fact, according to Aschwanden, and backed up by many studies she cites, our bodies can tolerate moderate dehydration with very little problem, contrary to what many athletes and coaches have believed over the past thirty years or so, and optimal hydration for many athletes and many activities turns out to involve a lot less drinking than most people (including most athletes and coaches) thought for decades. And it’s probably better to be rather dehydrated than to be rather over-hydrated.

I can’t resist adding my own little hydration story. A couple of years ago, on a very hot day I rode my bike on a hilly route to our local mountain (Mount Diablo), rode up it and back down, stopped at the bottom for food, and then rode back home. The ride was about 100 miles and the temperature was in the high nineties. Each time I stopped for water, I filled and chugged one of my water bottles, then filled both of them and continued on, draining both bottles by the time I got to the next water stop. Knowing the capacity of my bottles and the number of times I stopped, it’s easy to count how much I drank that day. I also had a large milkshake and a coke at my lunch stop, as well as something like a pound of food. On that day I drank 17 pounds of fluid. I weighed myself when I got home and found that I had lost 8 pounds. I had not urinated during the day, and didn’t do so for several hours after I got home. What’s the point of telling you this? I dunno; I just think it’s really interesting. In one long day I sweated or exhaled more than 25 pounds of water! I still find it hard to believe, although it does jibe with one of Gatorade’s early marketing campaigns, which promoted the idea that athletes should drink 40 ounces per hour, and not necessarily on a brutally hot day. But Aschwanden has both anecdotes and studies in which successful athletes drank much less, and about some athletes getting in bad medical trouble by drinking too much. The point isn’t that endurance athletes shouldn’t drink, it’s that they shouldn’t obsess about drinking as long as they don’t get too thirsty. Aschwanden says it has long been conventional wisdom that in an athletic event you should drink before you’re thirsty, and drink enough that you never become thirsty, but there’s actually no evidence that that leads to better performance than simply drinking when you feel like it.
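For anyone who wants to check the 25-pound figure, it's just a mass balance: whatever went in and didn't show up on the scale must have gone out. Here is a minimal sketch; the function name is made up, and treating the pound of food as if it all left the body is a simplifying assumption that gives an upper-end figure.

```python
def water_lost_lb(fluid_in_lb, food_in_lb, weight_change_lb, urine_out_lb=0.0):
    """Mass balance: weight_change = intake - (sweat and breath + urine),
    so sweat and breath = intake - weight_change - urine."""
    return fluid_in_lb + food_in_lb - weight_change_lb - urine_out_lb


# Fluids only: 17 lb in, 8 lb lighter at the end, no urination.
fluids_only = water_lost_lb(17, 0, -8)

# Also counting the roughly one pound of food (assumption: none of it retained).
with_food = water_lost_lb(17, 1, -8)

print(fluids_only, with_food)
```

The fluids-only number is the floor, which is where "more than 25 pounds" comes from.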

Another chapter covers the current fad for ice baths, cryogenic chambers, ice-water compression boots, and so on. No real evidence they help, no real evidence they hurt.

Another chapter covers the current fad for infrared treatments (heat baths, saunas, ‘infrared’ saunas, Tom Brady’s magic thermal underwear, etc.) No real evidence they help, no real evidence they hurt. Oh, and not only have the claims about thermal underwear not been evaluated by the Food and Drug Administration, they’ve apparently never been evaluated by a physicist either, because they’re ridiculous. If you buy the underwear you deserve to be mocked, and you should be. If no one else will do it for you, send me an email and I’ll do the mocking.

Massage? No real evidence it helps, no real evidence it hurts. That said, I intend to continue to get occasional massages from my next door neighbor, Cyrus Poitier, who is an elite sports masseur. He travels with the men’s national wrestling team and the women’s swim team, and is one of the US Olympic Team’s masseurs. Like most of Cyrus’s clients, I don’t go to Cyrus for feel-good massages — in fact they are usually quite painful — but instead I go when I have some soreness or tightness that I haven’t been able to get rid of on my own, and I do think his massages help. But do they really, in the sense of helping me perform better athletically, and, if so, how much? According to Aschwanden there’s no evidence, or only weak evidence, that they help at all. But I would swear they help me! And he has many elite athletes as clients. So are all of us wrong? Well, maybe we are, or maybe we’re right that the massages help but the effect is rather small. Or maybe they help the performance of those of us with some musculo-skeletal issues but harm the performance of people with other issues. The right way to answer this is with data, and according to Aschwanden the existing data aren’t adequate to the task.

Every ‘recovery modality’ in the book has a bunch of proponents, including some elite athletes who swear by it. Every one of the modalities has a bunch of individuals or companies promoting it and telling people it works, usually buttressed by questionable studies like Aschwanden’s beer study.  And just about every one of the recovery methods or substances has some skeptics who think it’s all hype.

And ultimately that’s the problem with Aschwanden’s book, though it’s not her fault: at the moment it’s impossible to know what works, and how well. She says this herself, towards the end of the book: “After exploring a seemingly endless array of recovery aids, I’ve come to think of them as existing on a sort of evidence continuum. At one end you’ve got sleep — the most potent recovery tool ever recovered (and one that money can’t buy). At the other end lies a pile of faddish products like hydrogen water and oxygen inhalers, which an ounce of common sense can tell you are mostly useless… Most things, however, lie somewhere in the vast middle — promising but unproven.” For someone like me, that’s a good reason to ignore just about all of the unproven stuff: even if something would improve my performance fairly substantially — let’s say a 5% increase in speed on my hardest bike rides — that wouldn’t change my life in a noticeable way. But for a competitive athlete, even 0.5% could be the difference between a gold medal and being off the podium, or being a pro vs an amateur who never quite breaks through. So there are always going to be people promoting this stuff, and there will always be athletes willing to give it a try.

Although firm conclusions about effectiveness are hard to come by, there’s plenty of interesting stuff in the book. For example, one of the many anecdotes concerns sprinter Usain Bolt. At the 2008 Olympics in Beijing, Bolt wasn’t happy with any of the unfamiliar food available to him at the athletes’ cafeteria, so he went to McDonald’s and ate Chicken McNuggets. Every day. For lunch and dinner. (He also ate a small amount of greens drenched in salad dressing.) According to Bolt’s memoir, he ate about 100 nuggets every 24 hours, adding up to about 1000 chicken nuggets over the course of the ten days he competed in the 100m, 200m, and 4x100m relay (with multiple heats in each, plus the finals). He won gold medals in all of them. As Aschwanden says, “Those chicken nuggets were adequate, if not ideal, fuel to power him through his nine heats, and to help him recover his energy in between them. Feeling satiated and not worrying about gastrointestinal issues are surely worth a lot to an athlete preparing for his most important events of the season. Would Bolt have performed better eating some other recovery foods? Maybe. The better question is: How much difference would it make?”

By the way, a popular saying among the kind of people who read this blog is “The plural of ‘anecdote’ is not ‘data.’” I liked that saying too, the first time I heard it, but the more I think about it the less I agree with it. Of course it’s literally true that ‘data’ is not the plural of ‘anecdote’, since the plural of ‘anecdote’ is ‘anecdotes.’ But each (true) anecdote does provide a data point of sorts. A sprinter won three gold medals on a diet consisting almost entirely of Chicken McNuggets, and his 100m time was a world record even though he didn’t run all the way through the tape. That really does set an upper limit on how deleterious a week of Chicken McNugget consumption is, at least to Usain Bolt. As far as data go, that anecdote is probably more informative than the quantitative results of Aschwanden’s 10-participant beer study, no matter how carefully the study was conducted.

One of the good things about Aschwanden’s book is that she puts the pieces together for us. She’s smart, she’s a former elite athlete herself (a professional cross-country skier), she talked to hundreds of people, she read lots of scientific studies, and she formed well-informed beliefs about everything she writes about. Only a tiny portion of those interviews and studies can fit in the book, but I trust her judgment well enough to think I’d probably reach most of the same conclusions she did, so I appreciate the fact that she does summarize her beliefs. A few key ones are: (1) ‘Recovery’ involves both mind and body, and stress of all kinds (physical, mental, and emotional) hurts recovery of both mind and body. (2) Sleep is especially important to recovery; relaxation is too. If an athlete’s recovery routine is itself a source of stress, it’s counterproductive. (3) Under-eating is bad, and is worse than eating non-optimally. (4) The timing of food intake is unimportant unless you have a short break between events. If you finish an event and you have another one in a few hours, eating the right thing at the right time is critical. But if you aren’t competing again for 24 hours or more, there is no ‘nutrition window’, there’s a nutrition ‘barn door’, in the words of one researcher she quotes. (5) Other than getting enough sleep and enough relaxation, and eating enough to replenish glycogen supplies and calories in time for your next event, nearly nothing else is definitively known to be beneficial compared to just living an ordinary life between events. (6) Overtraining is a real thing, with both physical and mental components, and overtraining can be worse than undertraining.
(7) With regard to specific ‘recovery modalities’: Massage might or might not help; ice baths might or might not help (and in fact might harm recovery a little); various food supplements might or might not help; heat in various forms might or might not help; ibuprofen and other anti-inflammatories probably do a little physical harm in most people, but most athletes refuse to believe it; stretching probably doesn’t help most people. (8) Different things work differently for different people, so following the same recovery routine as your sports idol might not work for you. (9) Some recovery methods, maybe a lot of them, really do help some people simply due to the ‘placebo effect’, and there’s nothing wrong with that: if it helps, it helps.

If any of these points seem odd or wrong or questionable to you, then I suggest reading the book, because Aschwanden explains why she has adopted her viewpoint. If you agree with all of them but want support for them, that’s another reason to read the book. If you agree with them all, shrug, and say “yeah, that’s pretty much what I figured” then you can skip the book unless you want some interesting stories like the one about Bolt.

AnnoNLP conference on data coding for natural language processing

This workshop should be really interesting:

Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.

I really like the problem of crowdsourcing, especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).

The rest of this note’s a bit more personal, but I’d very much like to see others adopting similar plans for the future for data curation and application.

The past

Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.

As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.

Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
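To make the marginalization concrete, here is a minimal sketch of the per-item likelihood in a Dawid and Skene style model with the discrete true label summed out. This is plain Python rather than Stan, and the names (`prevalence`, `confusion`, `labels`) are illustrative choices, not taken from any of the papers. It assumes all probabilities are strictly positive so the logs are defined.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def item_log_terms(prevalence, confusion, labels):
    """For each candidate true category k, return
    log p(z = k) + sum over annotations of log p(y | z = k).

    prevalence[k]      : prior probability that the true category is k
    confusion[j][k][y] : probability annotator j reports y when truth is k
    labels             : list of (annotator index, reported label) pairs
    """
    terms = []
    for k, p_k in enumerate(prevalence):
        lp = math.log(p_k)
        for annotator, label in labels:
            lp += math.log(confusion[annotator][k][label])
        terms.append(lp)
    return terms


def item_log_likelihood(prevalence, confusion, labels):
    """Marginal log likelihood of one item's labels: the true category
    is summed out, so only continuous parameters remain."""
    return log_sum_exp(item_log_terms(prevalence, confusion, labels))


def true_label_posterior(prevalence, confusion, labels):
    """Posterior over the true category, recovered after the fact by
    normalizing the same per-category terms."""
    terms = item_log_terms(prevalence, confusion, labels)
    total = log_sum_exp(terms)
    return [math.exp(t - total) for t in terms]
```

The point of the design is that the sampler (or EM) only ever sees `item_log_likelihood`, which is smooth in the continuous parameters, while the discrete labels can still be recovered from `true_label_posterior` when you want them.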

The present

Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.

Editorial Aside: Every field should move to journals like TACL. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.

The future

I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.

P.S. Cultural consensus theory

I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.

I recently discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed out the literature). The authors go more into the philosophical underpinnings of the notion of consensus driving these models (basically the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of truth: you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say experts, Mechanical Turkers, and undergrads), but not of the truth.

In yet another small world phenomenon, right after I discovered cultural consensus theory, I saw a cello concert organized through Groupmuse by a social scientist at NYU I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside she was the daughter of well known social scientists. Not just any social scientists, the developers of cultural consensus theory!

We should be open-minded, but not selectively open-minded.

I wrote this post awhile ago but it just appeared . . .

I liked this line so much I’m posting it on its own:

We should be open-minded, but not selectively open-minded.

This is related to the research incumbency effect and all sorts of other things we’ve talked about over the years.

There’s a Bayesian argument, or an implicitly Bayesian argument, for believing everything you read in scientific journals, and the argument goes as follows: It’s hard to get a paper published, papers in peer-reviewed journals typically really do go through the peer review process, so the smart money is to trust the experts.

This believe-what-you-read heuristic is Bayesian, but not fully Bayesian: it does not condition on new information. The argument against Brian Wansink’s work is not that it was published in the journal Environment and Behavior. The argument against it is that the work has lots of mistakes, and then you can do some partial pooling, looking at other papers by this same author that had lots of mistakes.

Asymmetric open-mindedness—being open to claims published in scientific journals and publicized on NPR, Ted, etc., while not at all being open to their opposites—is, arguably, a reasonable position to take. But this position is only reasonable before you look carefully at the work in question. Conditional on that careful look, the fact of publication provides much less information.

To put it another way, defenders of junk science, and even people who might think of themselves as agnostic on the issue, are making the fallacy of the one-sided bet.

Here’s an example.

Several years ago, the sociologist Satoshi Kanazawa claimed that beautiful parents were more likely to have girl babies. This claim was reproduced by the Freakonomics team. It turns out that the underlying statistical analysis was flawed, and what was reported was essentially patterns in random numbers (the kangaroo problem).

So, fine. At this point you might say: Some people believe that beautiful parents are more likely to have girl babies, while other people are skeptical of that claim. As an outsider, you might take an intermediate position (beautiful parents might be more likely to have girl babies), and you could argue that Kanazawa’s work, while flawed, might still be valuable by introducing this hypothesis.

But that would be a mistake; you’d be making the fallacy of the one-sided bet. If you want to consider the hypothesis that beautiful parents are more likely to have girl babies, you should also consider the hypothesis that beautiful parents are more likely to have boy babies. If you don’t consider both possibilities, you’re biasing yourself—and you’re also giving an incentive for future Wansinks to influence policy through junk science.

P.S. I also liked this line that I gave in response to someone who defended Brian Wansink’s junk science on the grounds that “science has progressed”:

To use general scientific progress as a way of justifying scientific dead-end work . . . that’s kinda like saying that the Bills made a good choice to keep starting Nathan Peterman, because Patrick Mahomes has been doing so well.

A problem I see is that the defenders of junk science are putting themselves in the position where they’re defending Science as an entity.

A supposedly fun thing I definitely won’t be repeating (A Pride post)

“My friends and I don’t wanna be here if this isn’t an actively trans-affirming space. I’m only coming if all my sisters can.” – I have no music for you today, sorry. But I do have an article about cruise ships 

(This is obviously not Andrew)

A Sunday night quickie post, from the tired side of Toronto’s Pride weekend. It’s also Pride month, and it’s 50 years on Friday since the Stonewall riots, which were a major event in LGBT+ rights activism in the US and across the world. Stan has even gone rainbow for the occasion. (And many thanks to the glorious Michael Betancourt who made the badge.)

This is a great opportunity for a party and to see Bud Lite et al.  pretend they care deeply about LGBTQIA+ people. But really it should also be a time to think about how open workplaces, departments, universities, conferences, any other place of work are to people who are lesbian, gay, bisexual, transgender, non-binary, two-spirit, gender non-conforming, intersex, or who otherwise lead lives (or wish to lead lives) that lie outside the cisgender, straight world that the majority occupies.  People who aren’t spending a bunch of time trying to hide aspects of their life are usually happier and healthier and better able to contribute to things like science than those who are.

Which I guess is to say that diversity is about a lot more than making sure that there aren’t zero women as invited speakers. (Or being able to say “we invited women but they all said no”.) Diversity is about racial and ethnic diversity, diversity of gender, active and meaningful inclusion of disabled people, diversity of sexuality, intersections of these identities, and so much more. It is not an accounting game (although zero is still a notable number).

And regardless of how many professors or style guides or blogposts tell you otherwise, there is no single gold standard absolute perfect way to deliver information. Bring yourself to your delivery. Be gay. Be femme. Be masc. Be boring. Be sports obsessed. Be from whatever country and culture you are from. We can come along for the journey. And people who aren’t willing to are not worth your time.

Anyway, I said a pile of words that aren’t really about this but are about this for a podcast, which if you have not liked the previous three paragraphs you will definitely not enjoy. Otherwise I’m about 17 mins in (but the story about the alligators is also awesome.) If you do not like adult words, you definitely should not listen.

In the spirit of Pride month please spend some time finding love for and actively showing love to queer and trans folk. And for those of you in the UK especially (but everywhere else as well), please work especially hard to affirm and love and care for and support Trans* people who are under attack on many fronts. (Not least the recent rubbish about how being required to use people’s correct names and pronouns is somehow an affront to academic freedom, as if using the wrong pronoun or name for a student or colleague is an academic position.)

And should you find yourself with extra cash, you can always support someone like Rainbow Railroad. Or your local homeless or youth homeless charity. Or your local sex worker support charity like SWOP Behind Bars or the Sex Workers Project from the Urban Justice Centre. (LGBTQ+ people have much higher rates of homelessness [especially youth homelessness] and survival sex work than straight and cis people.)

Anyway, that’s enough for now. (Or nowhere near enough ever, but I’ve got other things to do.)  Just recall what the extremely kind and glorious writer and academic Anthony Olivera said in the Washington Post: (Also definitely read this from him because it’s amazing)

We do not know what “love is love” means when you say it, because unlike yours, ours is a love that has cost us everything. It has, in living memory, sent us into exterminations, into exorcisms, into daily indignities and compromises. We cannot hold jobs with certainty nor hands without fear; we cannot be sure when next the ax will fall with the stroke of a pen.

Hope you’re all well and I’ll see you again in LGBT+ wrath month. (Or, more accurately, some time later this week to talk about the asymptotic properties of PSIS.)


Freud expert also a Korea expert

I received the following email:

Dear Dr Andrew Gelman,

I am writing to you on behalf of **. I hereby took this opportunity to humbly request you to consider being a guest speaker on our morning radio show, on 6th August, between 8.30-9.00 am (BST) to discuss North Korea working on new missiles

We would feel honoured to have you on our radio show. having you as a guest speaker would give us and our viewers a great insight into this topic, we would greatly appreciate it if you could give us 10-15 minutes of your time and not just enhance our but also our views knowledge on this topic.

We are anticipating your reply and look forward to possibly having you on our radio show.

Kind regards,


Note – All interviews are conducted over the phone

Note – Timing can be altered between 7.30- 9.00 am (BST)




This email is CONFIDENTIAL and LEGALLY PRIVILEGED. If you are not the intended recipient of this email and its attachments, you must take no action based upon them, nor must you copy or show them to anyone. If you believe you have received this email in error, please email **

I don’t know which aspect of this email is more bizarre, that they sent me an unsolicited email that concludes with bullying pseudo-legal instructions, or that they think I’m an expert on North Korea (I guess from this post; to be fair, it seems that I know more about North Korea than the people who run the World Values Survey). Don’t they know that my real expertise is on Freud?

How much is your vote worth?

Tyler Cowen writes:

If it were legal, and you tried to sell your vote and your vote alone, you might not get much more than 0.3 cents.

It depends where you live.

If you’re not voting in any close elections, then the value of your vote is indeed close to zero. For example, I am a resident of New York. Suppose someone could pay me $X to switch my vote (or, equivalently, pay me $X/2 to not vote, or, equivalently, pay a nonvoter $X/2 to vote in a desired direction) in the general election for president. Who’d want to do that? There’s not much reason at all, except possibly for a winning candidate who’d like the public relations value of winning by an even larger margin, or for a losing candidate who’d like to lose by a bit less, to look like a more credible candidate next time, or maybe for some organization that would like to see voter turnout reach some symbolic threshold such as 50% or 60%.

If you’re living in a district with a close election, the story is quite different, as Edlin, Kaplan, and I discussed in our paper. In some recent presidential elections, we’ve estimated the ex ante probability of your vote being decisive in the national election (that is, decisive in your state, and, conditional on that, your state being decisive in the electoral college) as being approximately 1 in 10 million in swing states.

Suppose you live in one of those states? Then, how much would someone pay for your vote, if it were legal and moral to do so? I’m pretty sure there are people out there who would pay a lot more than 0.3 cents. If a political party or organization would drop, say, $100M to determine the outcome of the election, then it would be worth $10 to switch one person’s vote in one of those swing states.
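The $10 figure is an expected-value calculation: the amount someone would pay to determine the outcome, times the probability that one vote is decisive. Here is a minimal sketch. The function name is made up; the $100M stake is the hypothetical from the paragraph above, and the odds of 1 in 10 million are the value that reproduces the $10.

```python
def vote_value_dollars(stake_dollars, one_in):
    """Expected value of one decisive-with-probability-1/one_in vote
    to someone who would pay stake_dollars to determine the outcome."""
    return stake_dollars / one_in


# $100M stake, vote decisive with probability 1 in 10 million.
swing_state_value = vote_value_dollars(100_000_000, 10_000_000)

# Compare with the 0.3 cents quoted above (0.3 cents = $0.003).
ratio_to_quoted_price = swing_state_value / 0.003

print(swing_state_value, round(ratio_to_quoted_price))
```

Under these assumptions a swing-state vote comes out thousands of times more valuable than 0.3 cents, and the number scales linearly in both the stake and the probability, so you can plug in your own.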

We can also talk about this empirically. Campaigns do spend money to flip people’s votes and to get voters to turn out. They spend a lot more than 0.3 cents per voter. Now, sure, not all this is for the immediate goal of winning the election right now: for example, some of it is to get people to become regular voters, in anticipation of the time when their vote will make a difference. There’s a difference between encouraging people to turn out and vote (which is about establishing an attitude and a regular behavior) and paying for a single vote with no expectation of future loyalty. That said, even a one-time single vote should be worth a lot more than 0.3 cents to a campaign in a swing state.

tl;dr. Voting matters. Your vote is, in expectation, worth something real.