Multilevel structured (regression) and post-stratification

My enemies are all too familiar. They’re the ones who used to call me friend – Jawbreaker

Well I am back from Australia where I gave a whole pile of talks and drank more coffee than is probably a good idea. So I’m pretty jetlagged and I’m supposed to be writing my tenure packet, so obviously I’m going to write a long-ish blog post about a paper that we’ve done on survey estimation that just appeared on arXiv. We, in this particular context, is my stellar grad student Alex Gao, the always stunning Lauren Kennedy, the eternally fabulous Andrew Gelman, and me.

What is our situation?

When data is a representative sample from the population of interest, life is peachy. Tragically, this never happens. 

Maybe a less exciting way to say that would be that your sample is representative of a population, but it might not be an interesting population. An example of this would be a psychology experiment where the population is mostly psychology undergraduates at the PI’s university. The data can support reasonable conclusions about this population (assuming sufficient sample size and decent design etc), but this may not be a particularly interesting population for people outside of the PI’s lab. Lauren and Andrew have a really great paper about this!

It’s also possible that the population that is being represented by the data is difficult to quantify.  For instance, what is the population that an opt-in online survey generalizes to?

Moreover, it’s very possible that the strata of the population have been unevenly sampled on purpose. Why would someone visit such violence upon their statistical inference? There are many many reasons, but one of the big ones is ensuring that you get enough samples from a rare population that’s of particular interest to the study. Even though there are good reasons to do this, it can still bork your statistical analysis.

All in all, dealing with non-representative data is a difficult thing and it will surprise exactly no one to hear that there are a whole pile of approaches that have been proposed from the middle of last century onwards.

Maybe we can weight it

Maybe the simplest method for dealing with non-representative data is to use sample weights. The purest form of this idea occurs when the population is stratified into J subgroups of interest and data is drawn independently at random from the jth population with probability \pi_j.  From this data it is easy to compute the sample average for each subgroup, which we will call \bar{y}_j. But how do we get an estimate of the population average from this?

Well just taking the average of the averages probably won’t work–if one of the subgroups has a different average from the others it’s going to give you the wrong answer.  The correct answer, aka the one that gives an unbiased estimate of the mean, was derived by Horvitz and Thompson in the early 1950s. To get an unbiased estimate of the mean you need to use the subgroup means and the sampling probabilities.  The Horvitz-Thompson estimator has the form

\bar{y}^{\text{HT}} = \frac{1}{N}\sum_{j=1}^{J}\frac{n_j \bar{y}_j}{\pi_j},

where n_j is the number of responses from subgroup j and N is the size of the population.


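To make the contrast concrete, here is a small simulation (all the numbers are toy values of mine, not from any real survey) comparing the naive average of subgroup means with the Horvitz-Thompson estimate when subgroups are sampled at very different rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: 3 subgroups with known sizes and different true means.
N_j = np.array([5000, 3000, 2000])      # subgroup sizes
mu_j = np.array([1.0, 2.0, 5.0])        # true subgroup means
N = N_j.sum()
true_mean = (N_j * mu_j).sum() / N      # 2.1

# Each member of subgroup j is sampled independently with probability pi_j.
pi_j = np.array([0.01, 0.05, 0.20])
n_j = rng.binomial(N_j, pi_j)           # realized sample sizes
ybar_j = np.array([rng.normal(m, 1.0, n).mean() for m, n in zip(mu_j, n_j)])

# Naive average of subgroup means ignores the design and is badly off.
naive = ybar_j.mean()

# Horvitz-Thompson: each sampled unit in group j stands in for 1/pi_j units.
ht = (n_j * ybar_j / pi_j).sum() / N

print(f"true {true_mean:.3f}  naive {naive:.3f}  HT {ht:.3f}")
```

The naive estimate lands near 2.67 no matter how you sample, because it ignores the fact that the third subgroup is both small and heavily over-sampled.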
Now, it is a truth universally acknowledged, if perhaps not universally understood, that unbiasedness is really only a meaningful thing if a lot of other things are going very well in your inference. In this case, it really only holds if the data was sampled from the population with the given probabilities.  Most of the time that doesn’t really happen. One of the problems is non-response bias, which (as you can maybe infer from the name) is the bias induced by non-response. 

(There are ways through this, like raking, but I’m not going to talk about those today).

Poststratification: flipping the problem on its head

One way to think about poststratification is that instead of making assumptions about how the observed sample was produced from the population, we make assumptions about how the observed sample can be used to reconstruct the rest of the population.  We then use this reconstructed population to estimate the population quantities of interest (like the population mean).

The advantage of this viewpoint is that we are very good at prediction. It is one of the fundamental problems in statistics (and machine learning because why not). This viewpoint also suggests that our target may not necessarily be unbiasedness but rather good prediction of the population. It also suggests that, if we can stomach a little bias, we can get much tighter estimates of the population quantity than survey weights can give. That is, we can trade off bias against variance!
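Here is a minimal sketch of that reconstruction logic, assuming we know the population cell counts (the counts, means, and sample sizes below are made up for illustration): predict a mean for each cell, then average the predictions using the population cell sizes as weights instead of the sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Known population composition per age cell (e.g. from a census).
N_j = np.array([2200, 3500, 2500, 1800])   # cells: 18-29, 30-49, 50-64, 65+

# True mean response per cell (unknown to the analyst).
mu_j = np.array([0.70, 0.55, 0.45, 0.30])

# Non-representative sample: older cells massively over-sampled.
n_j = np.array([20, 60, 200, 400])
ybar_j = np.array([rng.normal(m, 0.5 / np.sqrt(n)) for m, n in zip(mu_j, n_j)])

raw = (n_j * ybar_j).sum() / n_j.sum()     # raw sample mean: skews old
post = (N_j * ybar_j).sum() / N_j.sum()    # poststratified estimate
truth = (N_j * mu_j).sum() / N_j.sum()     # 0.513
print(f"truth {truth:.3f}  raw {raw:.3f}  poststratified {post:.3f}")
```

In a real MRP analysis the cell predictions come from a fitted multilevel model rather than raw cell means, but the poststratification step is exactly this weighted average.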

Of course, anyone who tells you they’re doing assumption free inference is a dirty liar, and the fewer assumptions we have the more desperately we cling to them. (Beware the almost assumption-free inference. There be monsters!) So let’s talk about the two giant assumptions that we are going to make in order for this to work.

Giant assumption 1: We know the composition of our population. In order to reconstruct the population from the sample, we need to know how many people or things should be in each subgroup. This means that we are restricted in how we can stratify the population. For surveys of people, we typically build out our population information from census data, as well as from smaller official surveys like the American Community Survey (for estimating things about the US! The ACS is less useful in Belgium.).  (This assumption can be relaxed somewhat by clever people like Lauren and Andrew, but poststratifying to a variable that isn’t known in the population is definitely an advanced skill.)

Giant assumption 2: The people who didn’t answer the survey are like the people who did answer the survey. There are a few ways to formalize this, but one that is clear to me is that we need two things. First, that the people who were asked to participate in the survey in subgroup j are a random sample of subgroup j. Second, that the people who actually answered the survey in subgroup j are a random sample of the people who were asked.  These sorts of missing at random or missing completely at random or ignorability assumptions are pretty much impossible to verify in practice. There are various clever things you can do to relax some of them (e.g. throw a handful of salt over your left shoulder and whisper “causality” into a mouldy tube sock found under a teenage boy’s bed), but for the most part this is the assumption that we are making.

A thing that I hadn’t really appreciated until recently is that this also gives us some way to do model assessment and checking.  There are two ways we can do this. Firstly we can treat the observed data as the full population and fit our model to a random subsample and use that to assess the fit by estimating the population quantity of interest (like the mean). The second method is to assess how well the prediction works on left out data in each subgroup. This is useful because poststratification explicitly estimates the response in the unobserved population, so how good the predictions are (in each subgroup!) is a good thing to know!

This means that tools like LOO-CV are still useful, although rather than looking at a global LOO-elpd score, it would be more useful to look at it for each unique combination of stratifying variables. That said, we have a lot more work to do on model choice for survey data.
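A minimal sketch of the second kind of check, using within-cell sample means as a stand-in for whatever model you actually fit (the cell means and sample sizes are invented): hold out part of the data in each cell and score the predictions cell by cell rather than with one global score:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed (non-representative) survey: responses grouped by cell.
data = {j: rng.normal(mu, 1.0, n)
        for j, (mu, n) in enumerate(zip([0.2, 0.8, 1.5], [30, 120, 15]))}

# Hold out ~20% within each cell and score the predictions per cell.
rmse = {}
for j, y in data.items():
    idx = rng.permutation(len(y))
    cut = max(1, len(y) // 5)
    test_y, train_y = y[idx[:cut]], y[idx[cut:]]
    pred = train_y.mean()            # stand-in for a model's cell prediction
    rmse[j] = np.sqrt(((test_y - pred) ** 2).mean())

for j, r in rmse.items():
    print(f"cell {j}: held-out RMSE = {r:.2f}")
```

The per-cell scores flag exactly the sparse cells where a richer model (or stronger pooling) is needed, which a single pooled score would hide.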

So if we have a way to predict the responses of the unobserved members of the population, we can make estimates based on non-representative samples. So how do we do this prediction?

Enter Mister P

Mister P (or MRP) is a grand old dame. Since Andrew and Thomas Little  introduced it in the mid-90s, a whole lot of hay has been made from the technique. It stands for Multilevel Regression and Poststratification and it kinda does what it says on the box. 

It uses multilevel regression to predict what unobserved data in each subgroup would look like, and then uses poststratification to fill in the rest of the population values and make predictions about the quantities of interest.

(This is a touch misleading. What it does is estimate the distribution of each subgroup mean and then use poststratification to turn these into an estimate of the distribution of the mean for the whole population. Mathematically it’s the same thing, but it’s much more convenient than filling in each response in the population.)

And there are scads of literature suggesting that this approach works very well. Especially if the multilevel structure and the group-level predictors are chosen well.

But no method is perfect, and in our paper we zero in on one corner of the framework that can be improved. In particular, we look at the effect that using structured priors within the multilevel regression has on the poststratified estimates. These changes turn out not to massively change whole-population quantities, but they can greatly improve the estimates within subpopulations.

What are the challenges with using multilevel regression in this context?

The standard formulation of Mister P treats each stratifying variable the same (allowing for a varying intercept and maybe some group-specific effects). But maybe not all stratifying variables are created equal.  (But all stratifying variables will be discrete because it is not the season for suffering. November is the season for suffering.)

Demographic variables like gender or race/ethnicity have a number of levels that are more or less exchangeable. Exchangeability has a technical definition, but one way to think about it is that a priori we think that the size of the effect of a particular gender on the response has the same distribution as the size of the effect of another gender on the response (perhaps after conditioning on some things).

From a modelling perspective, we can codify this as making the effect of each level of the demographic variable a different independent draw from the same normal distribution. 

In this setup, information is shared between different levels of the demographic variable because we don’t know what the mean and standard deviation of the normal distribution will be. These parameters are (roughly) estimated using information from the overall effect of that variable (total pooling) and from the variability of the effects estimated independently for each group (no pooling). 

But this doesn’t necessarily make sense for every type of demographic variable. One example that we used in the paper is age, where it may make more sense to pool information more strongly from nearby age groups than from distant age groups. A different example would be something like state, where it may make sense to pool information from nearby states rather than from the whole country.

We can incorporate this type of structured pooling using what we call structured priors in the multilevel model. Structured priors are everywhere: Gaussian processes, time series models (like AR(1) models), conditional autoregressive (CAR) models, random walk priors, and smoothing splines are all commonly used examples.

But just because you can do something doesn’t mean you should. This leads to the question that inspired this work:

When do structured priors help MRP?

Structured priors typically lead to more complex models than the iid varying intercept model that a standard application of the MRP methodology uses. This extra complexity means that we have more space to achieve our goal of predicting the unobserved survey responses.

But as the great sages say: with low power comes great responsibility. 

If the sample size is small or if the priors are set wrong, this extra flexibility can lead to high-variance predictions and will lead to worse estimation of the quantities of interest. So we need to be careful.

As much as I want it to, this isn’t going to turn into a(nother) blog post about priors. But it’s worth thinking about. I’ve written about it at length before and will write about it at length again. (Also there’s the wiki!)

But to get back to the question, the answer depends on how we want to pool information. In a standard multilevel model, we augment the information within each subgroup with the whole-population information.  For instance, if we are estimating a mean and we have one varying intercept, it’s a tedious algebra exercise to show that

\mathbb{E}(\mu_j \mid y)\approx\frac{\frac{n_j}{\sigma^2} \bar{y}_j+\frac{1}{\tau^2}\bar{y}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}},

so we’ve borrowed some extra information from the raw mean of the data \bar{y} to augment the local means \bar{y}_j when they don’t have enough information.
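Plugging toy numbers into that formula shows the mechanism (sigma, tau, the overall mean, and the group sizes below are assumed values, purely for illustration): groups with lots of data keep their own mean, while sparse groups get shrunk toward the overall mean \bar{y}:

```python
# Precision-weighted pooling for one varying intercept:
#   E(mu_j | y) ~ ((n_j / sigma^2) * ybar_j + (1 / tau^2) * ybar)
#                 / ((n_j / sigma^2) + (1 / tau^2))
sigma, tau = 1.0, 0.5
ybar = 2.0                                  # overall (raw) mean of the data

def pooled_mean(n_j, ybar_j):
    w_local = n_j / sigma**2                # precision of the group mean
    w_global = 1 / tau**2                   # precision of the shared prior
    return (w_local * ybar_j + w_global * ybar) / (w_local + w_global)

print(pooled_mean(200, 3.0))  # big group: barely moves from its own mean 3.0
print(pooled_mean(2, 3.0))    # tiny group: pulled hard toward ybar = 2.0
```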

But if our population is severely unbalanced and the different groups have vastly different responses, this type of pooling may not be appropriate.

A canny reader might say “well what if we put weights in so we can shrink to a better estimate of the population mean?”. Well that turns out to be very difficult.

Everybody needs good neighbours (especially when millennials don’t answer the phone)

The solution we went with was to use a random walk prior on the age effect. This type of prior prioritizes pooling information from nearby age categories.  We found that this makes a massive difference to the subpopulation estimates, especially when some age groups are less likely to answer the phone than others.
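Here is a rough sketch of the idea in the conjugate Gaussian case, where the posterior mean under a first-order random-walk prior reduces to a penalized least-squares problem (the response curve, sample sizes, and prior scales are all invented for illustration; the paper fits full Bayesian models rather than this closed form):

```python
import numpy as np

rng = np.random.default_rng(3)

ages = np.arange(18, 81)                 # model age at yearly resolution
f_true = 0.8 * np.sin((ages - 18) / 12)  # smooth "true" age effect

# Uneven non-response: hardly any young respondents answer the phone.
n = np.where(ages < 35, 1, 25)
ybar = f_true + rng.normal(0, 1.0 / np.sqrt(n))

# Posterior mean under a first-order random-walk prior on the age effect
# (Gaussian likelihood, so it reduces to penalized least squares):
#   minimize  sum_j n_j (ybar_j - mu_j)^2 / sigma^2
#           + sum_j (mu_j - mu_{j-1})^2 / tau^2
sigma, tau = 1.0, 0.15
D = np.diff(np.eye(len(ages)), axis=0)   # first-difference matrix
Q = np.diag(n / sigma**2) + D.T @ D / tau**2
mu_hat = np.linalg.solve(Q, n * ybar / sigma**2)

print("mean abs error, raw cell means:", np.abs(ybar - f_true).mean())
print("mean abs error, RW-smoothed:   ", np.abs(mu_hat - f_true).mean())
```

The noisy, under-sampled young cells borrow strength from their neighbours instead of from the (very different) overall mean, which is exactly the kind of pooling an iid varying intercept can’t do.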

We put this all together into a detailed simulation study that showed that you can get some real advantages to doing this!

We also used this technique to analyze some phone survey data from The Annenberg Public Policy Center of the University of Pennsylvania about popular support for marriage equality in 2008. This example was chosen because, even in 2008, young people had a tendency not to answer their phones. Moreover, we expect the support for marriage equality to be different among different age groups.  Things went well. 

How to bin ordinal variables (don’t!)

One of the advantages of our strategy is that we can treat variables like age at their natural resolution (e.g. year) while modelling, and then predict the distribution of the responses in an aggregated category where we have enough demographic information to do poststratification.

This breaks an awkward dependence between modelling choices and the assumptions needed to do poststratification. 

Things that are still to be done!

No paper is complete, so there are a few things we think are worth looking at now that we know that this type of strategy works.

  • Model selection: How can you tell which structure is best?
  • Prior choice: Always an issue!
  • Interactions: Some work has been done on using BART with MRP (they call it … BARP). This should cover interaction modelling, but doesn’t really allow for the types of structured modelling we’re using in this paper.
  • Different structures: In this paper, we used an AR(1) model and a second order random walk  model (basically a spline!). Other options include spatial models and Gaussian process models. We expect them to work the same way.

What’s in a name? (AKA the tl;dr)

I (and really no one else) really want to call this Ms P, which would stand for Multilevel Structured regression with Poststratification.

But regardless of name, the big lessons of this paper are:

  1. Using structured priors allows us to pool information in a more problem-appropriate way than standard multilevel models do, especially when stratifying our population according to an ordinal or spatial variable. 
  2. Structured priors are especially useful when one of the stratifying variables is ordinal (like age) and the response is expected to depend (possibly non-linearly) on this variable.
  3. The gain from using structured priors increases when certain levels of the ordinal stratifying variable are over- or under-sampled. (Eg if young people stop answering phone surveys.)

So go forth and introduce yourself to Ms P. You’ll like her.

You should (usually) log transform your positive data

The reason for log transforming your data is not to deal with skewness or to get closer to a normal distribution; that’s rarely what we care about. Validity, additivity, and linearity are typically much more important.

The reason for log transformation is that in many settings it makes additive and linear models make more sense. A multiplicative model on the original scale corresponds to an additive model on the log scale. For example, a treatment that increases prices by 2%, rather than a treatment that increases prices by $20. The log transformation is particularly relevant when the data vary a lot on the relative scale. Increasing prices by 2% has a much different dollar effect for a $10 item than for a $1000 item. This example also gives some sense of why a log transformation won’t be perfect either, and ultimately you can fit whatever sort of model you want—but, as I said, in most cases I’ve seen of positive data, the log transformation is a natural starting point.
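A tiny demonstration of the multiplicative-becomes-additive point (prices simulated from a lognormal, treatment effect fixed at 2% by construction):

```python
import numpy as np

rng = np.random.default_rng(4)

# Prices spanning orders of magnitude; treatment multiplies each by 1.02.
base = rng.lognormal(mean=4.0, sigma=1.5, size=10_000)
treated = base * 1.02

# On the raw scale the "effect" depends entirely on the item's price...
raw_effect = treated - base
print(raw_effect.min(), raw_effect.max())   # wildly different dollar effects

# ...on the log scale it's a single additive constant for every item.
log_effect = np.log(treated) - np.log(base)
print(log_effect.std())                     # ~0: identical effect everywhere
print(np.exp(log_effect.mean()))            # recovers the 2% multiplier
```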

The above is all background; it’s stuff that we’ve all said many times before.

What’s new to me is this story from Shravan Vasishth:

You’re routinely being cited as endorsing the idea that model assumptions like normality are the least important of all in a linear model:

This statement of yours is not meant to be a recommendation to NHST users. But it is being misused by psychologists and psycholinguists in the NHST context to justify analyzing untransformed all-positive dependent variables and then making binary decisions based on p-values. Could you clarify your point in the next edition of your book?

I just reviewed a paper in JML (where we published our statistical significance filter paper) by some psychologists that insist that all data be analyzed using untransformed reaction/reading times. They don’t cite you there, but threads like the one above do keep citing you in the NHST context. I know that on p 15 of Gelman and Hill you say that it is often helpful to log transform all-positive data, but people selectively cite this other comment in your book to justify not transforming.

There are data-sets where 3 out of 547 data points drive the entire p<0.05 effect. With a log transform there would be nothing to claim and indeed that claim is not replicable. I discuss that particular example here.

I responded that (a) I hate twitter, and (b) In the book we discuss the importance of transformations in bringing the data closer to a linear and additive model.

Shravan threw it back at me:

The problem in this case is not really twitter, in my opinion, but the fact that people . . . read more into your comments than you intended, I suspect. What bothers me is that they cite Gelman as endorsing not ever log-transforming all-positive data, citing that one comment in the book out of context. This is not the first time I saw the Gelman and Hill quote being used. I have seen it in journal reviews in which reviewers insisted I analyze data on the untransformed values.

I replied that is really strange given that in the book we explicitly discuss log transformation.

From page 59:

It commonly makes sense to take the logarithm of outcomes that are all-positive.

From page 65:

If a variable has a narrow dynamic range (that is, if the ratio between the high and low values is close to 1), then it will not make much of a difference in fit if the regression is on the logarithmic or the original scale. . . . In such a situation, it might seem to make sense to stay on the original scale for reasons of simplicity. However, the logarithmic transformation can make sense even here, because coefficients are often more easily understood on the log scale. . . . For an input with a larger amount of relative variation (for example, heights of children, or weights of animals), it would make sense to work with its logarithm immediately, both as an aid in interpretation and likely an improvement in fit too.

Are there really people going around saying that we endorse not ever log-transforming all-positive data? That’s really weird.

Apparently, the answer is yes. According to Shravan, people are aggressively arguing for not log-transforming.

That’s just wack.

Log transform, kids. And don’t listen to people who tell you otherwise.

Did that “bottomless soup bowl” experiment ever happen?

I’m trying to figure out if Brian “Pizzagate” Wansink’s famous “bottomless soup bowl” experiment really happened.

Way back when, everybody thought the experiment was real. After all, it was described in a peer-reviewed journal article.

Here’s my friend Seth Roberts in 2006:

An experiment in which people eat soup from a bottomless bowl? Classic! Or mythological: American Sisyphus. It really happened.

And here’s econ professor Richard Thaler and law professor Cass Sunstein in 2008:

Given that they described this experiment as a “masterpiece,” I assume they thought it was real.

Evidence that the experiment never happened

We’ve known for a while that some of the numbers in the Wansink et al. “Bottomless bowls” article were fabricated, or altered, or mis-typed, or mis-described, or something. Here’s James Heathers with lots of details.

But I’d just assumed there really had been such an experiment . . . until I encountered two recent blog comments by Jim and Mary expressing skepticism:

For me, for sure, if I got a 6oz soup bowl that refilled itself without me knowing I’d just go right on eating gallon after gallon of soup, never noticing. . . . There’s no way he even did that! That has to be a complete fabrication.

If you try to imagine designing the refilling soup bowl, it gets harder and harder the more you think about it. The soup has to be entering the bowl at exactly the right rate. . . . I don’t think they really did this experiment. They got as far as making the bowls and stuff, but then it was too hard to get it to work, and they gave up. This would explain why an experimental design with 2 bottomless and 2 non-bottomless subjects per table ended up with 23 controls and 31 manipulations . . .

I searched the internet and found a photo of the refilling soup bowl. Go to 2:36 at this video.

See also this video with actors (Cornell students, perhaps?) which purports to demonstrate how the bowl could be set up in a restaurant. The video is obviously fake so it doesn’t give me any sense of how they could’ve done it in real life.

I also found this video where Wansink demonstrates the refilling bowl. But this bowl, unlike the one in the previous demonstration, is attached to the table so I don’t see how it could ever be delivered to someone sitting at a restaurant.

So when you look at it that way: an absurdly complicated apparatus, videos that purport to be reconstructions but which lack plausibility, and no evidence of any real data . . . It seems that the whole thing could be a fake, that there was no experiment after all. Maybe they built the damn thing, tried it out on some real students, it didn’t work, and then they made up some summary statistics to put in the article. Or they did the experiment in some other way—for example, just giving some people more soup than others, with the experimentalists rationalizing it to themselves that this was essentially equivalent to that bottomless-bowl apparatus—and then fudged the data at the end to get statistically significant and publishable results.

Or maybe it all happened as described, and someone just mistyped a bunch of numbers which is why the values in the published paper didn’t add up.

To paraphrase Jordan Anaya: I dunno. If I’d just designed and carried out the most awesome experiment of my career—a design that some might call a “masterpiece”—I think I’d be pretty damn careful with the data that resulted. I’d’ve made something like 50 copies of the dataset to make sure it never got lost, and I’d triple-check all my analyses to make sure I didn’t make any mistakes. I might even bring in two trusted coauthors just to be 100% sure that there were no missteps. I wouldn’t want to ruin this masterpiece.

It’s as if Wansink had found some rare and expensive crystal goblet and then threw it in the back of a pickup truck to bring it home. A complete disconnect between the huge effort required to purportedly collect the data, and the zero or negative effort expended on making sure the data didn’t get garbled or destroyed.

Evidence that the experiment did happen

On the other hand . . .

Perhaps the strongest argument in favor of the experiment being real is that there were three authors on that published paper. So if the whole thing was made up, it wouldn’t be just Brian Wansink doing the lying, it would also be James Painter and Jill North. That moves our speculation into the conspiracy category.

That said, we don’t know how the project was conducted. It might be that Wansink took responsibility for the data collection, and Painter and North were involved before and after and just took Wansink’s word for it that the experiment was actually done. Or maybe there is some other possibility.

Another piece of evidence in favor of the experiment being real is that Wansink and his colleagues put a lot of effort into explaining how the bowl worked. There are three paragraphs in Wansink et al. (2005) describing how they constructed the apparatus, how it worked, and how they operated it. Wansink also devotes a few pages of his book, Mindless Eating, to the soup experiment, providing further details; for example:

Our bottomless bowls failed to function during the first practice trial. The chicken noodle soup we were using either clogged the tubes or caused the soup to gurgle strangely. We bought 360 quarts of Campbell’s tomato soup, and started over.

I’m kinda surprised they ever thought the refilling bowl would work with chicken noodle soup—isn’t it obvious that it would clog the tube or clump in some way?—but, hey, dude’s a b-school professor, not a physicist, I guess we should cut him some slack.

Scrolling through the Mindless Eating on Amazon, I also came across this:

It seems that when estimating almost anything—such as weight, height, brightness, loudness, sweetness, and so on—we consistently underestimate things as they get larger. For instance, we’ll be fairly accurate at estimating the weight of a 2-pound rock but will grossly underestimate the weight of an 80-pound rock. . . .

They’re having people lift 80-pound rocks? That’s pretty heavy! I wonder what the experimental protocol for that is. (I guess they could ask people to estimate the weight of the rock by just looking at it, but that would be tough for lots of reasons.)

But I digress. To return to the soup experiment, Wansink also provides this story about one of the few people who had to be excluded from the data:

Cool story, huh? Not quite consistent with the published paper, which simply said that 54 participants were recruited for the study, but at least some recognition that moving the soup bowl could create a problem.


Did the experiment ever happen? I just don’t know! I see good arguments on both sides.

I can tell you one thing, though. Whether or not Wansink’s apparatus ever made its way out of the lab, it seems that the “bottomless soup bowl” has been used in at least one real experiment. I found this paper from 2012, Episodic Memory and Appetite Regulation in Humans, by Jeffrey Brunstrom et al., which explains:

Soup was added or removed from a transparent soup bowl using a peristaltic pump (see Figure 1). The soup bowl was presented in front of the volunteers and it was fixed to a table. A tall screen was positioned at the back of the table. This separated the participant from both the experimenter and a second table, supporting the pump and a soup reservoir. Throughout the experiment, the volunteers were unable to see beyond the screen.

The bottom of the soup bowl was connected to a length of temperature-insulated food-grade tubing. This connection was hidden from the participants using a tablecloth. The tubing fed through a hole in the table (immediately under the bowl) and connected to the pump and then to a reservoir of soup via a hole in the screen. The experimenter was able to manipulate the direction and rate of flow using an adjustable motor controller that was attached to the pump. The pre-heated soup was ‘creamed tomato soup’ (supplied by Sainsbury’s Supermarkets Ltd., London; 38 kcal/100 g).


Participants were then taken to a testing booth where a bowl of soup was waiting. They were instructed to avoid touching the bowl and to eat until the volume of soup remaining matched a line on the side of the bowl. The line ensured that eating terminated with 100 ml of soup remaining, thereby obscuring the bottom of the bowl.

So it does seem like the bottomless soup bowl experiment is possible, if done carefully. The above-linked article by Brunstrom et al. seems completely real. If it’s a fake, it’s fooled me! If it’s real, and Wansink et al. (2005) was fake, then this is a fascinating case of a real-life replication of a nonexistent study. Kind of like if someone were to breed a unicorn.

“The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly.”

Ricardo Vieira writes:

I recently came upon this study from Princeton published in PNAS:

Implicit model of other people’s visual attention as an invisible, force-carrying beam projecting from the eyes

In which the authors asked people to demonstrate how much you have to tilt an object before it falls. They show that when a human head is looking at the object in the direction that it is tilting, people implicitly rate the tipping point as being lower than when a person is looking in the opposite direction (as if the eyes either pushed the object down or prevented it from falling). They further show that no such difference emerges when the human head is blindfolded. They repeated the experiment a few times with different populations (online and local) and slight modifications.

In a subsequent survey, they found that actually 5% of the population seems to believe in some form of eye-beams (or extramission if you want to be technical).

I have a few issues with the article. For starters, they do not directly compare the non-blindfolded and blindfolded conditions, although they emphasize several times that the difference in the first is significant and in the second is not. This point was actually brought up in the blog Neuroskeptic. The author of the blog writes:

This study seems fairly solid, although it seems a little fortuitous that the small effect found by the n=157 Experiment 1 was replicated in the much smaller (and hence surely underpowered) follow-up experiments 2 and 3C. I also think the stats are affected by the old erroneous analysis of interactions error (i.e. failure to test the difference between conditions directly) although I’m not sure if this makes much difference here.

In the discussion that ensued, one of the study authors responds to the two points raised. I feel the first point is not that relevant, as the first experiment was done on mturk and the subsequent ones in a controlled lab, and the estimated standard errors were pretty similar across the board. Now on to the second point, the author writes:

The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly. First, it should be noted that each of the bars shown in the figure is already a difference between two means (mean angular tilt toward the face vs. mean angular tilt away from the face), not itself a raw mean. What we report, in each case, is a statistical test on a difference between means. If I interpret your argument correctly, it suggests that the critical comparison for us is not this tilt difference itself, but the difference of tilt differences. In our study, however, I would argue that this is not the case, for a couple of reasons:

In experiment 1 (a similar logic applies to exp 2), we explicitly spelled out two hypotheses. The first is that, when the eyes are open, there should be a significant difference between tilts toward the face and tilts away from the face. A significant difference here would be consistent with a perceived force emanating from the eyes. Hence, we performed a specific, within-subjects comparison between means to test that specific hypothesis. Doing away with that specific comparison would remove the critical statistical test. Our main prediction would remain unexamined. Note that we carefully organized the text to lay out this hypothesis and report the statistics that confirm the prediction. The second hypothesis is that, when the eyes are closed, there should be no significant difference between tilts toward the face and tilts away from the face (null hypothesis). We performed this specific comparison as well. Indeed, we found no statistical evidence of a tilt effect when the eyes were closed. Thus, each hypothesis was put to statistical test. One could test a third hypothesis: any tilt difference effect is bigger when the eyes are open than when the eyes are closed. I think this is the difference of tilt differences asked for. However, this is not a hypothesis we put forward. We were very careful not to frame the paper in that way. The reason is that this hypothesis (this difference of differences) could be fulfilled in many ways. One could imagine a data set in which, when the eyes are open, the tilt effect is not by itself significant, but shows a small positivity; and when the eyes are closed, the tilt effect shows a small negativity. The combination could yield a significant difference of differences. The proposed test would then provide a false positive, showing a significant effect while the data actually do not support our hypotheses.

Of course, one could ask: why not include both comparisons, reporting on the tests we did as well as the difference of differences? There are at least two reasons. First, if we added more tests, such as the difference of differences, along with the tests we already reported, then we would be double-dipping, or overlapping statistical tests on the same data. The tests then become partially redundant and do not represent independent confirmation of anything. Second, as easy as it may sound, the difference-of-differences is not even calculable in a consistent manner across all four experiments (e.g., in the control experiment 4), and so it does not provide a standardized way to evaluate all the results.

For all of these reasons, we believe the specific statistical methods reported in the manuscript are the simplest and the most valid. I totally understand that our statistics may seem to be affected by the erroneous analysis of interactions error, at first glance. But on deeper consideration, analyzing the difference-of-differences turns out to be somewhat problematic and also not calculable for some of our data sets.

Is this reasonable?

My other issue relates to the actual effect. First, the size of the difference is not clear (the average difference is around 0.67 degrees, which is never described in terms of visual angle). I tried to draw two lines separated by 0.67 degrees, and I couldn't tell the difference unless they were superimposed, but I am not sure I got the scale correct. Second, they do not state in the article how much rotation is caused by each key-press (is this average difference equivalent to one key-press, half, two?). Finally, the participants do not see the full object rendered during the experiment, but just one vertical line. The authors argue that otherwise people would use heuristics such as moving the top corner over the opposite bottom corner. This necessity seems to refute their hypothesis (if the eye-beam bias only works on lines, then it seems of little relevance to the 3d world).

Okay, perhaps what really bothers me is the last paragraph of the article:

We speculate that an automatic, implicit model of vision as a beam exiting the eyes might help to explain a wide range of cultural myths and associations. For example, in Star Wars, a Jedi master can move an object by staring at it and concentrating the mind. The movie franchise works with audiences because it resonates with natural biases. Superman has beams that can emanate from his eyes and burn holes. We refer to the light of love and the light of recognition in someone’s eyes, and we refer to death as the moment when light leaves the eyes. We refer to the feeling of someone else’s gaze boring into us. Our culture is suffused with metaphors, stories, and associations about eye beams. The present data suggest that these cultural associations may be more than a simple mistake. Eye beams may remain embedded in the culture, 1,000 y after Ibn al-Haytham established the correct laws of optics (12), because they resonate with a deeper, automatic model constructed by our social machinery. The myth of extramission may tell us something about who we are as social animals.

Before getting to the details, let me share my first reaction, which is appreciation that Arvid Guterstam, one of the authors of the published paper, engaged directly with external criticism, rather than ignoring the criticism, dodging it, or attacking the messenger.

Second, let me emphasize the distinction between individuals and averages. In the above-linked post, Neuroskeptic writes:

Do you believe that people’s eyes emit an invisible beam of force?

According to a rather fun paper in PNAS, you probably do, on some level, believe that.

And indeed, the abstract of the article states: “when people judge the mechanical forces acting on an object, their judgments are biased by another person gazing at the object.” But this finding (to the extent that it’s real, in the sense of being something that would show up in a large study of the general population under realistic conditions) is a finding about averages. It could be that everyone behaves this way, or that most people behave this way, or that only some people behave this way: any of these can be consistent with an average difference.

Also Neuroskeptic’s summary takes a little poetic license, in that the study does not claim that most people believe that eyes emit any force; the claim is that people on average make certain judgments as if eyes emit that force.

This last bit is no big deal but I bring it up because there’s a big difference between people believing in the eye-beam force and implicitly reacting as if there was such a force. The latter can be some sort of cognitive processing bias, analogous in some ways to familiar visual and cognitive illusions that persist even if they are explained to you.

Now on to Vieira’s original question: did the original authors do the right thing in comparing significant to not significant? No, what they did was mistaken, for the usual reasons.

The author’s explanation quoted above is wrong, I believe in an instructive way. The author talks a lot about hypotheses and a bit about the framing of the data, but that’s not so relevant to the question of what can we learn from the data. Procedural discussions such as “double-dipping” also miss the point: Again, what we should want to know is what can be learned from these data (plus whatever assumptions go into the analysis), not how many times the authors “dipped” or whatever.

The fundamental fallacy I see in the authors’ original analysis, and in their follow-up explanation, is deterministic reasoning, in particular the idea that whether a comparison is “statistically significant” is equivalent to an effect being real.

Consider this snippet from Guterstam’s comment:

The second hypothesis is that, when the eyes are closed, there should be no significant difference between tilts toward the face and tilts away from the face (null hypothesis).

This is an error. A hypothesis should not be about statistical significance (or, in this case, no significant difference) in the data; it should be about the underlying or population pattern.

And this:

One could imagine a data set in which, when the eyes are open, the tilt effect is not by itself significant, but shows a small positivity; and when the eyes are closed, the tilt effect shows a small negativity. The combination could yield a significant difference of differences. The proposed test would then provide a false positive, showing a significant effect while the data actually do not support our hypotheses.

Again, the problem here is the blurring of two different things: (a) underlying effects and (b) statistically significant patterns in the data.
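The blurring can be made concrete with a small simulation. This is a sketch under invented numbers (the sample size, effect size, and noise level are assumptions, not taken from the paper): give two conditions the identical true effect, run a significance test in each, and count how often the two tests disagree.

```python
import math
import random

# Illustrative simulation (not the paper's data): both conditions have
# the SAME true effect, yet the pair of tests often disagrees.
random.seed(1)
n = 27          # hypothetical per-condition sample size
effect = 0.5    # identical true effect in both conditions (assumed)
sims = 10_000

def z_stat(xs):
    # z-test of a zero mean, treating the sd as known and equal to 1
    return (sum(xs) / len(xs)) * math.sqrt(len(xs))

one_sig = 0
for _ in range(sims):
    eyes_open = [random.gauss(effect, 1) for _ in range(n)]
    eyes_closed = [random.gauss(effect, 1) for _ in range(n)]
    sig_open = abs(z_stat(eyes_open)) > 1.96
    sig_closed = abs(z_stat(eyes_closed)) > 1.96
    one_sig += (sig_open != sig_closed)

print(one_sig / sims)  # roughly 0.4
```

With these settings, roughly 40% of the simulated experiments come out significant in one condition and non-significant in the other, even though the two underlying effects are exactly equal. That is why "significant here, not significant there" is not evidence of a difference in the underlying effects.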

A big problem

The error of comparing statistical significance to non-significance is a little thing.

A bigger mistake is the deterministic attitude by which effects are considered there or not, the whole “false positive / false negative” thing. Lots of people, I expect most statisticians, don’t see this as a mistake, but it is one.

But an even bigger problem comes in this sentence from the author of the paper in question:

The issue of how to report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly.

He’s “quite sure”—but he’s wrong. This is a big, big, big problem. People are so so so sure of themselves.

Look. This guy could well be an excellent scientist. He has a Ph.D. He’s a neuroscientist. He knows a lot of stuff I don’t know. But maybe he’s not a statistics expert. That’s ok—not everyone should be a statistics expert. Division of labor! But a key part of doing good work is to have a sense of what you don’t know.

Maybe don’t be so quite sure next time! It’s ok to get some things wrong. I get things wrong all the time. Indeed, one of the main reasons for publishing your work is to get it out there, so that readers can uncover your mistakes. As I said above, I very much appreciate that the author of this article responded constructively to criticism. I think it’s too bad he was so sure of himself on the statistics, but even that is a small thing compared to his openness to discussion.

I agree with my correspondent

Finally, I agree with Vieira that the last paragraph of the article (quoted in full above: “We speculate that an automatic, implicit model of vision as a beam exiting the eyes might help to explain a wide range of cultural myths and associations. . . . The myth of extramission may tell us something about who we are as social animals.”) is waaaay over the top. I mean, sure, who knows, but, yeah, this is story time outta control!

P.S. One amusing feature of this episode is that the above-linked comment thread has some commenters who seem to actually believe that eye-beams are real:

If “eye beam” is the proper term then I have no difficulty in registering my belief in them. Any habitué of the subway is familiar with the mysterious effect where looking at another’s face, who may be reading a book or be absorbed in his phone, maybe 20 or 30 feet away, will cause him suddenly to swivel his glance toward the onlooker. Let any who doubt experiment.

Just ask hunters or bird watchers if they exist. They know never to look directly at the animals head/eyes or they will be spooked.

I have had my arse saved by ‘sensing’ the gaze of others. This ‘effect’ is real. Completely subjective…yes. That I am here and able to write this comment…is a fact.

No surprise, I guess. There are lots of supernatural beliefs floating around, and it makes sense that they should show up all over, including on blog comment threads.

“I feel like the really solid information therein comes from non or negative correlations”

Steve Roth writes:

I’d love to hear your thoughts on this approach (heavily inspired by Arindrajit Dube’s work, linked therein):

This relates to our discussion from 2014:

My biggest takeaway from this latest: I feel like the really solid information therein comes from non or negative correlations:

• It comes before
• But it doesn’t correlate with ensuing (or it correlates negatively)

It’s pretty darned certain it isn’t caused by.

If smoking didn’t correlate with ensuing lung cancer (or correlated negatively), we’d say with pretty strong certainty that smoking doesn’t cause cancer, right?

By contrast, positive correlation only tells us that something (out of an infinity of explanations) might be causing the apparent effect of A on B. Non or negative correlation strongly disproves a hypothesis.

I’m less confident saying: if we don’t look at multiple positive and negative time lags for time series correlations, we don’t really learn anything from them?

More generally, this is basic Popper/science/falsification. The depressing takeaway: all we can really do with correlation analysis is disprove an infinite set of hypotheses, one at a time? Hoping that eventually we’ll gain confidence in the non-disproved causal hypotheses? Slow work!

It also suggests that file-drawer bias is far more pernicious than is generally allowed. The institutional incentives actually suppress the most useful, convincing findings? Disproofs?

(This all toward my somewhat obsessive economic interests: does wealth concentration/inequality cause slower economic growth one year, five years, twenty years later? The data’s still sparse…)

Roth summarizes:

“Dispositive” findings are literally non-positive. They dispose of hypotheses.

My reply:

1. The general point reminds me of my dictum that statistical hypothesis testing works the opposite way that people think it does. The usual thinking is that if a hyp test rejects, you’ve learned something, but if the test does not reject, you can’t say anything. I’d say it’s the opposite: if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish the effect of interest from pure noise.

2. That said, what you write can’t be literally true. Zero or nonzero correlations don’t stay zero or nonzero after you control for other variables. For example, if smoking didn’t correlate with lung cancer in observational data, sure, that would be a surprise, but in any case you’d have to look at other differences between the exposed and unexposed groups.

3. As a side remark, just reacting to something at the end of your email, I continue to think that file drawer is overrated, given the huge number of researcher degrees of freedom, even in many preregistered studies (for example here). Researchers have no need to bury non-findings in the file drawer; instead they can extract findings of interest from just about any dataset.

What can be learned from this study?

James Coyne writes:

A recent article co-authored by a leading mindfulness researcher claims to address the problems that plague meditation research, namely, underpowered studies; lack of meaningful control groups; and an exclusive reliance on subjective self-report measures, rather than measures of the biological substrate that could establish possible mechanisms.

The article claims adequate sample size, includes two active meditation groups and three control groups and relies on seemingly sophisticated strategy for statistical analysis. What could possibly go wrong?

I think the study is underpowered to detect meaningful differences between active treatment and control groups. The authors haven’t thought out precisely how to use the presence of multiple control groups. They rely on statistical significance as the criterion for the value of the meditation groups. But when it comes to a reckoning, they avoid the inevitably nonsignificant results that would occur in comparisons of changes over time in active versus control groups. Instead they substitute within-group analyses and peer at whether the results are significant for active treatments, but not control groups.

The article does not present power analyses but simply states that “a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007).”

There are five groups, representing two active treatments and three control groups. That means that all the relevant action depends on group by time interaction effects in pairs of active treatment and control groups, with 27 participants in each cell.

I have seen a lot of clinical trials in psychological interventions, but never one with two active treatments and three control groups. In the abstract it may seem interesting, but I have no idea what research questions would be answered by this constellation. I can’t even imagine planned comparisons that would follow up on an overall treatment (5) by time interaction effect.

The analytic strategy was to examine whether there is an overall group by time interaction effect and then to examine within-group, pre/post differences for particular variables. When these within-group differences are statistically significant for an active treatment group, but not for the control groups, it is considered a confirmation of the hypothesis that meditation is effective with respect to certain variables.

When there are within-group differences for both psychological and biological variables, it is inferred that the evidence is consistent with the biological substrate underlying the psychological changes.

There are then mediational analyses that follow a standard procedure: construction of a zero-order correlation matrix; calculation of residual change scores for each individual, with creation of dummy variables for four of the groups contrasted against the neutral control group. Simple mediation effects were then calculated for each psychological self-report variable with group assignment as the predictor variable and the physiological variable as the mediator.

I think these mediational analyses are a wasted effort because of the small number of subjects exposed to each intervention.

At this point I would usually read the article, perhaps make some calculations, read some related things, figure out my general conclusions, and then write everything up.

This time I decided to do something different and respond in real time.

So I’ll give my response, labeling each step.

1. First impressions

The article in question is Soothing Your Heart and Feeling Connected: A New Experimental Paradigm to Study the Benefits of Self-Compassion, by Hans Kirschner, Willem Kuyken, Kim Wright, Henrietta Roberts, Claire Brejcha, and Anke Karl, and it begins:

Self-compassion and its cultivation in psychological interventions are associated with improved mental health and well- being. However, the underlying processes for this are not well understood. We randomly assigned 135 participants to study the effect of two short-term self-compassion exercises on self-reported-state mood and psychophysiological responses compared to three control conditions of negative (rumination), neutral, and positive (excitement) valence. Increased self-reported-state self-compassion, affiliative affect, and decreased self-criticism were found after both self-compassion exercises and the positive-excitement condition. However, a psychophysiological response pattern of reduced arousal (reduced heart rate and skin conductance) and increased parasympathetic activation (increased heart rate variability) were unique to the self-compassion conditions. This pattern is associated with effective emotion regulation in times of adversity. As predicted, rumination triggered the opposite pattern across self-report and physiological responses. Furthermore, we found partial evidence that physiological arousal reduction and parasympathetic activation precede the experience of feeling safe and connected.

My correspondent’s concern was that the sample size was too small . . . let’s look at that part of the paper:

We recruited a total of 135 university students in the United Kingdom (27 per experimental condition . . .)

OK, so yes I’m concerned. 27 seems small, especially for a between-person design.

But is N really too small? It depends on effect size and variation.

Let’s look at the data.

Here are the basic data summaries:

I think these are averages: each dot is the average of 27 people.

The top four graphs are hard to interpret: I see there’s more variation after than before, but beyond that I’m not clear what to make of this.

So I’ll focus on the bottom three graphs, which have more data. The patterns seem pretty clear, and I expect there is a high correlation across time. I’d like to see the separate lines for each person. That last graph, of skin conductance level, is particularly striking in that the lines go up and then down in synchrony.

What’s the story here? Skin conductance seems like a clear enough outcome, even if not of direct interest it’s something that can be measured. The treatments, recall, were “two short-term self-compassion exercises” and “three control conditions of negative (rumination), neutral, and positive (excitement) valence.” I’m surprised to see such clear patterns from these treatments. I say this from a position of ignorance; just based on general impressions I would not have known to expect such consistency.

2. Data analysis

OK, now we seem to be going beyond first impressions . . .

So what data would I like to see to understand these results better? I like the graphs above, and now I want something more that focuses on treatment effects and differences between groups.

To start with, how about we summarize each person’s outcome by a single number. I’ll focus on the last three outcomes (e, f, g) shown above. Looking at the graphs, maybe we could summarize each by the average measurement during times 6 through 11. So, for each outcome, I want a scatterplot. Let y_i be person i’s average outcome during times 6 through 11, and let x_i be the outcome at baseline. For each outcome, let’s plot y_i vs x_i. That’s a graph with 135 dots; you could use 5 colors, one for each treatment. Or maybe 5 different graphs, I’m not sure. There are three outcomes, so that’s 3 graphs or a 3 x 5 grid.

I’d also suggest averaging the three outcomes for each person so now there’s one total score. Standardize each score and reverse-code as appropriate (I guess that in this case we’d flip the sign of outcome f when adding up these three). This would be the clear summary we’d need.
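The summary just described can be sketched in a few lines. Everything here is hypothetical (toy data, invented function names); it just shows the two steps: collapse each person's time series to a baseline and a post-treatment average, then standardize, reverse-code, and average the three outcomes into one total score.

```python
import statistics

def person_summary(series, window=range(6, 12)):
    """Collapse one person's time series to (baseline, mean of times 6-11)."""
    return series[0], statistics.mean(series[t] for t in window)

def composite_score(outcome_columns, flip):
    """Standardize each outcome across people, reverse-code where flagged
    (as with outcome f above), then average into one score per person."""
    z_columns = []
    for column, reverse in zip(outcome_columns, flip):
        mu = statistics.mean(column)
        sd = statistics.stdev(column)
        sign = -1 if reverse else 1
        z_columns.append([sign * (x - mu) / sd for x in column])
    return [statistics.mean(zs) for zs in zip(*z_columns)]

# Tiny synthetic example: three outcomes for four people; the second
# outcome runs in the opposite direction, so it gets flipped.
e = [1.0, 2.0, 3.0, 4.0]
f = [4.0, 3.0, 2.0, 1.0]
g = [1.5, 2.5, 3.5, 4.5]
totals = composite_score([e, f, g], flip=[False, True, False])
```

Each person then contributes one dot per scatterplot (total score vs. baseline, colored by treatment), which is a lot easier to look at than five-way ANOVA tables.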

I have the luxury of not needing to make a summary judgment on the conclusions, so I’ll just say that I’d like to see some scatterplots before going forward.

3. Other impressions

The paper gives a lot of numerical summaries of this sort:

The Group × Time ANOVA revealed no significant main effect of group, F(4,130) = 1.03, p > .05, ηp2 = .03. However, the Time × Group interaction yielded significance, F(4, 130) = 24.46, p < .001, ηp2 = .43. Post hoc analyses revealed that there was a significant pre-to-post increase in positive affiliative affect in the CBS condition, F(1, 26) = 10.53, p = .003, ηp2 = .28, 95% CI = [2.00, 8.93], the LKM-S condition, F(1, 26) = 26.79, p < .001, ηp2 = .51, 95% CI = [5.43, 12.59] and, albeit smaller, for the positive condition, F(1, 26) = 6.12, p = .020, ηp2 = .19, 95% CI = [0.69, 7.46]. In the rumination condition there was a significant decrease in positive affiliative affect after the manipulation, F(1, 26) = 38.90, p < .001, ηp2 = .60, 95% CI = [–18.79, –9.48], whereas no pre-to-post manipulation difference emerged for the control condition, F(1, 26) = .49, p = .486, ηp2 = .01, 95% CI = [–4.77, 2.33]. Interestingly, an ANCOVA (see Supplemental Material) revealed that after induction, only individuals in the LKM-S condition reported significantly higher positive affiliative affect than those in the neutral condition, and individuals in the rumination condition reported significantly lower positive affiliative affect.

This looks like word salad—or, should I say, number salad—and full of forking paths. Just a mess, as it’s some subset of all the many comparisons that could be performed. I know this sort of thing is standard data-analytic practice in many fields of research, so it’s not like this paper stands out in a bad way; still, I don’t find these summaries to be at all helpful. I’d rather do a multilevel model.

And then there’s this:

No way. I’m not even gonna bother with this.

The paper concludes with some speculations:

Short-term self-compassion exercises may exert their beneficial effect by temporarily activating a low-arousal parasympathetic positive affective system that has been associated with stress reduction, social affiliation, and effective emotion regulation

Short-term self-compassion exercises may exert their beneficial effect by temporarily increasing positive self and reducing negative self-bias, thus potentially addressing cognitive vulnerabilities for mental disorders

I appreciate that the authors clearly labeled these as speculations, possibilities, etc., and the paper’s final sentences were also tentative:

We conclude that self-compassion reduces negative self-bias and activates a content and calm state of mind with a disposition for kindness, care, social connectedness, and the ability to self-soothe when stressed. Our paradigm might serve as a basis for future research in analogue and patient studies addressing several important outstanding questions.

4. Was the sample size too small?

The authors write:

Although the sample size in this study was based on a priori power calculation for medium effect sizes in mixed measures ANOVAs and the recruitment target was met, a larger sample size may have been desirable. Overall, a sample of 135 is considered to be a good sample size for growth curve modeling (Curran, Obeidat, & Losardo, 2010) or mediation analyses for medium-to-large effects (Fritz & MacKinnon, 2007). However, some of the effects were small-to-medium rather than medium and failed to reach significance, and thus a replication in a larger sample is warranted to check the robustness of our effects.

This raises some red flags to me, as it’s been my impression that real-life effects in psychology experiments are typically much smaller than what are called “medium effect sizes” in the literature. Also I think the above paragraph reveals some misunderstanding about effect sizes in that the authors are essentially doing post-hoc power analysis, not recognizing the high variability in effect size estimates; for more background on this point, see here and here.

The other point I want to return to is the between-person design. Without any understanding of this particular subfield, I’d recommend a within-person study in the future, where you try multiple treatments on each person. If you’re worried about poisoning the well, you could do different treatments on different days.

Speaking more generally, I’d like to move the question away from sample size and toward questions of measurement. Beyond the suggestion to perform multiple treatments on each person, I’ll return to my correspondent’s questions at the top of this post, which I can’t really evaluate myself, not knowing enough about this area.

Amending Conquest’s Law to account for selection bias

Robert Conquest was a historian who published critical studies of the Soviet Union and whose famous “First Law” is, “Everybody is reactionary on subjects he knows about.” I did some searching on the internet, and the most authoritative source seems to be this quote from Conquest’s friend Kingsley Amis:

Further search led to this elaboration from philosopher Roger Scruton:

. . .

I agree with Scruton that we shouldn’t take the term “reactionary” (dictionary definition, “opposing political or social progress or reform”) too literally. Even Conquest, presumably, would not have objected to the law forbidding the employment of children as chimney sweeps.

The point of Conquest’s Law is that it’s easy to propose big changes in areas distant from you, but on the subjects you know about, you will respect tradition more, as you have more of an understanding of why it’s there. This makes sense, although I can also see the alternative argument that certain traditions might seem to make sense from a distance but are clearly absurd when looked at from close up. I guess it depends on the tradition.

In the realm of economics, for example, Engels, Keynes, and various others had a lot of direct experience of capitalism but it didn’t stop them from promoting revolution and reform. That said, Conquest’s Law makes sense and is clearly true in many cases, even if not always.

What motivated me to write this post, though, was not these sorts of rare exceptions—after all, most people who are successful in business are surely conservative, not radical, in their economic views—but rather an issue of selection bias.

Conquest was a successful academic and hung out with upper-class people, Oxbridge graduates, various people who were closer to the top than the bottom of the social ladder. From that perspective it’s perhaps no surprise that they were “reactionary” in their professional environments, as they were well ensconced there. This is not to deny the sincerity and relevance of such views, any more than we would want to deny the sincerity and relevance of radical views held by people with less exalted social positions. I’m sure the typical Ivy League professor such as myself is much more content and “reactionary” regarding the university system than would be a debt-laden student or harried adjunct. I knew some people who worked for minimum wage at McDonalds, and I think their take on the institution was a bit less reactionary than that of the higher-ups. This doesn’t mean that people with radical views want to tear the whole thing down (after all, people teach classes, work at McDonalds, etc., out of their own free will), nor that reactionaries want no change. My only point here is that the results of a survey, even an informal survey, of attitudes will depend on who you think of asking.

It’s interesting how statistical principles can help us better understand even purely qualitative statements.

A similar issue arose with baseball analyst Bill James. As I wrote a few years ago:

In 2001, James wrote:

Are athletes special people? In general, no, but occasionally, yes. Johnny Pesky at 75 was trim, youthful, optimistic, and practically exploding with energy. You rarely meet anybody like that who isn’t an ex-athlete—and that makes athletes seem special.

I’ve met 75-year-olds like that, and none of them was an ex-athlete. That’s probably because I don’t know a lot of ex-athletes. But Bill James . . . he knows a lot of athletes. He went to the bathroom with Tim Raines once! The most I can say is that I saw Rickey Henderson steal a couple bases in a game against the Orioles.

Cognitive psychologists talk about the base-rate fallacy, which is the mistake of estimating probabilities without accounting for underlying frequencies. Bill James knows a lot of ex-athletes, so it’s no surprise that the youthful, optimistic, 75-year-olds he meets are likely to be ex-athletes. The rest of us don’t know many ex-athletes, so it’s no surprise that most of the youthful, optimistic, 75-year-olds we meet are not ex-athletes. The mistake James made in the above quote was to write “You” when he really meant “I.” I’m not disputing his claim that athletes are disproportionately likely to become lively 75-year-olds; what I’m disagreeing with is his statement that almost all such people are ex-athletes. Yeah, I know, I’m being picky. But the point is important, I think, because of the window it offers into the larger issue of people being trapped in their own environments (the “availability heuristic,” in the jargon of cognitive psychology). Athletes loom large in Bill James’s world—I wouldn’t want it any other way—and sometimes he forgets that the rest of us live in a different world.

Another way to put it: Selection bias. Using a non-representative sample to draw inappropriate inferences about the population.

This does not make Conquest’s or James’s observations valueless. We just have to interpret them carefully given the data, to get something like:

Conquest: People near the top of a hierarchy typically like it there.

James: I [James] know lots of energetic elderly athletes. Most of the elderly non-athletes I know are not energetic.

Why does my academic lab keep growing?

Andrew, Breck, and I are struggling with the Stan group funding at Columbia just like most small groups in academia. The short story is that to apply for enough grants to give us a decent chance of making payroll in the following year, we have to apply for so many that our expected amount of funding goes up. So our group keeps growing, putting even more pressure on us in the future to write more grants to make payroll. It’s a better kind of problem to have than firing people, but the snowball effect means a lot of work beyond what we’d like to be doing.

Here’s a simple analysis. For the sake of argument, let’s say your lab has a $1.5M annual budget. And to keep things simple, let’s suppose all grants are $0.5M. So you need three per year to keep the lab afloat. Let’s say you have a well-oiled grant machine with a 40% success rate on applications.

Now what happens if you apply for 8 grants? There’s roughly a 30% chance you get fewer than the 3 grants you need, a 30% chance you get exactly the 3 grants you need, and a 40% chance you get more grants than you need.

If you’re like us, a 30% chance of not making payroll is more than you’d like, so let’s say you apply for 10 grants. Now there’s only a 20% chance you won’t make payroll (still not great odds!), a 20% chance you get exactly 3 grants, and a whopping 60% chance you wind up with 4 or more grants.

The more conservative you are about making payroll, the bigger this problem is.
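The back-of-the-envelope numbers above can be checked directly, assuming (a simplification on my part, not something the post claims) that grant decisions are independent coin flips, so the number of awards is Binomial(n, 0.4). The function name is mine:

```python
from math import comb

def grant_outcome_probs(n_apps: int, p: float, needed: int):
    """Under a Binomial(n_apps, p) model of independent grant decisions,
    return (P(shortfall), P(exactly enough), P(surplus))."""
    # Full probability mass function over 0..n_apps awards.
    pmf = [comb(n_apps, k) * p**k * (1 - p) ** (n_apps - k)
           for k in range(n_apps + 1)]
    short = sum(pmf[:needed])        # fewer awards than needed
    exact = pmf[needed]              # exactly the number needed
    surplus = sum(pmf[needed + 1:])  # more awards than needed
    return short, exact, surplus

for n in (8, 10):
    short, exact, surplus = grant_outcome_probs(n, 0.4, 3)
    print(f"{n} applications: shortfall {short:.0%}, "
          f"exact {exact:.0%}, surplus {surplus:.0%}")
```

Running this gives roughly 32%/28%/41% for 8 applications and 17%/21%/62% for 10, matching the rounded figures in the text. Note the asymmetry: each extra application shaves a few points off the shortfall risk but adds many more to the surplus side, which is the snowball effect in a nutshell.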

Wait and See?

It’s not quite as bad as that analysis leads one to believe, because once a lab’s rolling, it’s usually working in two-year chunks, not one-year chunks. But it takes a while to build up that critical mass.

It would be great if you could apply, wait and see, and then apply again, but it’s not so easy. Most government grants have fixed deadlines, typically once or at most twice per year. The ones like NIH that have two submission periods per year have a tendency not to fund first applications. So if you don’t apply in a cycle, it’s usually at least another year before you can apply again. Sometimes special one-time-only opportunities with partners or funding agencies come up. We also run into problems like government shutdowns; I still have two NSF grants under review that have been backed up forever (we’ve submitted and heard back on other grants from NSF in the meantime).

The situation with Stan at Columbia

We’ve received enough grants to keep us going. But we have a bunch more in process, some of which we’re cautiously optimistic about. And we’ve already received about half a grant more than we anticipated, so we’re going to have to hire even if we don’t get the ones in process.

So if you know any postdocs or others who might want to work on the Stan language in OCaml and C++, let me know. A more formal job ad will be out soon.

I happened to receive these two emails in the same day.

Russ Lyons pointed to this news article by Jocelyn Kaiser, “Major medical journals don’t follow their own rules for reporting results from clinical trials,” and Kevin Lewis pointed to this research article by Kevin Murphy and Herman Aguinis, “HARKing: How Badly Can Cherry-Picking and Question Trolling Produce Bias in Published Results?”

Both articles made good points. I just wanted to change the focus slightly, to move away from the researchers’ agency and to recognize the problem of passive selection, which is again why I like to speak of forking paths rather than p-hacking.

As always, I think the best solution is not for researchers to just report on some preregistered claim, but rather for them to display the entire multiverse of possible relevant results.

“Beyond ‘Treatment Versus Control’: How Bayesian Analysis Makes Factorial Experiments Feasible in Education Research”

Daniel Kassler, Ira Nichols-Barrer, and Mariel Finucane write:

Researchers often wish to test a large set of related interventions or approaches to implementation. A factorial experiment accomplishes this by examining not only basic treatment–control comparisons but also the effects of multiple implementation “factors” such as different dosages or implementation strategies and the interactions between these factor levels. However, traditional methods of statistical inference may require prohibitively large sample sizes to perform complex factorial experiments.

We present a Bayesian approach to factorial design. Through the use of hierarchical priors and partial pooling, we show how Bayesian analysis substantially increases the precision of estimates in complex experiments with many factors and factor levels, while controlling the risk of false positives from multiple comparisons.

Using an experiment we performed for the U.S. Department of Education as a motivating example, we perform power calculations for both classical and Bayesian methods. We repeatedly simulate factorial experiments with a variety of sample sizes and numbers of treatment arms to estimate the minimum detectable effect (MDE) for each combination.

The Bayesian approach yields substantially lower MDEs when compared with classical methods for complex factorial experiments. For example, to test 72 treatment arms (five factors with two or three levels each), a classical experiment requires nearly twice the sample size as a Bayesian experiment to obtain a given MDE.

They conclude:

Bayesian methods are a valuable tool for researchers interested in studying complex interventions. They make factorial experiments with many treatment arms vastly more feasible.

I love it. This is stuff that I’ve been talking about for a long time but have never actually done. These people really did it. Progress!
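For the classical side of the power calculations discussed above, the minimum detectable effect for a simple two-arm comparison of means has a standard closed form under the usual normal approximation. This is a generic textbook sketch, not the authors' actual calculation (their Bayesian comparison requires simulating from the hierarchical model, which I won't attempt here):

```python
from statistics import NormalDist

def classical_mde(n_per_arm: int, sigma: float = 1.0,
                  alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect for a two-arm comparison of means,
    via the standard normal-approximation formula:
    MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2 * sigma^2 / n_per_arm)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)          # ~0.84 for 80% power
    return (z_alpha + z_power) * (2 * sigma**2 / n_per_arm) ** 0.5

# MDE in standard-deviation units, 100 subjects per arm.
print(classical_mde(100))
```

With 100 subjects per arm this gives an MDE of about 0.40 standard deviations, and since MDE scales as 1/sqrt(n), halving it requires quadrupling the sample. That scaling is what makes 72-arm factorial designs so punishing for classical pairwise comparisons, and it's where partial pooling across arms buys its precision.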