What is the most important real-world data processing tip you’d like to share with others?

This question was in today’s jitts for our communication class. Here are some responses:

Invest the time to learn data manipulation tools well (e.g. tidyverse). Increased familiarity with these tools often leads to greater time savings and less frustration in future.

Hmm it’s never one tip.. I never ever found it useful to begin writing code especially on a greenfield project unless I thought of the steps to the goal. I often still write the code in outline form first and edit before entering in programming steps. Some other tips.
1. Choose the right tool for the right job. Don’t use C++ if you’re going to design a web site.
2. Document code well but don’t overdo it, and leave some unit tests or assertions inside a commented field.
3. Testing code will always show the presence of bugs not their absence (Dijkstra) but that doesn’t mean you should be a slacker.
4. Keep it simple at first, you may have to rewrite the program several times if it’s something new so don’t optimize until you’re satisfied. Finally, if you can control the L1 cache, you can control the world (Sabini).

Just try stuff. Nothing works the first time and you’ll have to throw out your meticulous plan once you actually start working. You’ll find all the hiccups and issues with your data the more time you actually spend in it.

Consider the sampling procedure and the methods (specifics of the questionnaire etc.) of data collection for “real-world” data to avoid any serious biases or flaws.

Quadruple-check your group by statements and joins!!
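One way to act on the join tip is to check that the lookup keys are unique before joining and that the row count is unchanged afterward. Here is a minimal sketch in plain Python; the tables and the helper names `duplicate_keys` and `left_join` are made up for illustration, and a real pipeline would more likely use the equivalent checks in dplyr or pandas.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def left_join(left, right, key):
    """Left-join two lists of dicts, refusing to fan out on duplicate keys."""
    dups = duplicate_keys(right, key)
    if dups:
        raise ValueError(f"right table has duplicate keys: {dups}")
    lookup = {row[key]: row for row in right}
    joined = [{**row, **lookup.get(row[key], {})} for row in left]
    # With unique right-hand keys, a left join must preserve the row count.
    assert len(joined) == len(left)
    return joined


# Hypothetical data: survey responses and a county lookup table.
responses = [
    {"id": 1, "county": "A"},
    {"id": 2, "county": "B"},
    {"id": 3, "county": "A"},
]
counties = [
    {"county": "A", "state": "NY"},
    {"county": "B", "state": "NJ"},
]

joined = left_join(responses, counties, "county")
# Rows that found no match are worth listing rather than silently keeping.
unmatched = [row for row in joined if "state" not in row]
```

The duplicate-key check is the important part: a repeated key in the lookup table is what silently multiplies rows in a join.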

Cleaning data properly is essential.

Write a script to analyze the data. Don’t do anything “manually”.

Don’t be afraid to confer with others. Even though there’s often an expectation that we all be experts in all things data processing, the fact is that we all have different strengths and weaknesses and it’s always a good idea to benefit from others’ expertise.

For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you run a survey for your research, there will always be that guy who types “b” instead of “B” or “B ” (pushing the computer’s Tab). For this reason, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.

When converting data types, for example extracting integers or converting to dates, always check the agreement between the data before and after conversion. Sometimes when I was converting levels to integers (numerical values somehow get recorded as categorical because of the existence of NA), there were errors and the results were not what I expected (e.g. “3712” converted to “1672”).
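The conversion problem described above looks like the familiar pitfall where a categorical variable's internal codes get used in place of its labels (in R, calling as.integer() on a factor returns the level codes). Assuming that is what happened, here is a small plain-Python sketch of the failure and of a round-trip check that would catch it. The data and the function name `to_int_checked` are hypothetical.

```python
labels = ["3712", "15", "NA", "208"]

# Wrong: use each label's position among the sorted distinct labels.
# This mimics what you get when a categorical's internal codes are used.
levels = sorted(set(labels))
codes = [levels.index(s) + 1 for s in labels]


def to_int_checked(labels, missing="NA"):
    """Parse labels as integers and verify each value survives a round trip."""
    parsed = [None if s == missing else int(s) for s in labels]
    for before, after in zip(labels, parsed):
        # Strict comparison: this also flags labels with padding or leading zeros.
        if after is not None and str(after) != before:
            raise ValueError(f"{before!r} became {after!r}")
    return parsed


values = to_int_checked(labels)
print(codes)   # [3, 1, 4, 2]
print(values)  # [3712, 15, None, 208]
```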

Learn dplyr.

Organisation of files and ideas is vital – constantly leave reminders of what you were doing and why you made particular choices either within the file names (indicating perhaps the date in which the code or data was updated) or within comments throughout the code that explain why you made certain decisions.

Thanks, kids!

“How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions” . . . and still stays around even after it’s been retracted

Chuck Jackson points to two items of possible interest:

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, by Richard Harris. Review here by Leonard Freedman.

Retractions do not work very well, by Ken Cor and Gaurav Sood. This post by Tyler Cowen brought this paper to my attention.

Here’s a quote from Freedman’s review:

Harris shows both sides of the reproducibility debate, noting that many eminent members of the research establishment would like to see this new practice of airing the scientific community’s dirty laundry quietly disappear. He describes how, for example, in the aftermath of their 2012 paper demonstrating that only 6 of 53 landmark studies in cancer biology could be reproduced, Glenn Begley and Lee Ellis were immediately attacked by some in the biomedical research aristocracy for their “naïveté,” their “lack of competence” and their “disservice” to the scientific community.

“The biomedical research aristocracy” . . . I like that.

From Cor and Sood’s abstract:

Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—note no concern with the cited article.

I’m reminded of this story: “A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.”

This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine, indeed more appropriate and scientific, for the same reason that we don’t say that Steph Curry is 6’2.84378″ tall.

Emile Bravo and agency

I was reading Tome 4 of the adventures of Jules (see the last item here), and it struck me how much agency the characters had. They seemed to be making their own decisions, saying what they wanted to say, etc.

Just as a contrast, I’m also reading an old John Le Carre book, and here the characters have no agency at all. They’re just doing what is necessary to make the plot run. For Le Carre, that’s fine; the plot’s what it’s all about. So that’s an extreme case.

Anyway, I found the agency of Bravo’s characters refreshing. It’s not something I think about so often when reading, but this time it struck me.

P.S. I wrote about agency a few years ago in the context of Benjamin Kunkel’s book Indecision. I did a quick search and it doesn’t look like Kunkel has written much since. Too bad. But maybe he’s doing a Klam and it will be all right.

Research topic on the geography of partisan prejudice (more generally, county-level estimates using MRP)

1. An estimate of the geography of partisan prejudice

My colleagues David Rothschild and Tobi Konitzer recently published this MRP analysis, “The Geography of Partisan Prejudice: A guide to the most—and least—politically open-minded counties in America,” written up by Amanda Ripley, Rekha Tenjarla, and Angela He.

Ripley et al. write:

In general, the most politically intolerant Americans, according to the analysis, tend to be whiter, more highly educated, older, more urban, and more partisan themselves. This finding aligns in some ways with previous research by the University of Pennsylvania professor Diana Mutz, who has found that white, highly educated people are relatively isolated from political diversity. They don’t routinely talk with people who disagree with them; this isolation makes it easier for them to caricature their ideological opponents. . . . By contrast, many nonwhite Americans routinely encounter political disagreement. They have more diverse social networks, politically speaking, and therefore tend to have more complicated views of the other side, whatever side that may be. . . .

The survey results are summarized by this map:

I’m not a big fan of the discrete color scheme, which creates all sorts of discretization artifacts—but let’s leave that for another time. In future iterations of this project we can work on making the map clearer.

There are some funny things about this map and I’ll get to them in a moment, but first let’s talk about what’s being plotted here.

There are two things that go into the above map: the outcome measure and the predictive model, and it’s all described in this post from David and Tobi.

First, the outcome. They measured partisan prejudice by asking 14 partisan-related questions, from “How would you react if a member of your immediate family married a Democrat?” to “How well does the term ‘Patriotic’ describe Democrats?” to “How do you feel about Democratic voters today?”, asking 7 questions about each of the two parties and then fitting an item-response model to score each respondent who is a Democrat or Republican on how tolerant, or positive, they are about the other party.

Second, the model. They took data from 2000 survey responses and regressed these on individual and neighborhood (census block)-level demographic and geographic predictors to construct a model to implicitly predict “political tolerance” for everyone in the country, and then they poststratified, summing these up over estimated totals for all demographic groups to get estimates for county averages, which is what they plotted.
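The poststratification step is just a population-weighted average of the model's predictions over demographic cells. Here is a minimal sketch for a single county. The cell definitions, predicted values, and counts are all made up for illustration; in the actual analysis the predictions would come from the fitted multilevel model and the counts from the voter file.

```python
# Hypothetical demographic cells in one county.
# "predicted" stands in for the model's predicted tolerance score for the cell;
# "count" stands in for the estimated number of partisans in that cell.
cells = [
    {"cell": "white, college, 65+", "predicted": 0.30, "count": 1000},
    {"cell": "white, no college, 30-64", "predicted": 0.50, "count": 3000},
    {"cell": "nonwhite, college, 18-29", "predicted": 0.60, "count": 1000},
]


def poststratify(cells):
    """Population-weighted average of cell-level predictions."""
    total = sum(c["count"] for c in cells)
    return sum(c["predicted"] * c["count"] for c in cells) / total


county_estimate = poststratify(cells)
```

Because the counts enter as weights, any state-to-state inconsistency in how the counts are produced flows straight into the county estimates.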

Having done the multilevel modeling and poststratification, they could plot all sorts of summaries, for example a map of estimated political tolerance just among whites, or a scatterplot of county-level estimated political tolerance vs. average education at the county level, or whatever. But we’ll focus on the map above.

2. Two concerns with the map and how it’s constructed

People have expressed two concerns about David and Tobi’s estimates.

First, the inferences are strongly model-based. If you’re getting estimates for 3000 counties from 2000 respondents (or even from 20,000 respondents, or 200,000), you’ll need to lean on a model. As a result, the map should not be taken to represent independent data within each county; rather, it’s a summary of a national-level model including individual and neighborhood (census block-level) predictors. As such, we want to think about ways of understanding and evaluating this model.

Second, the map shows some artifacts at state borders, most notably with Florida, South Carolina, New York state, South Dakota, Utah, and Wisconsin, also some suggestive patterns elsewhere such as the borders between Virginia and North Carolina, and Missouri and Arkansas. I’m not sure about all these (as noted above, the discrete color scheme can create apparent patterns from small variation, and there are real differences in political cultures between states; Utah comes to mind), but there are definitely some problems here, problems which David and Tobi attribute to differences between states in the voter files that are used to estimate the total number of partisans (Democrats and Republicans) in each demographic category in each county. If the voter files for neighboring states are coming from different sorts of data, this can introduce apparent differences in the poststratification stage. These counting problems are especially cumbersome because we have to estimate the total number of partisans in each demographic category in each county.

3. Four plans for further research

So, what to do about these concerns? I have four ideas, all of which involve some mix of statistics and political science research, along with good old data munging:

(a) Measurement error model for differences between states in classifications. The voter files have different meanings in different states? Model it, with some state effects that are estimated from the data and using whatever additional information we can find on the measurement and classification process.

(b) Varying intercept model plus spatial correlation as a fix to the state boundary problems. This is kind of a light, klugey version of the above option. We recognize that some state-level fix is needed, and instead of modeling the measurement error or coding differences directly, we throw in a state-level error term, along with a spatial correlation penalty term to enforce similarity across county boundaries (maybe only counting counties that are similar in certain characteristics such as ethnic breakdown and proportion urban/suburban/rural).

(c) Tracking down exactly what happened to create those artifacts at the state boundaries. Before or after doing the modeling to correct the glaring boundary artifacts, it would be good to do some model analysis to work out the “trail of breadcrumbs” explaining exactly how the particular artifacts we see arose, to connect the patterns on the map with what was going on in the data.

(d) Fake-data simulation to understand scenarios where the MRP approach could fail. As noted in point 2 above, there are legitimate concerns about the use of any model-based approach to draw inferences for 3000 counties from 2000 (or even 20,000 or 200,000) respondents. One way to get a sense of potential problems here is to construct some fake-data worlds in which the model-based estimates will fail.
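To make idea (d) concrete, here is a toy fake-data world, sketched in plain Python. It is not MRP: the "model" is just one national mean per demographic group, poststratified to counties. Everything here (two groups, the group means, the county effects, the function name `simulate`) is an assumption chosen for illustration. The point is only to compare a world where demographics explain everything with one where counties differ in ways the predictors cannot see.

```python
import math
import random


def simulate(county_sd, n_counties=300, n_respondents=2000, seed=1):
    """Return the RMSE of model-based county estimates in one fake world."""
    rng = random.Random(seed)
    group_mean = {0: -0.5, 1: 0.5}  # assumed true mean outcome by group

    # Each county: share of residents in group 1, plus a county effect
    # that the demographics-only model cannot see.
    counties = [(rng.random(), rng.gauss(0, county_sd)) for _ in range(n_counties)]
    truth = [
        share * group_mean[1] + (1 - share) * group_mean[0] + effect
        for share, effect in counties
    ]

    # National survey: each respondent comes from a randomly chosen county.
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n_respondents):
        share, effect = rng.choice(counties)
        g = 1 if rng.random() < share else 0
        sums[g] += group_mean[g] + effect + rng.gauss(0, 1)
        counts[g] += 1

    # "Model": one national mean per group, then poststratify to counties.
    fitted = {g: sums[g] / counts[g] for g in (0, 1)}
    estimates = [share * fitted[1] + (1 - share) * fitted[0] for share, _ in counties]

    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared_errors) / n_counties)


rmse_model_right = simulate(county_sd=0.0)  # demographics explain everything
rmse_model_wrong = simulate(county_sd=1.0)  # large unmodeled county effects
```

In the first world the county estimates should be close to the truth even with only 2000 respondents. In the second, the error should be roughly as large as the county effects themselves, and collecting more respondents would not fix that unless the model changed.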

OK, so four research directions here. My inclination is to start with (b) and (d) because I’m kind of intimidated by the demographic classifications in the voter file, so I’d rather just consider them as a black box and try to fix them indirectly, rather than to model and understand them. Along similar lines, it seems to me that solving (b) and (d) will give us general tools that can be used in many other adjustment problems in sampling and causal inference. That said, (a) is appealing because it’s all about doing things right, and it could have real impact on future studies using the voter file, and (c) would be an example of building bridges between different models in statistical workflow, which is an idea I’ve talked about a lot recently, so I’d like to see that too.

“Heckman curve” update: The data don’t seem to support the claim that human capital investments are most effective when targeted at younger ages.

David Rea and Tony Burton write:

The Heckman Curve describes the rate of return to public investments in human capital for the disadvantaged as rapidly diminishing with age. Investments early in the life course are characterised as providing significantly higher rates of return compared to investments targeted at young people and adults. This paper uses the Washington State Institute for Public Policy dataset of program benefit cost ratios to assess if there is a Heckman Curve relationship between program rates of return and recipient age. The data does not support the claim that social policy programs targeted early in the life course have the largest returns, or that the benefits of adult programs are less than the cost of intervention.

Here’s the conceptual version of the curve, from a paper published by economist Heckman in 2006:

This graph looks pretty authoritative but of course it’s not directly data-based.

As Rea and Burton explain, the curve makes some sense:

Underpinning the Heckman Curve is a comprehensive theory of skills that encompass all forms of human capability including physical and mental health . . .

• skills represent human capabilities that are able to generate outcomes for the individual and society;

• skills are multiple in nature and cover not only intelligence, but also non cognitive skills, and health (Heckman and Corbin, 2016);

• non cognitive skills or behavioural attributes such as conscientiousness, openness to experience, extraversion, agreeableness and emotional stability are particularly influential on a range of outcomes, and many of these are acquired in early childhood;

• early skill formation provides a platform for further subsequent skill accumulation . . .

• families and individuals invest in the costly process of building skills; and

• disadvantaged families do not invest sufficiently in their children because of information problems rather than limited economic resources or capital constraints (Heckman, 2007; Cunha et al., 2010; Heckman and Mosso, 2015).

Early intervention creates higher returns because of a longer payoff over which to generate returns.

But the evidence is not so clear. Rea and Burton write:

The original papers that introduced the Heckman Curve cited evidence on the relative return of human capital interventions across early childhood education, schooling, programs for at-risk youth, university and active employment and training programs (Heckman, 1999).

I’m concerned about these all being massive overestimates because of the statistical significance filter (see for example section 2.1 here or my earlier post here). The researchers have every motivation to exaggerate the effects of these interventions, and they’re using statistical methods that produce exaggerated estimates. Bad combination.
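The statistical significance filter is easy to see in a quick simulation. The sketch below assumes a true effect of 0.5 measured with standard error 1 (both numbers made up), keeps only the estimates that reach the conventional 5% threshold, and compares their average magnitude to the truth.

```python
import random
import statistics

rng = random.Random(0)

TRUE_EFFECT = 0.5  # assumed true effect, in arbitrary units
SE = 1.0           # assumed standard error of each study's estimate

# Many hypothetical studies of the same effect.
estimates = [rng.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]

# Keep only the "statistically significant" ones.
significant = [e for e in estimates if abs(e) > 1.96 * SE]

share_significant = len(significant) / len(estimates)
mean_magnitude = statistics.mean([abs(e) for e in significant])
exaggeration_ratio = mean_magnitude / TRUE_EFFECT
```

Under these assumptions only a small fraction of studies clear the threshold, and those that do overstate the effect several-fold on average.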

Rea and Burton continue:

A more recent review by Heckman and colleagues is contained in an OECD report Fostering and Measuring Skills: Improving Cognitive and Non-Cognitive Skills to Promote Lifetime Success (Kautz et al., 2014). . . . Overall 27 different interventions were reviewed . . . twelve had benefit cost ratios reported . . . Consistent with the Heckman Curve, programs targeted to children under five have an average benefit cost ratio of around 7, while those targeted at older ages have an average benefit cost ratio of just under 2.


This result is however heavily influenced by the inclusion of the Perry Preschool programme and the Abecedarian Project. These studies are somewhat controversial in the wider literature . . . Many researchers argue that the Perry Preschool programme and the Abecedarian Project do not provide a reliable guide to the likely impacts of early childhood education in a modern context . . .

Also the statistical significance filter. A defender of those studies might argue that these biases don’t matter because they could be occurring for all studies, not just early childhood interventions. But these biases can be huge, and in general it’s a mistake to ignore huge biases in the vague hope that they may be canceling out.


The data on programs targeted at older ages do not appear to be entirely consistent with the Heckman Curve. In particular the National Guard Challenge program and the Canadian Self-Sufficiency Project provide examples of interventions targeted at older age groups which have returns that are larger than the cost of funds.

Overall the programs in the OECD report represent only a small sample of the human capital interventions with well measured program returns . . . many rigorously studied and well known interventions are not included.

So Rea and Burton decide to perform a meta-analysis:

In order to assess the Heckman Curve we analyse a large dataset of program benefit cost ratios developed by the Washington State Institute for Public Policy.

Since the 1980s the Washington State Institute for Public Policy has focused on evidence-based policies and programs with the aim of providing state policymakers with advice about how to make best use of taxpayer funds. The Institute’s database covers programs in a wide range of areas including child welfare, mental health, juvenile and adult justice, substance abuse, healthcare, higher education and the labour market. . . .

The August 2017 update provides estimates of the benefit cost ratios for 314 interventions. . . . The programs also span the life course with 10% of the interventions being aimed at children 5 years and under.

And here’s what they find:

Wow, that’s one ugly graph! Can’t you do better than that? I also don’t really know what to do with these numbers. Benefit-cost ratios of 90! That’s the kind of thing you see with, what, a plan to hire more IRS auditors? I guess what I’m saying is that I don’t know which of these dots I can really trust, which is a problem with a lot of meta-analyses (see for example here).

To put it another way: Given what I see in Rea and Burton’s paper, I’m prepared to agree with their claim that the data don’t support the diminishing-returns “Heckman curve”: The graph from that 2006 paper, reproduced at the top of this post, is just a story that’s not backed up by what is known. At the same time, I don’t know how seriously to take the above scatterplot, as many or even most of the dots there could be terrible estimates. I just don’t know.

In their conclusion, Rea and Burton say that their results do not “call into question the more general theory of human capital and skills advanced by Heckman and colleagues.” They express the view that:

Heckman’s insights about the nature of human capital are essentially correct. Early child development is a critical stage of human development, partly because it provides a foundation for the future acquisition of health, cognitive and non-cognitive skills. Moreover the impact of an effective intervention in childhood has a longer period of time over which any benefits can accumulate.

Why, then, do the diminishing returns of interventions not show up in the data? Rea and Burton write:

The importance of early child development and the nature of human capital are not the only factors that influence the rate of return for any particular intervention. Overall the extent to which a social policy investment gives a good rate of return depends on the assumed discount rate, the cost of the intervention, the intervention’s ability to impact on outcomes, the time profile of impacts over the life course, and the value of the impacts.

Some interventions may be low cost which will make even modest impacts cost effective.

The extent of targeting and the deadweight loss of the intervention are also important. Some interventions may be well targeted to those who need the intervention and hence offer a good rate of return. Other interventions may be less well targeted and require investment in those who do not require the intervention. A potential example of this might be interventions aimed at reducing youth offending. While early prevention programs may be effective at reducing offending, they are not necessarily more cost effective than later interventions if they require considerable investment in those who are not at risk.

Another consideration is the proximity of an intervention to the time where there are the largest potential benefits. . . .

Another factor is that the technology or active ingredients of interventions differ, and it is not clear that those targeted to younger ages will always be more effective. . . .

In general there are many circumstances where interventions to deliver ‘cures’ can be as cost effective as ‘prevention’. Many aspects of life have a degree of unpredictability and interventions targeted at those who experience an adverse event (such as healthcare in response to a car accident) can plausibly be as cost effective as prevention efforts.

These are all interesting points.

P.S. I sent Rea some of these comments, and he wrote:

I had previously read your paper ‘The failure of the null hypothesis’, and remember being struck by the para:

The current system of scientific publication encourages the publication of speculative papers making dramatic claims based on small, noisy experiments. Why is this? To start with, the most prestigious general-interest journals—Science, Nature, and PNAS—require papers to be short, and they strongly favor claims of originality and grand importance….

I had thought at the time that this applied to the original Heckman paper in Science.

I think we agree with your point about not being able to draw any positive conclusions from our data. The paper is meant to be more in the spirit of ‘here is an important claim that has been highly influential in public policy, but when we look at what we believe is a carefully constructed dataset, we don’t see any support for the claim’. We probably should frame it more about replication and an invitation for other researchers to try and do something similar using other datasets.

Your point about the underlying data drawing on effect sizes that are likely biased is something we need to reflect in the paper. But in defense of the approach, my assumption is that well conducted meta analysis (which Washington State Institute for Public Policy use to calculate their overall impacts) should moderate the extent of the bias. Searching for unpublished research, and including all robust studies irrespective of the magnitude and significance of the impact, and weighting by each study’s precision, should overcome some of the problems? In their meta analysis, Washington State also reduce a study’s contribution to the overall effect size if there is evidence of a conflict of interest (the researcher was also the program developer).

On the issue of the large effect sizes from the early childhood education experiments (Perry PreSchool and Abecedarian Project), the recent meta analysis of high quality studies by McCoy et al. (2017) was helpful for us.

Generally the later studies have shown smaller impacts (possibly because control groups are now not so deprived of other services). Here is one of their lovely forest plots on grade retention. I am just about to go and see if they did any analysis of publication bias.

Treatment interactions can be hard to estimate from data.

Brendan Nyhan writes:

Per #3 here, just want to make sure you saw the Coppock Leeper Mullinix paper indicating treatment effect heterogeneity is rare.

My reply:

I guess it depends on what is being studied. In the world of evolutionary psychology etc., interactions are typically claimed to be larger than main effects (for example, that claim about fat arms and redistribution). It is possible that in the real world, interactions are not so large.

To step back a moment, I don’t think it’s quite right to say that treatment effect heterogeneity is “rare.” All treatment effects vary. So the question is not, Is there treatment effect heterogeneity?, but rather, How large is treatment effect heterogeneity? In practice, heterogeneity can be hard to estimate, so all we can say is that, whatever variation there is in the treatment effects, we can’t estimate it well from the data alone.

In real life, when people design treatments, they need to figure out all sorts of details. Presumably the details matter. These details are treatment interactions, and they’re typically designed entirely qualitatively, which makes sense given the difficulty of estimating their effects from data.

“The Long-Run Effects of America’s First Paid Maternity Leave Policy”: I need that trail of breadcrumbs.

Tyler Cowen links to a research article by Brenden Timpe, “The Long-Run Effects of America’s First Paid Maternity Leave Policy,” that begins as follows:

This paper provides the first evidence of the effect of a U.S. paid maternity leave policy on the long-run outcomes of children. I exploit variation in access to paid leave that was created by long-standing state differences in short-term disability insurance coverage and the state-level roll-out of laws banning discrimination against pregnant workers in the 1960s and 1970s. While the availability of these benefits sparked a substantial expansion of leave-taking by new mothers, it also came with a cost. The enactment of paid leave led to shifts in labor supply and demand that decreased wages and family income among women of child-bearing age. In addition, the first generation of children born to mothers with access to maternity leave benefits were 1.9 percent less likely to attend college and 3.1 percent less likely to earn a four-year college degree.

I was curious so I clicked through and took a look. It seems that the key comparisons are at the state-year level, with some policy changes happening in different states at different years. So what I’d like to see are some time series for individual states and some scatterplots of state-years. Some other graphs, too, although I’m not quite sure what. The basic idea is that this is an observational study in which the treatment is some policy change, so we’re comparing state-years with and without this treatment; I’d like to see a scatterplot of the outcome vs. some pre-treatment measure, with different symbols for treatment and control cases. As it is, I don’t really know what to make of the results, what with all the processing that has gone on between the data and the estimate.

In general I am skeptical about results such as given in the above abstract because there are so many things that can affect college attendance. Trends can vary by state, and this sort of analysis will simply pick up whatever correlation there might be, between state-level trends and the implementation of policies. There are lots of reasons to think that the states where a given policy would be more or less likely to be implemented, happen to be states where trends in college attendance are higher or lower. This is all kind of vague because I’m not quite sure what is going on in the data—I didn’t notice a list of which states were doing what. My general point is that to understand and trust such an analysis I need a “trail of bread crumbs” connecting data, theory, and conclusions. The theory in the paper, having to do with economic incentives and indirect effects, seemed a bit farfetched to me but not impossible—but it’s not enough for me to just have the theory and the regression table; I really need to understand where in the data the result is coming from. As it is, this just seems like two state-level variables that happen to be correlated. There might be something here; I just can’t say.

P.S. Cowen’s commenters express lots of skepticism about this claim. I see this skepticism as a good sign, a positive aspect of the recent statistical crisis in science that people do not automatically accept this sort of quantitative claim, even when it is endorsed by a trusted intermediary. I suspect that Cowen too is happy that his readers read him critically and don’t believe everything he posts!

What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?

The punch line

“Your readers are my target audience. I really want to convince them that it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.”

The background

It started with an email from Erik van Zwet, who wrote:

In 2013, you wrote about the hidden dangers of non-informative priors:

Finally, the simplest example yet, and my new favorite: we assign a non-informative prior to a continuous parameter theta. We now observe data, y ~ N(theta, 1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 0.84 that theta>0. I don’t believe that 0.84. I think (in general) that it is too high.

I agree – at least if theta is a regression coefficient (other than the intercept) in the context of the life sciences.

In this paper [which has since been published in a journal], I propose that a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator. The posterior is the normal distribution with mean y/2 and standard deviation SE/sqrt(2). So that’s a default Edlin factor of 1/2. I base my proposal on two very different arguments:

1. The uniform (flat) prior is considered by many to be non-informative because of certain invariance properties. However, I argue that those properties break down when we reparameterize in terms of the sign and the magnitude of theta. Now, in my experience, the primary goal of most regression analyses is to study the direction of some association. That is, we are interested primarily in the sign of theta. Under the prior I’m proposing, P(theta > 0 | y) has the standard uniform distribution (Theorem 1 in the paper). In that sense, the prior could be considered to be non-informative for inference about the sign of theta.

2. The fact that we are considering a regression coefficient (other than the intercept) in the context of the life sciences is actually prior information. Now, almost all research in the life sciences is listed in the MEDLINE (PubMed) database. In the absence of any additional prior information, we can consider papers in MEDLINE that have regression coefficients to be exchangeable. I used a sample of 50 MEDLINE papers to estimate the prior and found the normal distribution with mean zero and standard deviation 1.28*SE. The data and my analysis are available here.

The two arguments are very different, so it’s nice that they yield fairly similar results. Since published effects tend to be inflated, I think the 1.28 is somewhat overestimated. So, I end up recommending the N(0,SE^2) as default prior.

I think it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.
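The arithmetic in the email can be checked with a minimal Python sketch. It assumes the estimate y is normally distributed around theta with a known standard error. The helper name `edlin_half_posterior` is made up for this illustration and does not come from van Zwet's paper.

```python
from math import sqrt
from statistics import NormalDist

def edlin_half_posterior(y, se):
    """Posterior for theta under the prior N(0, se^2) and y ~ N(theta, se^2).

    Hypothetical helper for illustration. Equal prior and data variances
    give posterior mean y/2 and posterior sd se/sqrt(2).
    """
    return NormalDist(mu=y / 2, sigma=se / sqrt(2))

y, se = 1.0, 1.0

# Flat prior: the posterior is N(y, se^2), so P(theta > 0 | y) = Phi(y / se).
p_flat = 1 - NormalDist(mu=y, sigma=se).cdf(0)

# Proposed default prior: the posterior is N(y/2, se^2/2).
post = edlin_half_posterior(y, se)
p_default = 1 - post.cdf(0)

print(f"flat prior: P(theta > 0 | y) = {p_flat:.2f}")
print(f"N(0, se^2): mean = {post.mean:.2f}, sd = {post.stdev:.3f}, "
      f"P(theta > 0 | y) = {p_default:.2f}")

# Uniformity check: under the prior, y is marginally N(0, 2 se^2), so at the
# q-th marginal quantile of y the posterior probability of theta > 0 equals q.
marginal = NormalDist(mu=0, sigma=sqrt(2) * se)
for q in (0.1, 0.5, 0.9):
    y_q = marginal.inv_cdf(q)
    print(q, round(1 - edlin_half_posterior(y_q, se).cdf(0), 3))
```

With y = 1 and SE = 1, the flat prior gives the 0.84 quoted above, while the proposed default gives a noticeably smaller probability that theta is positive.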

Hmmm . . . one way to think about this idea is to consider where it doesn’t make sense. You write, “a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator.” Let’s consider two cases where this default won’t work:

– The task is to estimate someone’s weight with one measurement on a scale where the measurements have standard deviation 1 pound, and you observe 150 pounds. You’re not going to want to partially pool that all the way to 75 pounds. The point here, I suppose, is that the goal of the measurement is not to estimate the sign of the effect. But we could do the same reasoning where the goal was to estimate the sign. For example, I weigh you, then I weigh you again a year later. I’m interested in seeing if you gained or lost weight. The measurement was 150 pounds last year and 140 pounds this year. The classical estimate of the difference of the two measurements is 10 +/- 1.4. Would I want to partially pool that all the way to 5? Maybe, in that these are just single measurements and your weight can fluctuate. But that can’t be the motivation here, because we could just as well take 100 measurements at one time and 100 measurements a year later, so now maybe your average is, say, 153 pounds last year and 143 pounds this year: an estimated change of 10 +/- 0.14. We certainly wouldn’t want to use a super-precise prior with mean 0 and sd 0.14 here!

– The famous beauty-and-sex-ratio study where the difference in probability of girl birth, comparing children of beautiful and non-beautiful parents, was estimated from some data to be 8 percentage points +/- 3 percentage points. In this case, an Edlin factor of 0.5 is not enough. Pooling down to 4 percentage points is not enough pooling. A better estimate of the difference would be 0 percentage points, or 0.01 percentage points, or something like that.

I guess what I’m getting at is that the balance between prior and data changes as we get more information, so I don’t see how a fixed amount of partial pooling can work.
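The weight example can be worked through in a few lines of Python. This sketch compares a prior whose sd is tied to the standard error (so the Edlin factor is always 1/2) with a prior whose sd stays fixed as data accumulate. The fixed prior sd of 10 pounds is a hypothetical value chosen only to illustrate the contrast.

```python
from math import sqrt

def shrinkage_factor(prior_sd, se):
    """Multiplier applied to the raw estimate under a N(0, prior_sd^2) prior."""
    return prior_sd**2 / (prior_sd**2 + se**2)

# Hypothetical prior sd for a one-year weight change, in pounds.
# This number is an assumption for illustration, not from the post.
FIXED_PRIOR_SD = 10.0

change = 10.0  # observed change in pounds
for n in (1, 100):
    # se of a difference of two means, each from n readings with sd 1 pound
    se = sqrt(2) / sqrt(n)
    scaled = change * shrinkage_factor(se, se)             # prior sd tied to se
    fixed = change * shrinkage_factor(FIXED_PRIOR_SD, se)  # prior sd held fixed
    print(f"n={n:3d}  se={se:.2f}  prior sd = se: {scaled:.2f}  "
          f"fixed prior: {fixed:.2f}")
```

Under the se-scaled prior the estimate stays at 5 pounds no matter how many measurements are taken, whereas under the fixed prior the data dominate more and more as the standard error shrinks.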

That said, maybe I’m missing something here. After all, a default can never cover all cases, and the current default of no partial pooling or flat prior has all sorts of problems. So we can think more about this.

P.S. In the months since I wrote the above post, Zwet sent along further thoughts:

Since I emailed you in the fall, I’ve continued thinking about default priors. I have a clearer idea now about what I’m trying to do:

In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular. Second, the literature is not always reliable because of publication bias and such. Third, it is generally unclear what the scope of the meta-analysis should be.

Now, researchers often want to be “objective” or “non-informative”. I believe this can be accomplished by performing a meta-analysis with a very wide scope. One might think that this would lead to very diffuse priors, but that turns out not to be the case! Using a very wide scope to obtain prior information also means that the same meta-analysis can be recycled in many situations.

The problem of publication bias in the literature remains, but there may be ways to handle that. In the paper I sent earlier, I used p-values from univariable regressions that were used to “screen” variables for a multivariable model. I figure that those p-values should be largely unaffected by selection on significance, simply because that selection is still to be done!

More recently, I’ve used a set of “honest” p-values that were generated by the Open Science Collaboration in their big replication project in psychology (Science, 2015). I’ve estimated a prior and then computed type S and M errors. I attach the results together with the (publicly available) data. The results are also here.

Zwet’s new paper is called Default prior for psychological research, and it comes with two data files, here and here.

It’s an appealing idea, and in practice it should be better than the current default Edlin factor of 1 (that is, no partial pooling toward zero at all). And I’ve talked a lot about constructing default priors based on empirical information, so it’s great to see someone actually doing it. Still, I have some reservations about the specific recommendations, for the reasons expressed in my response to Zwet above. Like him, I’m curious about your thoughts on this.

I also wrote something on this in our Prior Choice Recommendations wiki:

Default prior for treatment effects scaled based on the standard error of the estimate

Erik van Zwet suggests an Edlin factor of 1/2. Assuming that the existing or published estimate is unbiased with known standard error, this corresponds to a default prior that is normal with mean 0 and sd equal to the standard error of the data estimate. This can’t be right–for any given experiment, as you add data, the standard error should decline, so this would suggest that the prior depends on sample size. (On the other hand, the prior can often only be understood in the context of the likelihood; http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf, so we can’t rule out an improper or data-dependent prior out of hand.)

Anyway, the discussion with Zwet got me thinking. If I see an estimate that’s 1 se from 0, I tend not to take it seriously; I partially pool it toward 0. So if the data estimate is 1 se from 0, then, sure, the normal(0, se) prior seems reasonable as it pools the estimate halfway to 0. But if the data estimate is, say, 4 se’s from zero, I wouldn’t want to pool it halfway: at this point, zero is not so relevant. This suggests something like a t prior. Again, though, the big idea here is to scale the prior based on the standard error of the estimate.

Another way of looking at this prior is as a formalization of what we do when we see estimates of treatment effects. If the estimate is only 1 standard error away from zero, we don’t take it too seriously: sure, we take it as some evidence of a positive effect, but far from conclusive evidence–we partially pool it toward zero. If the estimate is 2 standard errors away from zero, we still think the estimate has a bit of luck to it–just think of the way in which researchers, when their estimate is 2 se’s from zero, (a) get excited and (b) want to stop the experiment right there so as not to lose the magic–hence some partial pooling toward zero is still in order. And if the estimate is 4 se’s from zero, we just tend to take it as is.
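To see how a heavy-tailed prior behaves, here is a rough Python sketch that computes the posterior mean numerically. The specific choices are assumptions made for illustration only: a t prior with 1 degree of freedom (a Cauchy) with scale equal to one standard error, and a plain Riemann sum on a grid instead of a proper sampler.

```python
from math import exp, pi

def posterior_mean_cauchy_prior(z, scale=1.0, half_width=30.0, steps=6001):
    """Posterior mean of theta (in se units) given z ~ N(theta, 1) and a
    Cauchy(0, scale) prior, approximated by a Riemann sum on a grid."""
    h = 2 * half_width / (steps - 1)
    num = den = 0.0
    for i in range(steps):
        theta = -half_width + i * h
        prior = 1.0 / (pi * scale * (1.0 + (theta / scale) ** 2))
        like = exp(-0.5 * (z - theta) ** 2)   # normal likelihood, constants cancel
        num += theta * prior * like
        den += prior * like
    return num / den

for z in (1.0, 2.0, 4.0):
    m = posterior_mean_cauchy_prior(z)
    print(f"z = {z:.0f}: posterior mean = {m:.2f}, Edlin factor = {m / z:.2f}")
```

The implied Edlin factor grows with z, so an estimate 1 se from zero is pooled much more strongly than one 4 se's from zero, which is the qualitative behavior described above.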

I sent some of the above to Zwet, who replied:

I [Zwet] proposed that default Edlin factor of 1/2 only when the estimate is less than 3 se’s away from zero (or rather, p>0.001). I used a mixture of two zero-mean normals; one with sd=0.68 and the other with sd=3.94. I’m quite happy with the fit. The shrinkage is a little more than 1/2 when the estimate is close to zero, and disappears gradually for larger estimates. It’s in the data! You can see it when you do a “wide scope” meta-analysis.
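The shrinkage implied by a two-component normal mixture prior can be sketched in Python. The component sds (0.68 and 3.94, in units of the standard error) are taken from the reply above. The mixture weights are not stated there, so equal weights are assumed here purely for illustration, and the numbers printed depend on that assumption.

```python
from math import exp, sqrt

# Component sds (in se units) are from the email. The weights are NOT given
# there; equal weights are an assumption made only for this illustration.
COMPONENT_SDS = (0.68, 3.94)
ASSUMED_WEIGHTS = (0.5, 0.5)

def mixture_posterior_mean(z, sds=COMPONENT_SDS, weights=ASSUMED_WEIGHTS):
    """Posterior mean of theta given z ~ N(theta, 1) and a zero-mean
    mixture-of-normals prior."""
    post_weights, comp_means = [], []
    for w, tau in zip(weights, sds):
        marg_var = 1.0 + tau**2  # marginal variance of z under this component
        # prior weight times marginal density of z (the 1/sqrt(2*pi) cancels)
        post_weights.append(w * exp(-0.5 * z**2 / marg_var) / sqrt(marg_var))
        # conjugate normal shrinkage within the component
        comp_means.append(z * tau**2 / marg_var)
    total = sum(post_weights)
    return sum(pw * m for pw, m in zip(post_weights, comp_means)) / total

for z in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"z = {z:.1f}: Edlin factor = {mixture_posterior_mean(z) / z:.2f}")
```

Under these assumed weights, small estimates are multiplied by a factor a bit below 1/2, and the factor climbs toward 1 as the estimate moves several standard errors from zero.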

Thinking about “Abandon statistical significance,” p-values, etc.

We had some good discussion the other day following up on the article, “Retire Statistical Significance,” by Valentin Amrhein, Sander Greenland, and Blake McShane.

I have a lot to say, and it’s hard to put it all together, in part because my collaborators and I have said much of it already, in various forms.

For now I thought I’d start by listing my different thoughts in a short post while I figure out how best to organize all of this.


There’s also the problem that these discussions can easily transform into debates. After proposing an idea and seeing objections, it’s natural to then want to respond to those objections, then the responders respond, etc., and the original goals are lost.

So, before going on, some goals:

– Better statistical analyses. Learning from data in a particular study.

– Improving the flow of science. More prominence to reproducible findings, less time wasted chasing noise.

– Improving scientific practice. Changing incentives to motivate good science and demotivate junk science.

Null hypothesis testing, p-values, and statistical significance represent one approach toward attaining the above goals. I don’t think this approach works so well anymore (whether it did in the past is another question), but the point is to keep these goals in mind.

Some topics to address

1. Is this all a waste of time?

The first question to ask is, why am I writing about this at all? Paul Meehl said it all fifty years ago, and people have been rediscovering the problems with statistical-significance reasoning every decade since, for example this still-readable paper from 1985, The Religion of Statistics as Practiced in Medical Journals, by David Salsburg, which Richard Juster sent me the other day. And, even accepting the argument that the battle is still worth fighting, why don’t I just leave this in the capable hands of Amrhein, Greenland, McShane, and various others who are evidently willing to put in the effort?

The short answer is I think I have something extra to contribute. So far, my colleagues and I have come up with some new methods and new conceptualizations: I’m thinking of type M and type S errors, the garden of forking paths, the backpack fallacy, the secret weapon, “the difference between . . .,” the use of multilevel models to resolve the multiple comparisons problem, etc. We haven’t been just standing on the street corner the past twenty years, screaming “Down with p-values”; we’ve been reframing the problem in interesting and useful ways.

How did we make these contributions? Not out of nowhere, but as a byproduct of working on applied problems, trying to work things out from first principles, and, yes, reading blog comments and answering questions from randos on the internet. When John Carlin and I write an article like this or this, for example, we’re not just expressing our views clearly and spreading the good word. We’re also figuring out much of it as we go along. So, when I see misunderstanding about statistics and try to clean it up, I’m learning too.

2. Paradigmatic examples

It could be a good idea to list the different sorts of examples that are used in these discussions. Here are a few that keep coming up:

– The clinical trial comparing a new drug to the standard treatment.

– “Psychological Science” or “PNAS”-style headline-grabbing unreplicable noise mining.

– Gene-association studies.

– Regressions for causal inference from observational data.

– Studies with multiple outcomes.

– Descriptive studies such as in Red State Blue State.

I think we can come up with more of these. My point here is that different methods can work for different examples, so I think it makes sense to put a bunch of these cases in one place so the argument doesn’t jump around so much. We can also include some examples where p-values and statistical significance don’t seem to come up at all. For instance, MRP to estimate state-level opinion from national surveys: nobody’s out there testing which states are statistically significantly different from others. Another example is item-response or ideal-point modeling in psychometrics or political science: again, these are typically framed as problems of estimation, not testing.

3. Statistics and computer science as social sciences

We’re used to statistical methods being controversial, with leading statisticians throwing polemics at each other regarding issues that are both theoretically fundamental and also core practical concerns. The fighting’s been going on, in different ways, for about a hundred years!

But here’s a question. Why is it that statistics is so controversial? The math is just math, no controversy there. And the issues aren’t political, at least not in a left-right sense. Statistical controversies don’t link up in any natural way to political disputes about business and labor, or racism, or war, or whatever.

In its deep and persistent controversies, statistics looks less like the hard sciences and more like the social sciences. Which, again, seems strange to me, given that statistics is a form of engineering, or applied math.

Maybe the appropriate point of comparison here is not economics or sociology, which have deep conflicts based on human values, but rather computer science. Computer scientists can get pretty worked up about technical issues which to me seem unresolvable: the best way to structure a programming language, for example. I don’t like to label these disputes as “religious wars,” but the point is that the level of passion often seems pretty high, in comparison to the dry nature of the subject matter.

I’m not saying that passion is wrong! Existing statistical methods have done their part to slow down medical research: lives are at stake. Still, stepping back, the passion in statistical debates about p-values seems a bit more distanced from the ultimate human object of concern, compared to, say, the passion in debates about economic redistribution or racism.

To return to the point about statistics and computer science: These two fields fundamentally are about how they are used. A statistical method or a computer ultimately connects to a human: someone has to decide what to do. So they both are social sciences, in a way that physics, chemistry, or biology are not, or not as much.

4. Different levels of argument

The direct argument in favor of the use of statistical significance and p-values is that it’s desirable to use statistical procedures with so-called type 1 error control. I don’t buy that argument because I think that selecting on statistical significance yields noisy conclusions. To continue the discussion further, I think it makes sense to consider particular examples, or classes of examples (see item 2 above). They talk about error control, I talk about noise, but both these concepts are abstractions, and ultimately it has to come down to reality.
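One way to make "selecting on statistical significance yields noisy conclusions" concrete is to compute what happens to estimates that pass the significance filter. The Python sketch below assumes an unbiased, normally distributed estimate with known standard error. The true effect of 0.5 standard errors is a hypothetical value picked to represent a noisy study, and the function name is invented for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def significance_filter(true_effect, se, alpha=0.05):
    """Power, type S error rate, and exaggeration ratio (type M error) for an
    estimate that is reported only when |estimate| / se exceeds the two-sided
    critical value. Assumes true_effect > 0."""
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)
    t = true_effect / se                 # true effect in se units
    p_hi = 1 - STD_NORMAL.cdf(c - t)     # significant with the correct sign
    p_lo = STD_NORMAL.cdf(-c - t)        # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|z|; |z| > c] for z ~ N(t, 1), from truncated-normal moments
    abs_mean = t * (p_hi - p_lo) + STD_NORMAL.pdf(c - t) + STD_NORMAL.pdf(c + t)
    exaggeration = (abs_mean / power) / t
    return power, type_s, exaggeration

# Hypothetical noisy study: true effect is half a standard error.
power, type_s, exaggeration = significance_filter(true_effect=0.5, se=1.0)
print(f"power = {power:.3f}, type S rate = {type_s:.3f}, "
      f"exaggeration ratio = {exaggeration:.1f}")
```

In this setting the estimates that reach significance overstate the true effect several times over and sometimes have the wrong sign, even though the procedure controls type 1 error as advertised.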

There are also indirect arguments. For example: 100 million p-value users can’t be wrong. Or: Abandoning statistical significance might be a great idea, but nobody will do it. I’d prefer to have the discussion at the more direct level of what’s a better procedure to use, with the understanding that it might take a while for better options to become common practice.

5. “Statistical significance” as a lexicographic decision rule

This is discussed in detail in my article with Blake McShane, David Gal, Christian Robert, and Jennifer Tackett:

[In much of current scientific practice], statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap 2010; Bem 2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise . . . We propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

6. Confirmationist and falsificationist paradigms of science

I wrote about this a few years ago:

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

It is my impression that in the vast majority of cases, “statistical significance” is used in a confirmationist way. To put it another way: the problem is not just with the p-value, it’s with the mistaken idea that falsifying a straw-man null hypothesis is evidence in favor of someone’s pet theory.

7. But what if we need to make an up-or-down decision?

This comes up a lot. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically-fed, free-range chicken. This might be a good idea for any of us individually or with small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Regarding the economics, the point that we made in section 4.4 of our paper is that decisions are not currently made in an automatic way. Papers are reviewed by hand, one at a time.

As Peter Dorman puts it:

The most important determinants of the dispositive power of statistical evidence should be its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.

To put it another way, there are two issues here: (a) the potential benefits of an automatic screening or decision rule, and (b) using a p-value (null-hypothesis tail area probability) for such a rule. We argue against using screening rules (or, at least, for using them much less often). But in the cases where screening rules are desired, we see no reason to use p-values for this.

8. What should we do instead?

To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.

As Eric Loken and I put it:

Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.

For a couple more examples, consider the two studies discussed in section 2 of this article. For both of them, nothing is gained and much is lost by passing results through the statistical significance filter.

Again, the use of standard errors and uncertainty intervals is not just significance testing in another form. The point is to use these uncertainties as a way of contextualizing estimates, not to declare things as real or not.
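As a small illustration of reporting an estimate with its uncertainty and no significance verdict, here is a Python sketch for a simple two-group comparison. The data are made up, the function name is hypothetical, and the intervals use the usual normal approximation.

```python
from math import sqrt
from statistics import mean, stdev

def difference_with_se(treated, control):
    """Difference in means and its standard error for two independent groups."""
    est = mean(treated) - mean(control)
    se = sqrt(stdev(treated) ** 2 / len(treated)
              + stdev(control) ** 2 / len(control))
    return est, se

# Made-up measurements, for illustration only.
treated = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
control = [4.9, 5.2, 5.8, 4.4, 5.5, 5.0]

est, se = difference_with_se(treated, control)
print(f"estimated difference: {est:.2f} (se {se:.2f})")
print(f"rough 50% interval: [{est - 0.67 * se:.2f}, {est + 0.67 * se:.2f}]")
print(f"rough 95% interval: [{est - 2 * se:.2f}, {est + 2 * se:.2f}]")
```

The output stops at the estimate and its intervals. Nothing is labeled "significant" or "not significant", and the reader is left to weigh the uncertainty alongside design, measurement, and prior evidence.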

The next step is to recognize multiplicity in your problem. Consider this paper, which contains many analyses but not a single p-value or even a confidence interval. We are able to assess uncertainty by displaying results from multiple polls. Yes, it is possible to have data with no structure at all (a simple comparison with no replications), and for these I’d just display averages, variation, and uncertainties. But this is rare, as such simple comparisons are typically part of a stream of results in a larger research project.

One can and should continue with multilevel models and other statistical methods that allow more systematic partial pooling of information from different sources, but the secret weapon is a good start.


My current plan is to write this all up as a long article, Unpacking the Statistical Significance Debate and the Replication Crisis, and put it on Arxiv. That could reach people who don’t feel like engaging with blogs.

In the meantime, I’d appreciate your comments and suggestions.

Impact of published research on behavior and avoidable fatalities

In a paper entitled, “Impact of published research on behavior and avoidable fatalities,” Addison Kramer, Alexandra Kirk, Faizaan Easton, and Bertram Hester write:

There has long been speculation of an “informational backfire effect,” whereby the publication of questionable scientific claims can lead to behavioral changes that are counterproductive in the aggregate. Concerns of informational backfire have been raised in many fields that feature an intersection of research and policy, including education, medicine, and nutrition, but it has been difficult to study this effect empirically because of confounding of the act of publication with the effects of the research ideas in question through other pathways. In the present paper we estimate the informational backfire effect using a unique identification strategy based on the timing of publication of high-profile articles in well-regarded scientific journals. Using measures of academic citation, traditional media mentions, and social media penetration, we show, first, that published claims backed by questionable research practices receive statistically significantly wider exposure, and, second, that this exposure leads to large and statistically significant aggregate behavioral changes, as measured by a regression discontinuity analysis. The importance of this finding can be seen using a case study in the domain of alcohol consumption, where we demonstrate that publication of research papers claiming a safe daily dose is linked to increased drinking and higher rates of drunk driving injuries and fatalities, with the largest proportional increases occurring in states with the highest levels of exposure to news media science and health reporting.

I don’t know how much to believe all this, as there are the usual difficulties of studying small effects using aggregate data—the needle-in-a-haystack problem—and I’d like to see the raw data. But in any case I wanted to share this with you, as it relates to various discussions we’ve had such as here, for example. Also this relates to general questions we’ve had regarding the larger effects of scientific research on our thoughts and behaviors.