Wanted: Statistical success stories

Bill Harris writes:

Sometime when you get a free moment, it might be great to publish a post that links to good, current exemplars of analyses. There’s a current discussion about RCTs on a program evaluation mailing list I monitor. I posted links to your power=0.06 post and your Type S and Type M post, but some still seem to think RCTs are the foundation. I can say “Read one of your books” or “Read this or that book,” or I could say “Peruse your blog for the last, oh, eight-ten years,” but either one requires a bunch of dedication. I could say “Read some Stan examples,” but those seem mostly focused on teaching Stan. Some published examples use priors you no longer recommend, as I recall. I think I’ve noticed a few models with writeups on your blog that really did begin to show how one can put together a useful analysis without getting into NHST and RCTs, but I don’t recall where they are.

Relatedly, Ramiro Barrantes-Reynolds writes:

I would be very interested in seeing more in your blog about research that does a good job in the areas that are most troublesome for you: measurement, noise, forking paths, etc.; or that addresses those aspects so as to make better inferences. I think after reading your blog I know what to look for to see when some investigator (or myself) is chasing noise (i.e., I have a sense of what NOT to do), but I am missing good examples to follow in order to do better research. I would consider myself a beginning statistician, so examples of research that is well done and addresses the issues of forking paths, measurement, etc., help. I think blog posts and the discussion that arises would be beneficial to the community.

So, two related questions. The first one’s about causal inference beyond simple analyses of randomized trials; the second is about examples of good measurement and inference in the context of forking paths.

My quick answer is that, yes, we do have examples in our books, and it doesn’t involve that much dedication to order them and take a look at the examples. I also have a bunch of examples here and here.

More specifically:

Causal inference without a randomized trial: Millennium villages, incumbency advantage (and again)

Measurement: penumbras, assays

Forking paths: Millennium villages, polarization

I guess the other suggestion is that we post on high-quality new work so we can all discuss, not just what makes bad work bad, but also what makes good work good. That makes sense. To start with, you can point me to some good stuff to post on.

No, it’s not correct to say that you can be 95% sure that the true value will be in the confidence interval

Hans van Maanen writes:

May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.
Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance they all contain the true value.

Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 98.7%-CI’s (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval does contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval does not contain the true value. Thus, once we see the interval, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they are all correct.
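To make the first point above concrete, here is a minimal simulation sketch (an illustration in Python; the tiny assumed true difference of 0.003 and the other inputs below are made up for this sketch, not from the original exchange). On average the intervals cover 95% of the time, but coverage conditional on the observed estimate is close to 100% or close to 0%:

```python
import numpy as np

# Minimal simulation of the girl-births example above (illustrative only):
# two groups of 200 births each, with an assumed tiny true difference.
rng = np.random.default_rng(1)
n = 200
true_diff = 0.003                       # assumed true difference, well under 0.01
p1, p2 = 0.485, 0.485 + true_diff
se = np.sqrt(0.5**2 / n + 0.5**2 / n)   # = 0.05, as in the text

sims = 100_000
est = (rng.binomial(n, p2, sims) - rng.binomial(n, p1, sims)) / n
covered = np.abs(est - true_diff) < 1.96 * se

print("overall coverage:", covered.mean())                                # about 0.95
print("coverage when |est| < 0.09:", covered[np.abs(est) < 0.09].mean())  # essentially 1
print("coverage when |est| > 0.11:", covered[np.abs(est) > 0.11].mean())  # essentially 0
```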

Claims about excess road deaths on “4/20” don’t add up

Sam Harper writes:

Since you’ve written about similar papers (that recent NRA study in NEJM, the birthday analysis) before and we linked to a few of your posts, I thought you might be interested in this recent blog post we wrote about a similar kind of study claiming that fatal motor vehicle crashes increase by 12% after 4:20pm on April 20th (an annual cannabis celebration…google it).

The post is by Harper and Adam Palayew, and it’s excellent. Here’s what they say:

A few weeks ago a short paper was published in a leading medical journal, JAMA Internal Medicine, suggesting that, over the 25 years from 1992-2016, excess cannabis consumption after 4:20pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after. Here is the key result from the paper:

In total, 1369 drivers were involved in fatal crashes after 4:20 PM on April 20 whereas 2453 drivers were in fatal crashes on control days during the same time intervals (corresponding to 7.1 and 6.4 drivers in fatal crashes per hour, respectively). The risk of a fatal crash was significantly higher on April 20 (relative risk, 1.12; 95% CI, 1.05-1.19; P = .001).
— Staples JA, Redelmeier DA. The April 20 Cannabis Celebration and Fatal Traffic Crashes in the United States. JAMA Int Med, Feb 18, 2018, p. E2

Naturally, this sparked (heh) considerable media interest, not only because p<.05 and the finding is “surprising”, but also because cannabis is a hot topic these days (and, of course, April 20th happens every year).

But how seriously should we take these findings? Harper and Palayew crunch the numbers:

If we try and back out some estimates of what might have to happen on 4/20 to generate a 12% increase in the national rate of fatal car crashes, it seems less and less plausible that the 4/20 effect is reliable or valid. Let’s give it a shot. . . .

Over the 25 year period [the authors of the linked paper] tally 1369 deaths on 4/20 and 2453 deaths on control days, which works out to average deaths on those days each year of 1369/25 ~ 55 on 4/20 and 2453/25/2 ~ 49 on control days, an average excess of about 6 deaths each year. If we use our estimates of post-1620h VMT above, that works out to around 55/2.5 = 22 fatal crashes per billion VMT on 4/20 vs. 49/2.5 = 19.6 on control days. . . .

If we don’t assume the relative risk changes on 4/20, just more people smoking, what proportion of the population would need to be driving while high to generate a rate of 22 per billion VMT? A little algebra tells us that to get to 22 we’d need to see something like . . . 15%! That’s nearly one-sixth of the population driving while high on 4/20 from 4:20pm to midnight, which doesn’t, absent any other evidence, seem very likely. . . . Alternatively, one could also raise the relative risk among cannabis drivers to 6x the base rate and get something close. Or some combination of the two. This means either the nationwide prevalence of driving while using cannabis increases massively on 4/20, or the RR of a fatal crash with the kind of cannabis use happening on 4/20 is absurdly high. Neither of these scenarios seem particularly likely based on what we currently know about cannabis use and driving risks.
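For what it’s worth, that back-of-envelope arithmetic fits in a few lines. The post-4:20pm vehicle-miles-traveled figure, the relative risk for driving while high, and the baseline prevalence below are illustrative assumptions, not necessarily the exact inputs Harper and Palayew used:

```python
# Back-of-envelope check of the numbers quoted above (illustrative assumptions).
deaths_420, deaths_control, years = 1369, 2453, 25
per_year_420 = deaths_420 / years              # ~55 deaths each 4/20
per_year_control = deaths_control / years / 2  # ~49 on each control day

vmt_billions = 2.5                             # assumed post-4:20pm VMT in one day
rate_420 = per_year_420 / vmt_billions         # ~22 fatal crashes per billion VMT
rate_control = per_year_control / vmt_billions # ~19.6

# If driving while high multiplies per-mile risk by rr, the overall rate scales
# by 1 + p * (rr - 1), where p is the share of driving done while high.
rr = 1.8                                       # assumed relative risk
p_needed = (rate_420 / rate_control - 1) / (rr - 1)
print(f"share of driving while high needed: {p_needed:.0%}")             # ~15%

p = 0.025                                      # assumed baseline prevalence
rr_needed = 1 + (rate_420 / rate_control - 1) / p
print(f"relative risk needed at {p:.1%} prevalence: {rr_needed:.1f}")    # close to 6
```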

They also look at the big picture:

Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part. It’s similar to that NRA study (see link at beginning of this post) in that the numbers just don’t add up.

Harper sent me this email last year. I wrote the above post and scheduled it for 4/20. In the meantime, he had more to report:

We published a replication paper with some additional analysis. The original paper in question (in JAMA Internal Med no less) used a design (comparing an index ‘window’ on a given day to the same ‘window’ +/- 1 week) similar to some others that you have blogged about (the NRA study, for example), and I think it merits similar skepticism (a sizeable fraction of the population would need to be driving while drugged/intoxicated on this day to raise the national rate by such a margin).

As I said, my co-author Adam Palayew and I replicated that paper’s findings but also showed that their results seem much more consistent with daily variations in traffic crashes throughout the year (lots of noise) and we used a few other well known “risky” days (July 4th is quite reliable for excess deaths from traffic crashes) as a comparison. We also used Stan to fit some partial pooling models to look at how these “effects” may vary over longer time windows.

I wrote an updated blog post about it here.

And the gated version of the paper is now posted on Injury Prevention’s website, but we have made a preprint and all of the raw data and code to reproduce our work available at my Open Science page.

Stan!

A question about the piranha problem as it applies to A/B testing

Wicaksono Wijono writes:

While listening to your seminar about the piranha problem a couple weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, sometimes people bring up the opposite argument. When experiments (A/B tests) are pre-registered, a lot of times the results are not statistically significant. And a few months down the line people would ask if we can re-run the experiment, because the app or website has changed, and so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by an interaction of previously established large effects, some people argue that large effects are hidden by yet unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is some ring of truth to the concerns they raise.

For instance, if the old website had a green layout, and we changed the button to green, then it might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have a positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds true with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects are out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
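Here is a minimal sketch of what I mean, with made-up numbers: per-setting effect estimates and standard errors from the same A/B test run in several settings, partially pooled with a simple normal-normal model. This is a quick empirical-Bayes approximation; a full Bayesian fit (in Stan, say) would also propagate the uncertainty in the hierarchical parameters.

```python
import numpy as np

# Hypothetical effect estimates (e.g., lift in conversion rate, in percentage
# points) and standard errors from the same A/B test run in six settings.
y  = np.array([ 0.8, -0.2,  1.5,  0.3, -0.5,  0.9])
se = np.array([ 0.6,  0.5,  0.7,  0.4,  0.6,  0.5])

# Normal-normal hierarchical model: y_j ~ N(theta_j, se_j^2), theta_j ~ N(mu, tau^2).
# Profile the marginal likelihood over tau on a grid.
def profile_loglik(tau):
    v = se**2 + tau**2
    mu_hat = np.sum(y / v) / np.sum(1 / v)
    return -0.5 * np.sum(np.log(v) + (y - mu_hat)**2 / v), mu_hat

taus = np.linspace(0.01, 3.0, 300)
lls, mus = zip(*(profile_loglik(t) for t in taus))
best = int(np.argmax(lls))
tau_hat, mu_hat = taus[best], mus[best]

# Partially pooled ("shrunk") effect estimate for each setting.
shrinkage = tau_hat**2 / (tau_hat**2 + se**2)
theta_hat = mu_hat + shrinkage * (y - mu_hat)
print("mu_hat:", round(mu_hat, 2), "tau_hat:", round(tau_hat, 2))
print("partially pooled estimates:", np.round(theta_hat, 2))
```

The estimated between-setting standard deviation tau then speaks directly to the “will this effect hold up next time?” question: if tau is large relative to the average effect, you should expect the effect to move around as the context changes.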

Lessons about statistics and research methods from that racial attitudes example

Yesterday we shared some discussions of recent survey results on racial attitudes.

For students and teachers of statistics or research methods, I think the key takeaway should be that you don’t want to pull out just one number from a survey; you want to get the big picture by looking at multiple questions, multiple years, and multiple data sources. You want to use the secret weapon.

Where do formal statistical theory and methods come in here? Not where you might think. No p-values or Bayesian inferences in the above-linked discussion, not even any confidence intervals or standard errors.

But that doesn’t mean that formal statistics are irrelevant, not at all.

Formal statistics gets used in the design and analysis of these surveys. We use probability and statistics to understand and design sampling strategies (cluster sampling, in the case of the General Social Survey) and to adjust for differences between sample and population (poststratification and survey weights; and if these adjustments are deemed unnecessary, statistical methods are used to make that call too).
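As a toy illustration of the poststratification step (made-up numbers, not GSS data): take the mean response within each demographic cell and then reweight by the cells’ known population shares, rather than by how many respondents happened to land in each cell.

```python
import numpy as np

# Toy poststratification sketch: the sample over-represents the oldest age cell,
# so reweight cell means by known population shares. All numbers are made up.
# Cells: 18-34, 35-64, 65+.
sample_n         = np.array([100, 150, 250])      # respondents per cell
sample_mean      = np.array([0.62, 0.55, 0.40])   # e.g., proportion agreeing, by cell
population_share = np.array([0.30, 0.50, 0.20])   # known from census figures

raw_estimate   = np.sum(sample_n * sample_mean) / np.sum(sample_n)
poststratified = np.sum(population_share * sample_mean)

print("unadjusted sample mean:", round(raw_estimate, 3))     # pulled toward the 65+ cell
print("poststratified estimate:", round(poststratified, 3))
```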

Formal statistics underlies this sort of empirical work in social science—you just don’t see it because it was already done before you got to the data.

Changing racial differences in attitudes on changing racial differences

Elin Waring writes:

Have you been following the release of GSS results this year? I had been vaguely aware that there was reporting on a few items, but then I happened to run the natrace and natracey variables (I use these in my class to look at question wording); they come from the “are we spending too much, too little, or about the right amount” questions on “improving the conditions of blacks” and “aid to blacks” (the images are from the SDA website at Berkeley).

Much as I [Waring] would love to believe that the American public really has changed racial attitudes, I find such a huge shift over such a short time very unlikely given what we know about stability of attitudes. And I even broke it down by age and there was a shift for all the age groups.

Then I saw this, and a colleague mentioned to me that the results for the proportion not sexually active were strange. And then today people were talking about the increase in the proportion not religiously affiliated.

It just seems very odd to me and I wondered if you had noticed it too. Could it be they just hit a strange cluster in their sampling? Or a weighting error of some kind? It’s true that attitudes on gay marriage changed very fast and that seems real, but this seems so surprising across so many separate issues.

I wasn’t sure so I passed this along to David Weakliem, my go-to guy when it comes to making sense of surveys and public opinion. Weakliem responded with some preliminary thoughts:

It did seem hard to believe at first. But there was a big move from 2014 to 2016 too (bigger than 2016-8), so if there is a problem with the survey it’s not just with 2018. The GSS also has a general question about whether the government has a special obligation to help blacks vs. no special treatment, and that also showed large moves in a liberal direction from 2014-6 and again from 2016-8. Finally, I looked for relevant questions from other surveys. There are some about how much discrimination there is. In 2013 and 2014, 19% and then 17% said there was a lot of discrimination against “African Americans” but in 2015 it was 36%; in 2016 and 2017 the question referred to “blacks” and 40% said there was a lot. So it seems that there really has been a substantial change in opinions about race since 2014. As far as why, I would guess that the media coverage and videos of police mistreatment of blacks had an impact—they made people think there really is a problem.

To which Waring replied:

The one thing I’d say in response to David is that while he could be right, these are shifts across a number of the long-term variables, not just the racial attitudes. Also, I think the GSS is intentionally designed not to be so responsive to day-to-day fluctuations based on the latest news. And POLHITOK sees an increase in “no” responses in 2018, but not so dramatic, and it looks like it’s in the same general territory as others from 2006 forward.

What really made me look at those particular variables was all the recent talk about reparations for slavery.

I also saw that Jay Livingston, who I wish had his own column in the New York Times—I’d rather see a sociologist’s writing about sociology than an ignorant former reporter’s writing about sociology—wrote something recently on survey attitudes regarding racial equality, but using a different data source:

Just last week, Pew published a report (here) about race in the US. Among many other things, it asked respondents about the “major” reasons that Black people “have a harder time getting ahead.” As expected, Whites were more likely to point to cultural/personal factors, Blacks to structural ones. But compared with a similar survey Pew did just three years ago, it looks like everyone is becoming more woke. . . .

For “racial discrimination,” the Black-White difference remains large. But in both groups, the percentage citing it as a major cause increases – by 14 points among Blacks, by nearly 20 points among Whites. The percentage identifying access to good schools as an important factor has not changed so much, increasing slightly among both Blacks and Whites.

More curious are the responses about jobs. In 2013, far more Whites than Blacks said that the lack of jobs was a major factor. In the intervening three years, jobs as a reason for not getting ahead became more salient among Blacks, less so among Whites.

At the same time, “culture of poverty” explanations became less popular.

Livingston continues with some GSS data and then concludes:

If both Whites and Blacks are paying more attention to racial discrimination and less to personal-cultural factors, if everyone is more woke, how does this square with the widely held perception that in the era of Trump, racism is on the rise? (In the Pew survey, 56% overall and 49% of Whites said Trump has made race relations worse. In no group, not even self-identified conservatives, does anything even close to a majority say that Trump has made race relations better.)

The data here points to a more complex view of recent history. The nastiest of the racists may have felt freer to express themselves in word and deed. And when they do, they make the news. Hence the widespread perception that race relations have deteriorated. But surveys can tell us what we don’t see on the news and Twitter. And in this case what they tell us is that the overall trend among Whites has been towards more liberal views on the causes of race differences in who gets ahead.

Interesting. Also, an increasing proportion of Americans are neither white nor black. So there’s a lot going on here.

Abandoning statistical significance is both sensible and practical

Valentin Amrhein, Sander Greenland, Blakeley McShane, and I write:

Dr Ioannidis writes against our proposals [here and here] to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.”

Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.

We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.

Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance.

Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture.

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

I especially like our title and our last paragraph!
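To see the letter’s filtering argument concretely, here is a minimal simulation sketch (the effect size and standard error are illustrative, not taken from any particular study): with a small true effect and noisy measurements, the estimates that clear the significance filter are exaggerated several-fold and sometimes have the wrong sign.

```python
import numpy as np

# Minimal simulation of the filtering argument: a small true effect measured
# with a lot of noise, "published" only when statistically significant.
rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.5          # illustrative values
est = rng.normal(true_effect, se, 1_000_000)
significant = np.abs(est / se) > 1.96

print("share reaching significance:", significant.mean())        # roughly 0.05
print("mean of published estimates:", est[significant].mean())   # ~0.5, five times the truth
print("share of published estimates with the wrong sign:",
      (est[significant] < 0).mean())                              # roughly 0.3
```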

Let me also emphasize that we have a lot of positive advice about how researchers can design studies and collect and analyze data (see for example here, here, and here). “Abandon statistical significance” is not the main thing we have to say. We’re writing about statistical significance to do our best to clear up some points of confusion, but our ultimate message in most of our writing and practice is to offer positive alternatives.

P.S. Also to clarify: “Abandon statistical significance” does not mean “Abandon statistical methods.” I do think it’s generally a good idea to produce estimates accompanied by uncertainty statements. There’s lots and lots to be done.

State-space models in Stan

Michael Ziedalski writes:

For the past few months I have been delving into Bayesian statistics and have (without hyperbole) finally found statistics intuitive and exciting. Recently I have gone into Bayesian time series methods; however, I have found no libraries to use that can implement those models.

Happily, I found Stan because it seemed among the most mature and flexible Bayesian libraries around, but is there any guide/book you could recommend for approaching state-space models through Stan? I am referring to more complex models, such as those found in State-Space Models, by Zeng and Wu, as well as Bayesian Analysis of Stochastic Process Models, by Insua et al. Most advanced books seem to use WinBUGS, but that library is closed-source and a bit older.

I replied that he should post his question on the Stan mailing list and also look at the example models and case studies for Stan.

I also passed the question on to Jim Savage, who added:

Stan’s great for time series, though mostly because it just allows you to flexibly write down whatever likelihood you want and put very flexible priors on everything, then fits it swiftly with a modern sampler and lets you do diagnoses that are difficult/impossible elsewhere!

Jeff Arnold has a fairly complete set of implementations for state-space models in Stan here. I’ve also got some more introductory blog posts that might help you get your head around writing out some time-series models in Stan. Here’s one on hierarchical VAR models. Here’s another on Hamilton-style regime-switching models. I’ve got a half-written tutorial on state-space models that I’ll come back to when I’m writing the time-series chapter in our Bayesian econometrics in Stan book.

One of the really nice things about Stan is that you can write out your state as parameters. Because Stan can efficiently sample from parameter spaces with hundreds of thousands of dimensions (if a bit slowly), this is fine. It’ll just be slower than a standard Kalman filter. It also changes the interpretation of the state estimate somewhat (more akin to a Kalman smoother, given you use all observations to fit the state).

Here’s an example of such a model.

Actually that last model had some problems with the between-state correlations, but I guess it’s still a good example of how to put something together in Markdown.
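For readers who haven’t seen one written out, here is a minimal local-level state-space model with a standard Kalman filter, as a point of comparison with the state-as-parameters approach Jim describes. This is a Python sketch with made-up variances, not Stan code; the Stan version would instead declare the whole state vector as parameters, so its posterior state estimates condition on all the observations, which is the Kalman-smoother analogy above.

```python
import numpy as np

# Minimal local-level model: y_t = mu_t + noise, mu_t = mu_{t-1} + drift.
# Simulated data with made-up variances, then filtered with a Kalman filter.
rng = np.random.default_rng(42)
T, sigma_obs, sigma_state = 200, 1.0, 0.3
state = np.cumsum(rng.normal(0, sigma_state, T))   # random-walk state mu_t
y = state + rng.normal(0, sigma_obs, T)            # noisy observations

m, P = 0.0, 10.0**2                                # vague prior on the initial state
filtered = np.empty(T)
for t in range(T):
    P = P + sigma_state**2                         # predict step
    K = P / (P + sigma_obs**2)                     # Kalman gain
    m = m + K * (y[t] - m)                         # update with observation t
    P = (1 - K) * P
    filtered[t] = m

print("RMSE of filtered state estimate:", np.sqrt(np.mean((filtered - state)**2)))
```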

All statistical conclusions require assumptions.

Mark Palko points us to this 2009 article by Itzhak Gilboa, Andrew Postlewaite, and David Schmeidler, which begins:

This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage’s axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian.

The paper expresses various misconceptions, for example the statement that the Bayesian approach requires a “subjective belief.” All statistical conclusions require assumptions, and a Bayesian prior distribution can be as subjective or un-subjective as any other assumption in the model. For example, I don’t recall seeing textbooks on statistical methods referring to the subjective belief underlying logistic regression or the Poisson distribution; I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.

More generally, it seems obvious to me that no statistical method will work best under all circumstances, hence I have no disagreement whatsoever with the opening sentence quoted above. I can’t quite see why they need 12 pages to make this argument, but whatever.

P.S. Also relevant is this discussion from a few years ago: The fallacy of the excluded middle—statistical philosophy edition.

Works of art that are about themselves

I watched Citizen Kane (for the umpteenth time) the other day and was again struck by how it is a movie about itself. Kane is William Randolph Hearst, but he’s also Orson Welles, boy wonder, and the movie Citizen Kane is self-consciously a masterpiece.

Some other examples of movies that are about themselves are La La Land, Primer (a low-budget experiment about a low-budget experiment), and Titanic (the biggest movie ever made, about the biggest boat ever made).

I want to call this Objects of the Class X, but I’m not sure what X is.