Here are some examples of real-world statistical analyses that don’t use p-values and significance testing.

Joe Nadeau writes:

I’ve followed the issues about p-values, signif. testing et al. both on blogs and in the literature. I appreciate the points raised, and the pointers to alternative approaches. All very interesting, provocative.

My question is whether you and your colleagues can point to real world examples of these alternative approaches. It’s somewhat easy to point to mistakes in the literature. It’s harder, and more instructive, to learn from good analyses of empirical studies.

My reply:

I have lots of examples of alternative approaches; see the applied papers here.

And here are two particular examples:

The Millennium Villages Project: a retrospective, observational, endline evaluation

Analysis of Local Decisions Using Hierarchical Modeling, Applied to Home Radon Measurement and Remediation

Attorney General of the United States less racist than Nobel prize winning biologist

This sounds pretty bad:

The FBI was better off when “you all only hired Irishmen,” [former Attorney General] Sessions said in one diatribe about the bureau’s workforce. “They were drunks but they could be trusted. . . .”

But compare to this from Mister Helix:

[The] historic curse of the Irish . . . is not alcohol, it’s not stupidity. . . it’s ignorance. . . . some anti-Semitism is justified. Just like some anti-Irish feeling is justified . . .

And “who would want to adopt an Irish kid?”

Watson elaborated:

You can be real dumb or you can seem dumb because you don’t know anything — that’s all I’m saying. The Irish seemed dumb because they didn’t know anything.

He seems to be tying himself into knots, trying to reconcile old-style anti-Irish racism with modern-day racism in which there’s a single white race. You see, he wants to say Irish are inferior, but he can’t say their genes are worse, so he puts it down to “ignorance.”

Overall, I’d have to say Sessions is less of a bigot: Sure, he brings in a classic stereotype, but in a positive way!

Lots of us say stupid and obnoxious things in private. One of the difficulties of being a public figure is that even your casual conversation can be monitored. It must be tough to be in that position, and I can see how at some point you might just give up and let it all loose, Sessions or Watson style, and just go full-out racist.

For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

Statistical models are placeholders. We lay down a model, fit it to data, use the fitted model to make inferences about quantities of interest (qois), check to see if the model’s implications are consistent with data and substantive information, and then go back to the model and alter, fix, update, augment, etc.

Given that models are placeholders, we’re interested in the dependence of inferences on model assumptions. In particular, with Bayesian inference we’re often concerned about the prior.

With that in mind, a while ago I came up with this recommendation.

For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

The idea here is that if the prior distribution is informative in this way, it can make sense to think harder about it, rather than just accepting the placeholder.
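Here is a minimal sketch of the check in Python, assuming you already have posterior draws for each parameter (for example, extracted from a fitted model) and you know the prior sd you specified. The function names, the dictionary layout, and the simulated draws are all made up for illustration; only the 0.1 threshold and the wording of the note come from the recommendation above.

```python
import random
import statistics

def sd_ratios(prior_sd, posterior_draws):
    """Posterior sd divided by prior sd, for each parameter or qoi."""
    return {
        name: statistics.stdev(draws) / prior_sd[name]
        for name, draws in posterior_draws.items()
    }

def informative_prior_notes(prior_sd, posterior_draws, threshold=0.1):
    """One note per parameter whose posterior sd exceeds threshold * prior sd."""
    notes = []
    for name, ratio in sd_ratios(prior_sd, posterior_draws).items():
        if ratio > threshold:
            notes.append(
                f"{name}: The prior distribution for this parameter is "
                f"informative. (posterior sd / prior sd = {ratio:.2f})"
            )
    return notes

# Hypothetical inputs: prior sds as specified in the model, and fake
# "posterior draws" standing in for real simulation output.
rng = random.Random(1)
prior_sd = {"alpha": 10.0, "beta": 1.0}
posterior_draws = {
    "alpha": [rng.gauss(2.0, 0.2) for _ in range(4000)],  # ratio near 0.02
    "beta": [rng.gauss(0.3, 0.5) for _ in range(4000)],   # ratio near 0.5
}

notes = informative_prior_notes(prior_sd, posterior_draws)
for note in notes:
    print(note)
```

In this fake example the data pin down alpha far more tightly than the prior does, so only beta gets the note. For a derived qoi with no closed-form prior, the prior sd would have to come from prior simulations instead.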

I’ve been interested in using this idea and formalizing it, and then the other day I got an email from Virginia Gori, who wrote:

I recently read your contribution to the Stan wiki page on priors choice recommendations, suggesting to ensure that the ratio of the standard deviations of the posterior and the prior(s) is at least 0.1 to assess how informative priors are.

I found it very useful, and would like to use it in a publication. Searching online, I could only find this criteria in the Stan manual. I wonder if there’s a peer reviewed publication on this I should reference.

I have no peer-reviewed publication, or even any clear justification of the idea, nor have I seen it in the literature. But it could be there.

So this post serves several functions:

– It’s something that Gori can point to as a reference, if the wiki isn’t enough.

– It’s a call for people (You! Blog readers and commenters!) to point us to any relevant literature, including ideally some already-written paper by somebody else proposing the above idea.

– It’s a call for people (You! Blog readers and commenters!) to suggest some ideas for how to write up the above idea in a sensible way so we can have an Arxiv paper on the topic.

Conditional probability and police shootings

A political scientist writes:

You might have already seen this, but in case not: PNAS published a paper [Officer characteristics and racial disparities in fatal officer-involved shootings, by David Johnson, Trevor Tress, Nicole Burkel, Carley Taylor, and Joseph Cesario] recently finding no evidence of racial bias in police shootings:

Jonathan Mummolo and Dean Knox noted that the data cannot actually lead to any substantive conclusions one way or another, because the authors invert the conditional probability of interest (actually, the problem is a little more complicated, involving assumptions about base rates). They wrote a letter to PNAS pointing this out, but unfortunately PNAS decided not to publish it.

Maybe blogworthy? (If so, maybe immediately rather than on normal lag given prominence of study?)

OK, here it is.
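To see why inverting a conditional probability matters, here is a minimal sketch in Python. It uses Bayes’ rule to turn P(group | outcome) into the ratio P(outcome | group A) / P(outcome | group B), which depends on the base rate of each group among encounters. All the numbers are hypothetical and chosen only for illustration; none come from the PNAS paper or from Knox and Mummolo’s letter, and the function name is my own.

```python
def risk_ratio(share_a_among_outcomes, share_a_among_encounters):
    """P(outcome | A) / P(outcome | B) for two groups A and B.

    By Bayes' rule, P(outcome | A) = P(A | outcome) * P(outcome) / P(A),
    and likewise for B, so P(outcome) cancels in the ratio.
    """
    rate_a = share_a_among_outcomes / share_a_among_encounters
    rate_b = (1 - share_a_among_outcomes) / (1 - share_a_among_encounters)
    return rate_a / rate_b

# Hypothetical: group A makes up 30% of the outcomes of interest.
share_a_among_outcomes = 0.30

ratios = {}
for base_rate in (0.10, 0.30, 0.50):  # assumed share of A among encounters
    ratios[base_rate] = risk_ratio(share_a_among_outcomes, base_rate)
    print(f"P(A) = {base_rate:.2f}: risk ratio A vs. B = {ratios[base_rate]:.2f}")
```

The same share among outcomes is consistent with group A facing a higher, equal, or lower rate than group B, depending entirely on the assumed base rate. That is why data on outcomes alone cannot settle the question.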

“Study finds ‘Growth Mindset’ intervention taking less than an hour raises grades for ninth graders”

I received this press release in the mail:

Study finds ‘Growth Mindset’ intervention taking less than an hour raises grades for ninth graders

Intervention is first to show national applicability, breaks new methodological ground

– Study finds low-cost, online growth mindset program taking less than an hour can improve ninth graders’ academic achievement
– The program can be used for free in high schools around U.S. and Canada
– Researchers developed rigorous new study design that can help identify who could benefit most from intervention and under which social contexts

A groundbreaking study of more than 12,000 ninth grade U.S. students has revealed how a brief, low-cost, online program that takes less than an hour to complete can help students develop a growth mindset and improve their academic achievement. A growth mindset is the belief that a person’s intellectual abilities are not fixed and can be further developed.

Published in the journal Nature on August 7, the nationally representative study showed that both lower- and higher-achieving students benefited from the program. Lower-achieving students had significantly higher grades in ninth grade, on average, and both lower- and higher-achieving students were more likely to enroll in more challenging math courses their sophomore year. The program increased achievement as much as, and in some cases more than, previously evaluated, larger-scale education interventions costing far more and taking far longer to complete. . . .

The National Study of Learning Mindsets is as notable for its methodology to investigate the differences, or heterogeneity, in treatment effects . . . the first time an experimental study in education or social psychology has used a random, nationally representative sample—rather than a convenience sample . . .

Past studies have shown mixed effects for growth mindset interventions, with some showing small effects and others showing larger ones.

“These mixed findings result from both differences in the types of interventions, as well as from not using nationally representative samples in ways that rule out other competing hypotheses,” [statistician Elizabeth] Tipton said. . . .

The researchers hypothesized that the effects of the mindset growth intervention would be stronger for some types of schools and students than others and designed a rigorous study that could test for such differences. Though the overall effect might be small when looking at all schools, particular types of schools, such as those performing in the bottom 75% of academic achievement, showed larger effects from the intervention.

More here.

I’m often skeptical about studies that appear in the tabloids and get promoted via press release, and I guess I’m skeptical here too—but I know a lot of the people involved in this one, and I think they know what they’re doing. Also I think I helped out in the design of this study, so it’s not like I’m a neutral observer here.

One thing that does bother me is all the p-values in the paper and, in general, the reliance on classical analysis. Given that the goal of this research is to recognize variation in treatment effects, I think it should be reasonable to expect lots of the important aspects of the model to not be estimated very precisely from data (remember 16). So I’m thinking that, instead of strewing the text with p-values, there should be a better way to summarize inferences for interactions. Along similar lines, I’m guessing they could do better using Bayesian multilevel analysis to partially pool estimated interactions toward zero, rather than simple data comparisons which will be noisy. I recognize that many people consider classical analysis to be safer or more conservative, but statistical significance thresholding can just add noise; I think it’s partial pooling that will give results that are more stable and more likely to stand up under replication. This is not to say that I think the conclusions in the article are wrong; also, just at the level of the statistics, I think by far the most important issues are those identified by Tipton in the above-linked press release. I just think there’s more that can be done. Later on in the article they do include multilevel models, and so maybe it’s just that I’d like to see those analyses, including non-statistically-significant results, more fully incorporated into the discussion.
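To give a sense of what partial pooling toward zero does to a set of noisy interaction estimates, here is a minimal sketch in Python. It is not the analysis in the paper and not a full Bayesian multilevel model: it assumes each true interaction is drawn from a normal distribution centered at zero, estimates that distribution’s variance by a simple method of moments, and shrinks each raw estimate accordingly. The function name and all the input numbers are hypothetical.

```python
def partial_pool_toward_zero(estimates, std_errors):
    """Shrink noisy estimates toward zero under theta_j ~ normal(0, tau).

    tau^2 is estimated crudely as mean(estimate^2) - mean(se^2), floored at 0.
    Each pooled estimate is estimate * tau^2 / (tau^2 + se^2).
    """
    n = len(estimates)
    tau2 = max(
        0.0,
        sum(e * e for e in estimates) / n - sum(s * s for s in std_errors) / n,
    )
    pooled = []
    for e, s in zip(estimates, std_errors):
        weight = tau2 / (tau2 + s * s) if tau2 > 0 else 0.0
        pooled.append(weight * e)
    return tau2, pooled

# Hypothetical raw interaction estimates and their standard errors.
estimates = [0.10, -0.02, 0.05, 0.15, -0.08]
std_errors = [0.06, 0.06, 0.06, 0.06, 0.06]

tau2, pooled = partial_pool_toward_zero(estimates, std_errors)
for raw, shrunk in zip(estimates, pooled):
    print(f"raw {raw:+.3f} -> partially pooled {shrunk:+.3f}")
```

Every estimate is pulled toward zero by the same factor here because the standard errors are equal; with unequal standard errors, the noisier comparisons get pulled harder. A full multilevel fit would also carry uncertainty about tau, which this plug-in version ignores.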

It appears the data and code are available here, so other people can do their own analyses, perhaps using multilevel modeling and graphical displays of grids of comparisons to get a clearer picture of what can be learned from the data.

In any case, this topic is potentially very important (an effective intervention lasting an hour), so I’m glad that top statisticians and education researchers are working on it. Here’s how Yeager et al. conclude:

The combined importance of belief change and school environments in our study underscores the need for interdisciplinary research to understand the numerous influences on adolescents’ developmental trajectories.

Hey, look! The R graph gallery is back.

We’ve recommended the R graph gallery before, but then it got taken down.

But now it’s back! I wouldn’t use it on its own as a teaching tool, in that it has a lot of graphs that I would not recommend (see here), but it’s a great resource, so thanks so much to Yan Holtz for putting this together. He has a Python graph gallery too at the same site.

Are supercentenarians mostly superfrauds?

Ethan Steinberg points to a new article by Saul Justin Newman with the wonderfully descriptive title, “Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans,” which begins:

The observation of individuals attaining remarkable ages, and their concentration into geographic sub-regions or ‘blue zones’, has generated considerable scientific interest. Proposed drivers of remarkable longevity include high vegetable intake, strong social connections, and genetic markers. Here, we reveal new predictors of remarkable longevity and ‘supercentenarian’ status. In the United States, supercentenarian status is predicted by the absence of vital registration. The state-specific introduction of birth certificates is associated with a 69-82% fall in the number of supercentenarian records. In Italy, which has more uniform vital registration, remarkable longevity is instead predicted by low per capita incomes and a short life expectancy. Finally, the designated ‘blue zones’ of Sardinia, Okinawa, and Ikaria corresponded to regions with low incomes, low literacy, high crime rate and short life expectancy relative to their national average.

In summary:

As such, relative poverty and short lifespan constitute unexpected predictors of centenarian and supercentenarian status, and support a primary role of fraud and error in generating remarkable human age records.

Supercentenarians are defined as “individuals attaining 110 years of age.”

I’ve skimmed the article but not examined the data or the analysis—we can leave that to the experts—but, if what Newman did is correct, it’s a great story about the importance of measurement in learning about the world.

Holes in Bayesian Philosophy: My talk for the philosophy of statistics conference this Wed.

4pm Wed 7 Aug 2019 at Virginia Tech (via videolink):

Holes in Bayesian Philosophy

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Every philosophy has holes, and it is the responsibility of proponents of a philosophy to point out these problems. Here are a few holes in Bayesian data analysis: (1) flat priors immediately lead to terrible inferences about things we care about, (2) subjective priors are incoherent, (3) Bayes factors don’t work, (4) for the usual Cantorian reasons we need to check our models, but this destroys the coherence of Bayesian inference. Some of the problems of Bayesian statistics arise from people trying to do things they shouldn’t be trying to do, but other holes are not so easily patched.

A weird new form of email scam

OK, we all know that spam we get—sometimes spoofed as if from our own email address!—telling us to click on some link.

Scene 1

The other day I got a new sort of spam. It was from a colleague, the subject line was “Are you available in campus,” and the email went like this:

On Feb 9, 2019, at 11:44 AM, ** <**> wrote:

Hello are you there?

with a legitimate-looking signature line with this professor’s title.

Seemed a bit brief, but who knows? I responded when I got the email, several hours later, saying that I was not around right then.

I completely forgot about all this until I received the following email today from a completely different colleague, subject line “Are you on campus,” with the following content:

On Feb 13, 2019, at 4:56 PM, ** <**> wrote:

Are you free at the moment ?

Again, the message ended with a legitimate-looking signature line.

This seemed odd, so I checked the emails carefully and noticed that they were not the actual emails of these two colleagues.

OK, so it’s some sort of scam. But, as is often the case, I can’t figure out the plan. I’m gonna respond to this email and then . . . what, exactly? I mean, whoever’s doing the scam already has my email, so what do they get out of me responding to some fake address?

I can’t figure this one out.

Allowing intercepts and slopes to vary in a logistic regression: how does this change the ROC curve?

Jonathan Hughes writes:

I am an engineering doctoral student. As part of my dissertation I’m proposing a mode of adaptation for a predictive system to individual subgroup specific streams of data which come each from a specific subgroup of a mixture population distribution. As part of the proposal presentation someone referenced your work and believed that you may have addressed the problem described below. I have read many of your academic writings and I don’t know if it is the case, and I haven’t been able to find it.

I will explain the problem briefly:

Let M_p be a logistic regression model that assumes a single homogeneous population logit(pi) = beta + beta_1*x + noise, but where there are latent subgroups in the population with varying distributions (but same in form), i.e. the true case is modeled by

M_s := logit(pi) = beta_pop + beta_subgroup*indicator + beta_1*x + beta_1_subgroup*indicator*x + noise;

What is the expected gain in ROC* area under the curve, AUC, from including the subgroup information? i.e. what is E[AUC(M_s) – AUC(M_p)], under some reasonable assumptions? I would like to incorporate some theoretical results about this, either those I have been deriving myself or others’ with priority.

My reply: This looks like a varying-intercept, varying-slope logistic regression of the sort that is described in various places including my book with Jennifer Hill, with the twist that the groups are unknown. I have no results on area under the curve or the ROC curve more generally, so I suggest you explore this using fake-data simulation. For your data at hand, you can evaluate how much gain you’re getting by using leave-one-out cross-validation.
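Here is a minimal sketch of what such a fake-data simulation could look like, in plain Python. It rests on several assumptions that are mine, not Hughes’s: there are two subgroups, the subgroup indicator is observed (in his problem the groups are latent), the coefficient values are made up, the models are fit by simple gradient ascent, and AUC is evaluated on a held-out fake dataset rather than by leave-one-out cross-validation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def simulate(n, rng):
    """Fake data from a two-subgroup model; all coefficients are made up."""
    data = []
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 0  # subgroup indicator
        x = rng.gauss(0.0, 1.0)
        eta = -0.5 + 1.5 * g + 1.0 * x - 2.0 * g * x
        y = 1 if rng.random() < sigmoid(eta) else 0
        data.append((x, g, y))
    return data

def features_pooled(x, g):
    return [1.0, x]  # M_p: ignores the subgroup

def features_subgroup(x, g):
    return [1.0, x, float(g), g * x]  # M_s: varying intercept and slope

def fit_logistic(X, y, steps=300, lr=1.0):
    """Maximum likelihood by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, yi in zip(X, y):
            resid = yi - sigmoid(sum(b * f for b, f in zip(beta, xi)))
            for j in range(k):
                grad[j] += resid * xi[j]
        beta = [b + lr * gj / n for b, gj in zip(beta, grad)]
    return beta

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_gain(rng, n_train=400, n_test=400):
    """AUC(M_s) - AUC(M_p) on held-out fake data for one simulated dataset."""
    train, test = simulate(n_train, rng), simulate(n_test, rng)
    y_train = [y for _, _, y in train]
    y_test = [y for _, _, y in test]
    result = {}
    for name, feats in (("pooled", features_pooled),
                        ("subgroup", features_subgroup)):
        beta = fit_logistic([feats(x, g) for x, g, _ in train], y_train)
        scores = [sum(b * f for b, f in zip(beta, feats(x, g)))
                  for x, g, _ in test]
        result[name] = auc(scores, y_test)
    return result["subgroup"] - result["pooled"]

rng = random.Random(2019)
gains = [auc_gain(rng) for _ in range(5)]
mean_gain = sum(gains) / len(gains)
print("AUC gains by replication:", [round(g, 3) for g in gains])
print("mean gain:", round(mean_gain, 3))
```

The slopes here have opposite signs in the two subgroups, so the pooled model has little to work with and the gain is large. Rerunning with subgroup coefficients closer together, or with unequal subgroup sizes, would show how the expected gain depends on those assumptions.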