Question 7 of our Applied Regression final exam (and solution to question 6)

Here’s question 7 of our exam:

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

And the solution to question 6:

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

(a) is true, (b) is false. The problem is purely predictive, and adding a good predictor should help (on average; sure, you could find individual examples where it would make things worse, but there’s no reason to think it wouldn’t help in the generically described example above). When the goal is prediction (rather than estimating regression coefficients which will be given a direct causal interpretation), there’s no problem with adding this sort of informative predictor.

Common mistakes

Just about all the students got this one correct.

Question 6 of our Applied Regression final exam (and solution to question 5)

Here’s question 6 of our exam:

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

And the solution to question 5:

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

(a) is false. If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be negative.

(b) is false. It’s no problem at all to have more questions than students. Even in a classical regression, even without a multilevel model, this is typically no problem as long as each question is answered by a few different students.
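To make (a) concrete, here’s a quick simulation sketch of the two-parameter logistic item-response model (the numbers here are invented for illustration, not from any real exam): an item on which low-ability students do better than high-ability students corresponds to a negative discrimination parameter, not one near zero.

```python
import math
import random

def p_correct(ability, difficulty, discrimination):
    # Two-parameter logistic item-response model:
    # Pr(correct) = invlogit(discrimination * (ability - difficulty))
    return 1 / (1 + math.exp(-discrimination * (ability - difficulty)))

random.seed(7)
abilities = [random.gauss(0, 1) for _ in range(1000)]

# An item with discrimination = -1.5 and difficulty = 0:
# low-ability students are *more* likely to get it right.
low = [a for a in abilities if a < -1]
high = [a for a in abilities if a > 1]
rate_low = sum(p_correct(a, 0, -1.5) for a in low) / len(low)
rate_high = sum(p_correct(a, 0, -1.5) for a in high) / len(high)
print(rate_low > rate_high)  # True
```

With enough data, fitting the model (for example, in Stan) to responses generated this way would recover a discrimination near -1.5 for that item, not near zero.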

Common mistakes

Most of the students had the impression that one of (a) or (b) had to be true, so a common response was to work through one of the two options, figure out that it was false, and then mistakenly conclude that the other one was true. I guess I should rephrase the question. Instead of “Which of the following statements are basically true?”, I could say, “For each of the following statements, say whether it is true or false.”

Question 5 of our Applied Regression final exam (and solution to question 4)

Here’s question 5 of our exam:

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

And the solution to question 4:

4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?

(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.

(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.

Option (a) is wrong and option (b) is right. We discuss this in the missing-data chapter of the book. The point prediction from a regression model gives you something in the middle of the distribution; you need to add noise in order to approximate the correct spread.
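As a quick illustration of the point in (b), here’s a toy simulation (made-up model and numbers): imputing with the point prediction alone compresses the distribution toward the middle, while adding the residual noise roughly recovers the correct spread.

```python
import random

random.seed(1)
n = 10_000
sigma = 1.0

# Pretend x is the linear predictor X*beta from the fitted regression,
# and y is the actual (log) income it predicts.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, sigma) for xi in x]

det = x                                           # deterministic imputation: X*beta only
sto = [xi + random.gauss(0, sigma) for xi in x]   # imputation with the error term added

def sd(v):
    m = sum(v) / len(v)
    return (sum((vi - m) ** 2 for vi in v) / len(v)) ** 0.5

# The deterministic imputations are squeezed toward the middle:
print(sd(y), sd(det), sd(sto))  # roughly 1.41, 1.0, 1.41
```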

Common mistakes

Almost all the students got this one correct.

Tony nominations mean nothing

Someone writes:

I searched up *Tony nominations mean nothing* and I found nothing. So I had to write this.

There are currently 41 theaters that the Tony Awards accept when nominating their choices. If we are being as generous as possible, we could say that every one of those theaters will be hosting a performance that fits all of the requirements for an award. The Tony Awards have 26 different categories. There are 129 nominations this year, not including the special categories. For a play in this day and age to not get a single nomination is just a testament to its mediocrity. Plays or people can even get multiple nominations in the same category. The Best Featured Actress in a Musical category has these marvelous nominations:
Lilli Cooper, “Tootsie”
Amber Gray, “Hadestown”
Sarah Stiles, “Tootsie”
Ali Stroker, “Oklahoma!”
Mary Testa, “Oklahoma!”
People will frequently get nominated twice in the same category for different pieces!

According to the official Tony Awards website, “A show is only eligible in the season when it first opens, no matter how long it runs on Broadway.” This immediately gets rid of many current shows, and leaves only 21 shows by my counting. I may be slightly wrong, but that is still a very small number. If there are 129 possible nominations for your piece, and you are only 1 out of 21 possibilities, receiving a Tony nomination is not a badge of honor, but a badge of shame. There was recently an article in the New York Times about how King Lear, a show that received mixed reviews, was disappointed that it only got 1 nomination. I’d like to see if anyone else can help me figure this out.

My reply: OK, so here’s the question. Why so many Tonys for so few shows, which would seem to reduce its value?

The most natural answer is that Tonys and Tony nominations give value to “Broadway theatre” more generally: the different shows are in friendly competition, and more awards and more nominations get the butts in the seats.

But that doesn’t really answer the question, as at some point there have to be diminishing returns. The real question is where’s the equilibrium.

Remember that post from a few years ago about the economist who argued that the members of the Motion Picture Academy were irrational because they were giving Oscars to insufficiently popular movies: “One would hope the Academy would at least pay a bit more attention to the people paying the bills. Not only does it seem wrong (at least to this economist) to argue that movies many people like are simply not that good, focusing on the box office would seem to make good financial sense for the Oscars as well”?

The discussion there led to familiar territory in econ-talk: How much should we think that an institution (e.g., the Oscars, the Tonys) is at a sensible equilibrium, kept there by a mixture of rational calculation and the discipline of the market, and how much should we focus on the institution’s imperfections (slowness to change, principal-agent problems, etc.) and suggest improvements?

One comparison point is academic awards. Different academic fields seem to have different rates of giving awards. It would just about always seem to make sense to add an award: for example, if the Columbia stat dept added a best research paper award for its Ph.D. students, I think this would at the margin help the recipients get jobs, more than it would hurt the prospects of the students who didn’t get the award. On balance it would benefit our program. But we don’t have such an award—or, at least, I don’t think we do. Maybe we should. The point is that it doesn’t seem that statistics academia has reached equilibrium when it comes to awards. Political science, that’s another story: they have zillions of awards, all over the place. Equilibrium may well have been reached in that case.

Dan Simpson or Brian Pike might have more thoughts on the specific case of the Tonys. Maybe someone could “at” them?

P.S. When I was a kid, nobody cared about the Tonys, Emmys, or Grammys. But every year we watched the Oscars, Miss America, and the Wizard of Oz.

Question 3 of our Applied Regression final exam (and solution to question 2)

Here’s question 3 of our exam:

Here is a fitted model from the Bangladesh analysis predicting whether a person with high-arsenic drinking water will switch wells, given the arsenic level in their existing well and the distance to the nearest safe well.

glm(formula = switch ~ dist100 + arsenic, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept)  0.00    0.08
dist100     -0.90    0.10
arsenic      0.46    0.04
n = 3020, k = 3

Compare two people who live the same distance from the nearest well but whose arsenic levels differ, with one person having an arsenic level of 0.5 and the other person having a level of 1.0. Approximately how much more likely is this second person to switch wells? Give an approximate estimate, standard error, and 95% interval.

And the solution to question 2:

2. A multiple-choice test item has four options. Assume that a student taking this question either knows the answer or guesses at random. A random sample of 100 students take the item. 60% get it correct. Give an estimate and 95% confidence interval for the percentage in the population who know the answer.

Let p be the proportion of students in the population who would get the question correct. p has an estimate of 0.6 and a standard error of sqrt(0.5^2/100) = 0.05.

Let theta be the proportion of students in the population who actually know the answer. Based on the description above, we can write:
p = theta + 0.25*(1 – theta) = 0.25 + 0.75*theta,
thus theta = (p – 0.25)/0.75.
This gives us an estimate of theta of (0.6 – 0.25)/0.75 = 0.47 and a standard error of 0.05/0.75 = 0.07, so the 95% confidence interval is [0.47 +/- 2*0.07] = [0.33, 0.61].
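For what it’s worth, the arithmetic above can be spelled out in a few lines of code (just the numbers from the problem; nothing new here):

```python
import math

n = 100        # students sampled
p_hat = 0.6    # observed proportion answering correctly

# Conservative standard error for p, using 0.5 as the worst-case sd
se_p = math.sqrt(0.5**2 / n)             # 0.05

# Transformation: p = 0.25 + 0.75*theta, so theta = (p - 0.25)/0.75,
# and the standard error scales by the same factor of 1/0.75.
theta_hat = (p_hat - 0.25) / 0.75
se_theta = se_p / 0.75

ci = (theta_hat - 2 * se_theta, theta_hat + 2 * se_theta)
print(round(theta_hat, 2), round(se_theta, 2))   # 0.47 0.07
print(round(ci[0], 2), round(ci[1], 2))          # 0.33 0.6
```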

Common mistakes

Most of the students had no idea what to do here, but some of them figured out how to solve for theta. None of them got the standard error correct. The students who figured out the estimate of 0.47 simply computed a standard error as sqrt(0.47*(1 – 0.47)/100). Kinda frustrating. I’m not really sure how to teach this, although of course I could just assign this particular problem as homework and then maybe students would remember the general point about estimates and standard errors under transformations.

I’m also thinking this would be a good example to program up in Stan because then all these difficulties are handled automatically.

Still at work on the piranha theorems

We’re still at work on the piranha theorems. But, in the meantime, I happened to show somebody this:

There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data. The analogy is to a fish tank full of piranhas: it won’t take long before they eat each other.

And she said, wait, you’d better check to see if this is right. Are piranhas cannibals? That doesn’t seem right: if they’re cannibals, they’ll just eat each other and die out. But if they’re not cannibals, the analogy doesn’t work.

So when I got home, I looked it up. I googled *are piranhas cannibals*. And this is the first thing that came up:

So my analogy is safe, and we’re good to go.

P.S. I guess I could’ve titled the above post, Are Piranhas Cannibals?, but that would’ve violated the anti-SEO principles of this blog. Our general rule is, make the titles as boring as possible, then anyone who clicks through to read the post will be pleasantly surprised by all the entertainment value we offer.

Question 1 of our Applied Regression final exam

As promised, it’s time to go over the final exam of our applied regression class. It was an in-class exam, 3 hours for 15 questions.

Here’s the first question on the test:

1. A randomized experiment is performed within a survey. 1000 people are contacted. Half the people contacted are promised a $5 incentive to participate, and half are not promised an incentive. The result is a 50% response rate among the treated group and 40% response rate among the control group.

(a) Give an estimate and standard error of the average treatment effect.

(b) Give code to fit a logistic regression of response on the treatment indicator. Give the complete code, including assigning the data, setting up the variables, etc. It is not enough to simply give the one line of code for running the logistic regression.

See tomorrow’s post for the solution and a discussion of common errors.

New! from Bales/Pourzanjani/Vehtari/Petzold: Selecting the Metric in Hamiltonian Monte Carlo

Ben Bales, Arya Pourzanjani, Aki Vehtari, and Linda Petzold write:

We present a selection criterion for the Euclidean metric adapted during warmup in a Hamiltonian Monte Carlo sampler that makes it possible for a sampler to automatically pick the metric based on the model and the availability of warmup draws. Additionally, we present a new adaptation inspired by the selection criterion that requires significantly fewer warmup draws to be effective. The effectiveness of the selection criterion and adaptation are demonstrated on a number of applied problems. An implementation for the Stan probabilistic programming language is provided.

And here’s their conclusion:

Adapting an effective metric is important for the performance of HMC. This paper outlines a criterion that can be used to automate the selection of an efficient metric from an array of options. In addition, we present a new low-rank adaptation scheme that makes it possible to sample effectively from highly correlated posteriors, even when few warmup draws are available. The selection criterion and the new adaptation are demonstrated to be effective on a number of different models.

All of the necessary eigenvalues and eigenvectors needed to evaluate the selection criterion and build the new adaptation can be computed efficiently with the Lanczos algorithm, making this method suitable for models with large numbers of parameters.

This research looks like it will have a big practical impact.

Why edit a journal? More generally, how to contribute to scientific discussion?

The other day I wrote:

Journal editing is a volunteer job, and people sign up for it because they want to publish exciting new work, or maybe because they enjoy the power trip, or maybe out of a sense of duty—but, in any case, they typically aren’t in it for the controversy.

Jon Baron, editor of the journal Judgment and Decision Making, saw this and wrote:

In my case, the reasons are “all three”! But it isn’t a matter of “exciting new work” so much as “solid work with warranted conclusions, even if boring”. This is a very old-fashioned experimental psychologist’s approach. Boring is good. And the “power” is not a trivial consideration; many things that academics do have the purpose of influencing their fields, and editing, for me, beats teaching, blogging, writing trade books, giving talks, or even . . . (although it does not beat writing a textbook).

I’ve been asked many times to edit journals but I’ve always said no because I’ve felt that, personally, I can make better contributions to the field as a loner. Editing a journal would require too much social skill for me. We each should contribute where we can.

Also recall this story:

I remember, close to 20 years ago, an economist friend of mine was despairing of the inefficiencies of the traditional system of review, and he decided to do something about it: He created his own system of journals. They were all online (a relatively new thing at the time), with an innovative transactional system of reviewing (as I recall, every time you submitted an article you were implicitly agreeing to review three articles by others) and a multi-tier acceptance system, so that very few papers got rejected; instead they were just binned into four quality levels. And all the papers were open-access or something like that.

The system was pretty cool, but for some reason it didn’t catch on—I guess that, like many such systems, it relied a lot on continuing volunteer efforts of its founder, and perhaps he just got tired of running an online publishing empire, and the whole thing kinda fell apart. The journals lost all their innovative aspects and became just one more set of social science publishing outlets. My friend ended up selling his group of journals to a traditional for-profit company, they were no longer free, etc. It was like the whole thing never happened.

A noble experiment, but not self-sustaining. Which was too bad, given that he’d put so much effort into building a self-sustaining structure.

Perhaps one lesson from my friend’s unfortunate experience is that it’s not enough to build a structure; you also need to build a community.

Another lesson is that maybe it can help to lean on some existing institution. This guy built up his whole online publishing company from scratch, which was kinda cool, but then when he no longer felt like running it, it dissolved, and then he ended up with a pile of money, which he probably didn’t need and he might never get around to spending, while losing the scientific influence, which is more interesting and important. Maybe it would’ve been better for him to have teamed up with an economics society, or with some university, governmental body, or public-interest organization.

Good intentions are not enough, and even good intentions + a lot of effort aren’t enough. You have to work with existing institutions, or create your own. This blog works in part because it piggybacks off the existing institution of blogging. Nowadays there isn’t much blogging anymore, but the circa 2005-era blogosphere was helpful in giving us a sense of how to set up our community. We built upon the strengths of the blogosphere and avoided some of the pitfalls.

Similarly this is the challenge of reforming scientific communication: to do something better while making use of existing institutions and channels whereby researchers donate their labor.

My (remote) talk this Friday 3pm at the Department of Cognitive Science at UCSD

It was too much to do one more flight so I’ll do this one in (nearly) carbon-free style using hangout or skype.

It’s 3pm Pacific time in CSB (Cognitive Science Building) 003 at the University of California, San Diego.

This is what they asked for in the invite:

Our Friday afternoon COGS200 series has been a major foundation of the Cognitive Science community and curriculum in our department for decades and is attended by faculty and students from diverse fields (e.g. anthropology, human-computer-interface/design, AI/machine learning, neuroscience, philosophy of mind, psychology, genetics, etc).

One of the goals of our Spring quarter series is to expose attendees to research on the cultural practices surrounding data acquisition, analysis, and interpretation. In particular, we were hoping to have a section exploring current methods in statistical inference, with an emphasis on designing analyses appropriate to the question being asked. If you are interested and willing, we would love for you to share your expertise on multilevel / hierarchical modeling—as well as your more general perspective on how scientists can better deploy statistical models for conducting good, replicable science. (Relevant papers that come to mind include your 2016 paper on “multiverse analyses”, as well as your 2017 “Abandon statistical significance” paper.)

I’m still not sure what’s the best thing to talk about. I guess I’ll start with what’s in that above paragraph and then go from there.