Abortion attitudes: The polarization is among richer, more educated whites

Abortion has been in the news lately. A journalist asked me something about abortion attitudes and I pointed to a post from a few years ago about partisan polarization on abortion. Also this with John Sides on why abortion consensus is unlikely. That was back in 2009, and consensus doesn’t seem any more likely today.

It’s perhaps not well known (although it’s consistent with what we found in Red State Blue State) that just about all the polarization on abortion comes from whites, and most of that is from upper-income, well-educated whites. Here’s an incomplete article that Yair and I wrote on this from 2010; we haven’t followed up on it recently.

Vigorous data-handling tied to publication in top journals among public health researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.
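To see how much of a bias this pick-the-biggest reporting can introduce, here’s a quick simulation with made-up numbers (not the study’s data; the true hazard ratio, standard error, and number of subgroups below are all assumptions): every subgroup has the same modest true effect, yet the largest of the noisy subgroup estimates is consistently much bigger.

```python
# Toy simulation of the selection bias from reporting only the largest of several
# noisy subgroup estimates. All numbers here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_log_hr = np.log(1.10)   # assume the same modest true hazard ratio in every subgroup
se = 0.15                    # assumed standard error of each subgroup's log-HR estimate
n_subgroups = 10             # assumed number of subgroup analyses
n_sims = 10_000

est = rng.normal(true_log_hr, se, size=(n_sims, n_subgroups))
print("average HR if you report one pre-chosen subgroup:",
      round(float(np.exp(est[:, 0].mean())), 2))
print("average HR if you always report the largest subgroup estimate:",
      round(float(np.exp(est.max(axis=1).mean())), 2))
```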

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of whom 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know whether there were people with neovascular AMD who were not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with a 95% confidence interval of [1.02, 1.49]. There were also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistical significance and some do not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.
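To spell that out with a back-of-the-envelope calculation: the women’s estimate isn’t quoted above, so the numbers below are purely hypothetical, but they show how a “significant” estimate in one subgroup and a “non-significant” one in another can be statistically indistinguishable from each other.

```python
# Hypothetical numbers illustrating that "significant in men, not in women" does not
# imply that the men-vs-women difference is itself statistically significant.
from math import log, sqrt
from statistics import NormalDist

hr_men, se_men = 1.54, 0.20      # "significant" subgroup (the SE is a made-up value)
hr_women, se_women = 1.10, 0.20  # "non-significant" subgroup (both numbers made up)

diff = log(hr_men) - log(hr_women)        # difference on the log hazard ratio scale
se_diff = sqrt(se_men**2 + se_women**2)
z = diff / se_diff
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z for the difference: {z:.2f}, two-sided p: {p:.2f}")  # roughly z = 1.2, p = 0.23
```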

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, any reasonable prior distribution on the effect should be concentrated near zero, which would pull all the estimates toward 0 (or pull all the hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.
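To make the second point concrete, here’s a minimal normal-normal shrinkage calculation on the log hazard ratio scale. The standard error is backed out from the reported 95% interval; the prior scale of 0.05 is my assumption, not a number from the paper.

```python
# Sketch of shrinking the reported main-effect hazard ratio toward the null using a
# prior concentrated near zero on the log-HR scale (prior sd of 0.05 is an assumption).
import numpy as np

est = np.log(1.23)                                # reported main-effect hazard ratio
se = (np.log(1.49) - np.log(1.02)) / (2 * 1.96)   # SE recovered from the 95% CI

prior_sd = 0.05                                   # prior centered at 0, i.e., at HR = 1
post_var = 1 / (1 / se**2 + 1 / prior_sd**2)
post_mean = post_var * est / se**2

lo, hi = post_mean - 1.96 * np.sqrt(post_var), post_mean + 1.96 * np.sqrt(post_var)
print("posterior HR:", round(float(np.exp(post_mean)), 2),
      "95% interval:", np.round(np.exp([lo, hi]), 2))   # the interval now includes 1
```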

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re gettin played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!

Hey, people are doing the multiverse!

Elio Campitelli writes:

I just saw this image in a paper discussing the weight of evidence for a “hiatus” in the global warming signal and immediately thought of the garden of forking paths.

From the paper:

Tree representation of choices to represent and test pause-periods. The ‘pause’ is defined as either no-trend or a slow-trend. The trends can be measured as ‘broken’ or ‘continuous’ trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of ‘historical’ versions of the datasets as they existed, or contemporary versions providing full dataset ‘hindsight’. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The ‘year’ rows are for assessments undertaken at each year in time.

Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the ‘pause’ (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the ‘pause’ as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree.

Actually, it’s the multiverse.
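For concreteness, here’s a tiny sketch of the choice structure described in that caption. Each combination of choices is one analysis a researcher could have run:

```python
# Enumerate the analysis choices described in the tree; each combination is one branch
# of the garden of forking paths, or one universe of the multiverse.
from itertools import product

choices = {
    "pause definition": ["no-trend", "slow-trend"],
    "trend model": ["broken", "continuous"],
    "dataset": ["HadCRUT", "GISTEMP", "other"],
    "data version": ["historical", "hindsight"],
}

paths = list(product(*choices.values()))
print(len(paths), "distinct analyses")   # 2 * 2 * 3 * 2 = 24
for path in paths[:3]:                   # show a few branches
    print(dict(zip(choices, path)))
```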

Data quality is a thing.

I just happened to come across this story, where a journalist took some garbled data and spun a false tale which then got spread without question.

It’s a problem. First, it’s a problem that people will repeat unjustified claims; it’s also a problem that, when data are attached, you can get complete credulity, even for claims that are implausible on the face of it.

So it’s good to be reminded: “Data” are just numbers. You need to know where the data came from before you can learn anything from them.

“In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous,” but “Over 20 journals turned down her paper . . . and nobody wanted to fund privacy research that might reach uncomfortable conclusions.”

Tom Daula writes:

I think this story from John Cook is a different perspective on replication and how scientists respond to errors.

In particular the final paragraph:

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In [Latanya] Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

I think most of your scientific error stories follow this pattern. The error is pointed out privately and then publicized. Of course in most of your posts a private email is met with hostility, the error is publicized, and then the scientist digs in. The good stories are when the authors admit and publicize the error themselves.

Replication, especially in psychology, fits into this because there is no “single responsible party” so “calling attention to the problem [is] the only way to make things better.”

I imagine Latanya Sweeney and you share similar frustrations.

It’s an interesting story. I was thinking about this recently when reading one of Edward Winter’s chess notes collections. These notes are full of stories of sloppy writers copying things without citation, reproducing errors that have appeared elsewhere, introducing new errors (see an example here with follow-up here). Anyway, what’s striking to me is that so many people just don’t seem to care about getting their facts wrong. Or, maybe they do care, but not enough to fix their errors or apologize or even thank the people who point out the mistakes that they’ve made. I mean, why bother writing a chess book if you’re gonna put mistakes in it? It’s not like you can make a lot of money from these things.

Sweeney’s example is of course much more important, but sometimes when thinking about a general topic (in this case, authors getting angry when their errors are revealed to the world) it can be helpful to think about minor cases too.

“MRP is the Carmelo Anthony of election forecasting methods”? So we’re doing trash talking now??

What’s the deal with Nate Silver calling MRP “the Carmelo Anthony of forecasting methods”?

Someone sent this to me:

and I was like, wtf? I don’t say wtf very often—at least, not on the blog—but this just seemed weird.

For one thing, Nate and I did a project together once using MRP: this was our estimate of attitudes on health care reform by age, income, and state:

Without MRP, we couldn’t’ve done anything like it.

So, what gives?

Here’s a partial list of things that MRP has done:

– Estimating public opinion in slices of the population

– Improved analysis using the voter file

– Polling using the Xbox that outperformed conventional poll aggregates

– Changing our understanding of the role of nonresponse in polling swings

– Post-election analysis that’s a lot more trustworthy than exit polls

OK, sure, MRP has solved lots of problems, it’s revolutionized polling, no matter what Team Buggy Whip says.

That said, it’s possible that MRP is overrated. “Overrated” refers to the difference between rated quality and actual quality. MRP, wonderful as it is, might well be rated too highly in some quarters. I wouldn’t call MRP a “forecasting method,” but that’s another story.

I guess the thing that bugged me about the Carmelo Anthony comparison is that my impression from reading the sports news is not just that Anthony is overrated but that he’s an actual liability for his teams. Whereas I see MRP, overrated as it may be (I’ve seen no evidence that MRP is overrated but I’ll accept this for the purpose of argument), as still a valuable contributor to polling.

Ten years ago . . .

The end of the aughts. It was a simpler time. Nate Silver was willing to publish an analysis that used MRP. We all thought embodied cognition was real. Donald Trump was a reality-TV star. Kevin Spacey was cool. Nobody outside of suburban Maryland had heard of Beach Week.

And . . . Carmelo Anthony got lots of respect from the number crunchers.

Check this out:

So here’s the story according to Nate: MRP is like Carmelo Anthony because they’re both overrated. But Carmelo Anthony isn’t overrated, he’s really underrated. So maybe Nate’s MRP jab was just a backhanded MRP compliment?

Simpler story, I guess, is that back around 2010 Nate liked MRP and he liked Carmelo. Back then, he thought the people who thought Carmelo was overrated were wrong. In 2018, he isn’t so impressed with either of them. Nate’s impressions of MRP and Carmelo Anthony go up and down together. That’s consistent, I guess.

In all seriousness . . .

Unlike Nate Silver, I claim no expertise on basketball. For all I know, Tim Tebow will be starting for the Knicks next year!

I do claim some expertise on MRP, though. Nate described MRP as “not quite ‘hard’ data.” I don’t really know what Nate meant by “hard” data—ultimately, these are all just survey responses—but, in any case, I replied:

I guess MRP can mean different things to different people. All the MRP analyses I’ve ever published are entirely based on hard data. If you want to see something that’s a complete mess and is definitely overrated, try looking into the guts of classical survey weighting (see for example this paper). Meanwhile, Yair used MRP to do these great post-election summaries. Exit polls are a disaster; see for example here.

Published poll toplines are not the data, warts and all; they’re processed data, sometimes not adjusted for enough factors as in the notorious state polls in 2016. I agree with you that raw data is the best. Once you have raw data, you can make inferences for the population. That’s what Yair was doing. For understandable commercial reasons, lots of pollsters will release toplines and crosstabs but not raw data. MRP (or, more generally, RRP) is just a way of going from the raw data to make inference about the general population. It’s the general population (or the population of voters) that we care about. The people in the sample are just a means to an end.
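(For readers who haven’t seen the machinery: the poststratification step is just a weighted average of model-based cell estimates, with weights given by population counts. Here’s a toy sketch with made-up numbers.)

```python
# Toy poststratification step (the "P" in MRP). The cell estimates and population
# counts are made up for illustration; in practice the cell estimates come from a
# multilevel regression fit to the raw survey responses.
import numpy as np

cell_estimate = np.array([0.35, 0.55, 0.40, 0.60])   # modeled support in each demographic cell
cell_pop = np.array([1.2e6, 0.8e6, 2.0e6, 1.0e6])    # census count for each cell

population_estimate = np.sum(cell_estimate * cell_pop) / cell_pop.sum()
print("poststratified estimate:", round(float(population_estimate), 3))
```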

Anyway, if you do talk about MRP and how overrated it is, you might consider pointing people to some of those links to MRP successes. Hey, here’s another one: we used MRP to estimate public opinion on health care. MRP has quite a highlight reel, more like Lebron or Steph or KD than Carmelo, I’d say!

One thing I will say is that data and analysis go together:

– No modern survey is good enough to be able to just interpret the results without any adjustment. Nonresponse is just too big a deal. Every survey gets adjusted, but some don’t get adjusted well.

– No analysis method can do it on its own without good data. All the modeling in the world won’t help you if you have serious selection bias.

Yair added:

Maybe it’s just a particularly touchy week for Melo references.

Both Andy and I would agree that MRP isn’t a silver bullet. But nothing is a silver bullet. I’ve seen people run MRP with bad survey data, bad poststratification data, and/or bad covariates in a model that’s way too sparse, and then over-promise about the results. I certainly wouldn’t endorse that. On the other side, obviously I agree with Andy that careful uses of MRP have had many successes, and it can improve survey inferences, especially compared to traditional weighting.

I think maybe you’re talking specifically about election forecasting? I haven’t seen comparisons of your forecasts to YouGov or PredictWise or whatever else. My vague sense pre-election was that they were roughly similar, i.e., that the meaty part of the curves overlapped. Maybe I’m wrong and your forecasts were much better this time—but non-MRP forecasters have also done much worse than you, so is that an indictment of MRP, or are you just really good at forecasting?

More to my main point—in one of your recent podcasts, I remember you said something about how forecasts aren’t everything, and people should look at precinct results to try to get beyond the toplines. That’s roughly what we’ve been trying to do in our post-election project, which has just gotten started. We see MRP as a way to combine all the data—pre-election voter file data, early voting, precinct results, county results, polling—into a single framework. Our estimates aren’t going to be perfect, for sure, but hopefully an improvement over what’s been out there, especially at sub-national levels. I know we’d do better if we had a lot more polling data, for instance. FWIW I get questions from clients all the time about how demographic groups voted in different states. Without state-specific survey data, which is generally unavailable and often poorly collected/weighted, not sure what else you can do except some modeling like MRP.

Maybe you’d rather see the raw unprocessed data like the precinct results. Fair enough, sometimes I do too! My sense is the people who want that level of detail are in the minority of the minority. Still, we’re going to try to do things like show the post-processed MRP estimates, but also some of the raw data to give intuition. I wonder if you think this is the right approach, or if you think something else would be better.

And Ryan Enos writes:

To follow up on this—I think you’ll all be interested in seeing the back and forth between Nate and Lynn Vavreck, who was interviewing him. It was more of a discussion of tradeoffs between different approaches than a discussion of what is wrong with MRP. Nate’s MRP alternative was to do a poll in every district, which I think we can all agree would be nice – if not entirely realistic. Although, as Nate pointed out, some of the efforts from the NY Times this cycle made that seem more realistic. In my humble opinion, Lynn did a nice job pushing Nate on the point that, even with data like the NY Times polls, you are still moving beyond raw data by weighting and, as Andrew points out, we often don’t consider how complex this can be (I have a common frustration with academic research about how much out-of-the-box survey weights are used and abused).

I don’t actually pay terribly close attention to forecasting – but in my mind, Nate and everybody else in the business is doing a fantastic job and the YouGov MRP forecasts have been a revelation. From my perspective, as somebody who cares more about what survey data can teach us about human behavior and important political phenomena, I think MRP has been a revelation in that it has allowed us to infer opinion in places, such as metro areas, where it would otherwise be missing. This has been one of the most important advances in public opinion research in my lifetime. Where the “overrated” part becomes true is that, just like every other scientific advance, people can get too excited about what it can do without thinking about what assumptions are going into the method, and this can lead to believing it can do more than it can—but this is true of everything.

Yair, to your question about presentation—I am a big believer in raw data and I think combining the presentation of MRP with something like precinct results, despite the dangers of ecological error, can be really valuable because it can allow people to check MRP results with priors from raw data.

It’s fine to do a poll in every district but then you’d still want to do MRP in order to adjust for nonresponse, estimate subgroups of the population, study public opinion in between the districtwide polls, etc.

Scandal! Mister P appears in British tabloid.

Tim Morris points us to this news article:

And here’s the kicker:

Mister P.

Not quite as cool as the time I was mentioned in Private Eye, but it’s still pretty satisfying.

My next goal: Getting a mention in Sports Illustrated. (More on this soon.)

In all seriousness, it’s so cool when methods that my collaborators and I have developed are just out there, for anyone to use. I only wish Tom Little were around to see it happening.

P.S. Some commenters are skeptical, though:

I agree that polls can be wrong. The issue is not so much the size of the sample but rather that the sample can be unrepresentative. But I do think that polls provide some information; it’s better than just guessing.

P.P.S. Unrelatedly, Morris wrote, with Ian White and Michael Crowther, this article on using simulation studies to evaluate statistical methods.

Fake-data simulation. Yeah.

Horse-and-buggy era officially ends for survey research

Peter Enns writes:

Given the various comments on your blog about evolving survey methods (e.g., Of buggy whips and moral hazards; or, Sympathy for the Aapor), I thought you might be interested that the Roper Center has updated its acquisitions policy and is now accepting non-probability samples and other methods. This is an exciting move for the Roper Center.

Jeez. I wonder what the President of the American Association of Buggy-Whip Manufacturers thinks about that!

In all seriousness, let’s never forget that our inferences are only as good as our data. Whether your survey responses come by telephone, or internet, or any other method, you want to put in the effort to get good data from a representative sample, and then to adjust as necessary. There’s no easy solution; it just takes the usual eternal vigilance.

P.S. I’m posting this one now, rather than with the usual six-month delay, because you can now go to the Roper Center and get these polls. I didn’t want you to have to wait!

Name this fallacy!

It’s the fallacy of thinking that, just cos you’re good at something, everyone should be good at it, and if they’re not, they’re just being stubborn and doing it badly on purpose.

I thought about this when reading this line from Adam Gopnik in the New Yorker:

[Henry Louis] Gates is one of the few academic historians who do not disdain the methods of the journalist . . .

Gopnik’s article is fascinating, and I have no doubt that Gates’s writing is both scholarly and readable.

My problem is with Gopnik’s use of the word “disdain.” The implication seems to be that other historians could write like journalists if they felt like it, but they just disdain to do so, maybe because they think it would be beneath their dignity, or maybe because of the unwritten rules of the academic profession.

The thing that Gopnik doesn’t get, I think, is that it’s hard to write well. Most historians can’t write like A. J. P. Taylor or Henry Louis Gates. Sure, maybe they could approach that level if they were to work hard at it, but it would take a lot of work, a lot of practice, and it’s not clear this would be the best use of their time and effort.

For a journalist to say that most academics “disdain the methods of the journalist” would be like me saying that most journalists “disdain the methods of the statistician.” OK, maybe some journalists actively disdain quantitative thinking—the names David Brooks and Gregg Easterbrook come to mind—but mostly I think it’s the same old story: math is hard, statistics is hard, these dudes are doing their best but sometimes their best isn’t good enough, etc. “Disdain” has nothing to do with it. To not choose to invest years of effort into a difficult skill that others can do better, to trust in the division of labor and do your best at what you’re best at . . . that can be a perfectly reasonable decision. If an academic historian does careful archival work and writes it up in hard-to-read prose—not on purpose but just cos hard-to-read prose is what he or she knows how to write—that can be fine. The idea would be that a journalist could write it up later for others. No disdaining. Division of labor, that’s all. Not everyone on the court has to be a two-way player.

I had a similar reaction a few years ago to Steven Pinker’s claim that academics often write so badly because “their goal is not so much communication as self-presentation—an overriding defensiveness against any impression that they may be slacker than their peers in hewing to the norms of the guild. Many of the hallmarks of academese are symptoms of this agonizing self-consciousness . . .” I replied that I think writing is just not so easy, and our discussion continued here.

Anyway, here’s the question. This fallacy, of thinking that when people can’t do what you can do, they’re just being stubborn . . . is there a name for it? The Expertise Fallacy??

Give this one a good name, and we can add it to the lexicon.

Did blind orchestra auditions really benefit women?

You’re blind!
And you can’t see
You need to wear some glasses
Like D.M.C.

Someone pointed me to this post, “Orchestrating false beliefs about gender discrimination,” by Jonatan Pallesen criticizing a famous paper from 2000, “Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians,” by Claudia Goldin and Cecilia Rouse.

We’ve all heard the story. Here it is, for example, retold in a news article from 2013 that Pallesen links to and which I also found on the internet by googling *blind orchestra auditions*:

In the 1970s and 1980s, orchestras began using blind auditions. Candidates are situated on a stage behind a screen to play for a jury that cannot see them. In some orchestras, blind auditions are used just for the preliminary selection while others use it all the way to the end, until a hiring decision is made.

Even when the screen is only used for the preliminary round, it has a powerful impact; researchers have determined that this step alone makes it 50% more likely that a woman will advance to the finals. And the screen has also been demonstrated to be the source of a surge in the number of women being offered positions.

That’s what I remembered. But Pallesen tells a completely different story:

I have not once heard anything skeptical said about that study, and it is published in a fine journal. So one would think it is a solid result. But let’s try to look into the paper. . . .

Table 4 presents the first results comparing success in blind auditions vs non-blind auditions. . . . this table unambiguously shows that men are doing comparatively better in blind auditions than in non-blind auditions. The exact opposite of what is claimed.

Now, of course this measure could be confounded. It is possible that the group of people who apply to blind auditions is not identical to the group of people who apply to non-blind auditions. . . .

There is some data in which the same people have applied to both orchestras using blind auditions and orchestras using non-blind auditions, which is presented in table 5 . . . However, it is highly doubtful that we can conclude anything from this table. The sample sizes are small, and the proportions vary wildly . . .

In the next table they instead address the issue by regression analysis. Here they can include covariates such as number of auditions attended, year, etc, hopefully correcting for the sample composition problems mentioned above. . . . This is a somewhat complicated regression table. Again the values fluctuate wildly, with the proportion of women advanced in blind auditions being higher in the finals, and the proportion of men advanced being higher in the semifinals. . . . in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.

Hmmm . . . OK, we better go back to the original published article. I notice two things from the conclusion.

First, some equivocal results:

The question is whether hard evidence can support an impact of discrimination on hiring. Our analysis of the audition and roster data indicates that it can, although we mention various caveats before we summarize the reasons. Even though our sample size is large, we identify the coefficients of interest from a much smaller sample. Some of our coefficients of interest, therefore, do not pass standard tests of statistical significance and there is, in addition, one persistent result that goes in the opposite direction. The weight of the evidence, however, is what we find most persuasive and what we have emphasized. The point estimates, moreover, are almost all economically significant.

This is not very impressive at all. Some fine words, but the punchline seems to be that the data are too noisy to form any strong conclusions. And the bit about the point estimates being “economically significant”—that doesn’t mean anything at all. That’s just what you get when you have a small sample and noisy data: you get noisy estimates, so you can get big numbers.

But then there’s this:

Using the audition data, we find that the screen increases—by 50 percent—the probability that a woman will be advanced from certain preliminary rounds and increases by severalfold the likelihood that a woman will be selected in the final round.

That’s that 50% we’ve been hearing about. I didn’t see it in Pallesen’s post. So let’s look for it in the Goldin and Rouse paper. It’s gotta be in the audition data somewhere . . . Also let’s look for the “increases by severalfold”—that’s even more, now we’re talking effects of hundreds of percent.

The audition data are described on page 734:

We turn now to the effect of the screen on the actual hire and estimate the likelihood an individual is hired out of the initial audition pool. . . . The definition we have chosen is that a blind audition contains all rounds that use the screen. In using this definition, we compare auditions that are completely blind with those that do not use the screen at all or use it for the early rounds only. . . . The impact of completely blind auditions on the likelihood of a woman’s being hired is given in Table 9 . . . The impact of the screen is positive and large in magnitude, but only when there is no semifinal round. Women are about 5 percentage points more likely to be hired than are men in a completely blind audition, although the effect is not statistically significant. The effect is nil, however, when there is a semifinal round, perhaps as a result of the unusual effects of the semifinal round.

That last bit seems like a forking path, but let’s not worry about that. My real question is, Where’s that “50 percent” that everybody’s talkin bout?

Later there’s this:

The coefficient on blind [in Table 10] in column (1) is positive, although not significant at any usual level of confidence. The estimates in column (2) are positive and equally large in magnitude to those in column (1). Further, these estimates show that the existence of any blind round makes a difference and that a completely blind process has a somewhat larger effect (albeit with a large standard error).

Huh? Nothing’s statistically significant but the estimates “show that the existence of any blind round makes a difference”? I might well be missing something here. In any case, you shouldn’t be running around making a big deal about point estimates when the standard errors are so large. I don’t hold it against the authors—this was 2000, after all, the stone age in our understanding of statistical errors. But from a modern perspective we can see the problem.

Here’s another similar statement:

The impact for all rounds [columns (5) and (6)] [of Table 9] is about 1 percentage point, although the standard errors are large and thus the effect is not statistically significant. Given that the probability of winning an audition is less than 3 percent, we would need more data than we currently have to estimate a statistically significant effect, and even a 1-percentage-point increase is large, as we later demonstrate.

I think they’re talking about the estimates of 0.011 +/- 0.013 and 0.006 +/- 0.013. To say that “the impact . . . is about 1 percentage point” . . . that’s not right. The point here is not to pick on the authors for doing what everybody used to do, 20 years ago, but just to emphasize that we can’t really trust these numbers.
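To spell out why: plugging the quoted estimates and standard errors into conventional 95% intervals shows both intervals comfortably spanning zero.

```python
# The two estimates quoted above, with conventional 95% intervals; both include zero,
# so describing the impact as "about 1 percentage point" overstates the precision.
for est, se in [(0.011, 0.013), (0.006, 0.013)]:
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"estimate {est:+.3f}  95% CI [{lo:+.3f}, {hi:+.3f}]")
```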

Anyway, where’s the damn “50 percent” and the “increases by severalfold”? I can’t find it. It’s gotta be somewhere in that paper, I just can’t figure out where.

Pallesen’s objections are strongly stated but they’re not new. Indeed, the authors of the original paper were pretty clear about its limitations. The evidence was all in plain sight.

For example, here’s a careful take posted by BS King in 2017:

Okay, so first up, the most often reported findings: blind auditions appear to account for about 25% of the increase in women in major orchestras. . . . [But] One of the more interesting findings of the study that I have not often seen reported: overall, women did worse in the blinded auditions. . . . Even after controlling for all sorts of factors, the study authors did find that bias was not equally present in all moments. . . .

Overall, while the study is potentially outdated (from 2001…using data from 1950s-1990s), I do think it’s an interesting frame of reference for some of our current debates. . . . Regardless, I think blinding is a good thing. All of us have our own pitfalls, and we all might be a little better off if we see our expectations toppled occasionally.

So where am I at this point?

I agree that blind auditions can make sense—even if they did not have the large effects claimed in that 2000 paper, or indeed even if they have no aggregate relative effects on men and women at all. What about that much-publicized “50 percent” claim, or for that matter the not-so-well-publicized but even more dramatic “increases by severalfold”? I have no idea. I’ll reserve judgment until someone can show me where that result appears in the published paper. It’s gotta be there somewhere.

P.S. See comments for some conjectures on the “50 percent” and “severalfold.”