Dorothy Parker (2) vs. Simone Biles; Liebling advances

I was surprised to see so little action in the comments yesterday. Sure, Liebling’s an obscure figure (I guess at this point he’d be called a “cult writer,” and I just happen to be part of the cult, fan as I am of mid-twentieth-century magazine writing), but I’d’ve thought Bourdain would’ve aroused more interest. Anyway, the best comment was from Ethan, playing it straight and going for Liebling on the strength of his diversity of interests. Even though he comes from the Eaters category, he can talk about lots of other topics; in that way, he’s similar to Steve Martin, who broke out entirely from the Magicians category where he was situated. On the other side, the best comment in favor of Bourdain was from Sean, who endorsed the celebrity chef but said he went to one of Bourdain’s real-life talks and “left a little disappointed to hear what in large part amounted a collection of some of the best one-liners of No Reservations.”

For today we have the #2 ranked wit, the star of the Algonquin Round Table—no alcohol jokes in the comments, please—vs. the undisputed GOAT of gymnastics. Two completely different talents, and unfortunately only one can advance to the next round. Who should it be?

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Steve Martin (4) vs. David Letterman; Serena Williams advances

Yesterday’s matchup featured a food writer vs. a tennis player, two professions that are not known for public speaking. The best arguments came in the very first two comments. Jeff wrote:

Fisher’s first book was “Serve It Forth,” which seems like good advice in tennis, as well. So, you’d get a two-fer there.

That was fine, but not as good as Jonathan’s endorsement of Williams:

Serena would be great at an academic seminar. Just like academics, she has a contempt for referees, even while purporting to regard them as valuable. Just don’t let the Chair interrupt her!

Which was echoed by Diana:

I was going to root for Fisher (whom I have never read) because her victory would make Auden happy. But then I thought about it some more and realized how incapable anyone is of *making* Auden happy—or unhappy, for that matter. In “The More Loving One,” he writes:

Were all stars to disappear or die,
I should learn to look at an empty sky
And feel its total dark sublime,
Though this might take me a little time.

So with that motive gone or suspended, I vote for Williams. She’s likely to win a few matches before the end, and that’ll be fun. At the seminar itself, she might even treat us to a serve or two (not to mention a referee chew-out, as Jonathan noted). What could go wrong?

Most of that bit was irrelevant, but I’m a sucker for Auden so I liked it anyway.

Today the competition is a bit more serious. Steve Martin is seeded #4 in the Magicians category even though magic is not one of his main talents; and David Letterman, though unseeded in the TV personalities category, knows how to handle an audience. You can take it from there.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Causal inference data challenge!

Susan Gruber, Geneviève Lefebvre, Tibor Schuster, and Alexandre Piché write:

The ACIC 2019 Data Challenge is Live!
Datasets are available for download (no registration required) at (bottom of the page).
Check out the FAQ at
The deadline for submitting results is April 15, 2019.

The fourth Causal Inference Data Challenge is taking place as part of the 2019 Atlantic Causal Inference Conference (ACIC) to be held in Montreal, Canada.
The data challenge focuses on computational methods of inferring causal effects from quasi-real world data. This year there are two tracks: low dimensional and high dimensional data. Participants will analyze 3200 datasets in either Track 1 or Track 2 to estimate marginal additive treatment effects and associated 95% confidence intervals. Entries will be evaluated with respect to bias, variance, mean squared error, and confidence interval coverage across a variety of data generating processes.

I’m not a big fan of 95% intervals, and I am aware of the general problems arising from this sort of competition: the problems in the contest are not necessarily similar to the problems to which a particular method might be applied. That said, Jennifer has assured me that she and others learned a lot from the results of previous competitions in this series, so on that basis I encourage all of you to take a look and check out this one.
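For concreteness, here’s a minimal sketch of how the four criteria named in the announcement (bias, variance, mean squared error, and interval coverage) could be computed for a batch of entries. This is not the organizers’ scoring code, which I haven’t seen; the function name `score_entries`, the toy numbers, and the simplifying assumption that every dataset shares one true effect are all made up for illustration.

```python
from statistics import fmean, pvariance


def score_entries(true_effect, entries):
    """Score a list of (estimate, ci_lower, ci_upper) tuples, one per dataset.

    Hypothetical helper: assumes every dataset has the same true effect.
    """
    estimates = [est for est, _, _ in entries]
    bias = fmean(estimates) - true_effect
    variance = pvariance(estimates)  # spread of the estimates around their own mean
    mse = fmean([(est - true_effect) ** 2 for est in estimates])
    # Fraction of intervals that contain the true effect.
    coverage = fmean([1.0 if lo <= true_effect <= hi else 0.0
                      for _, lo, hi in entries])
    return {"bias": bias, "variance": variance, "mse": mse, "coverage": coverage}


# Toy, made-up entries for four datasets whose true effect is 1.0.
toy_entries = [(0.8, 0.4, 1.2), (1.1, 0.7, 1.5), (1.5, 1.1, 1.9), (1.0, 0.6, 1.4)]
scores = score_entries(1.0, toy_entries)
print(scores)
```

On the toy data the mean squared error equals the variance plus the squared bias, which is a handy sanity check on any scoring script of this kind.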

Oscar Wilde (1) vs. Joe Pesci; the Japanese dude who won the hot dog eating contest advances

Raghuveer gave a good argument yesterday: “The hot dog guy would eat all the pre-seminar cookies, so that’s a definite no.” But this was defeated by the best recommendation we’ve ever had in the history of the Greatest Seminar Speaker contest, from Jeff:

Garbage In, Garbage Out: Mass Consumption and Its Aftermath
Takeru Kobayashi

Note: Attendance at both sessions is mandatory.

Best. Seminar. Ever.

So hot dog guy is set to go to the next round, against today’s victor.

It’s the wittiest man who ever lived, vs. an unseeded entry in the People from New Jersey category. So whaddya want: some 125-year-old jokes, or a guy who probably sounds like a Joe Pesci imitator? You think I’m funny? I’m funny how, I mean funny like I’m a clown, I amuse you?

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Storytelling: What’s it good for?

A story can be an effective way to send a message. Anna Clemens explains:

Why are stories so powerful? To answer this, we have to go back at least 100,000 years. This is when humans started to speak. For the following roughly 94,000 years, we could only use spoken words to communicate. Stories helped us survive, so our brains evolved to love them.

Paul Zak of the Claremont Graduate University in California researches what stories do to our brain. He found that once hooked by a story, our brain releases oxytocin. The hormone affects our mood and social behaviour. You could say stories are a shortcut to our emotions.

There’s more to it; stories also help us remember facts. Gordon Bower and Michal Clark from Stanford University in California let two groups of subjects remember random nouns. One group was instructed to create a narrative with the words, the other to rehearse them one by one. People in the story group recalled the nouns correctly about six to seven times more often than the other group.

But my collaborator Thomas Basboll is skeptical:

It seems to me that a paper that has been written to mimic the most compelling features of Hollywood blockbusters (which Anna explicitly invokes) is also, perhaps unintentionally, written to avoid critical engagement. Indeed, when Anna talks about “characters” she does not mention the reader as a character in the story, even though the essential “drama” of any scientific paper stems from the conversation that reader and writer are implicitly engaged in. The writer is not simply trying to implant an idea in the mind of the reader. In a research paper, we are often challenging ideas already held and, crucially, opening our own thinking to those ideas and the criticism they might engender.

Basboll elaborates:

Anna promises that storytelling can produce papers that are “concise, compelling, and easy to understand”. But I’m not sure that a scientific paper should actually be compelling. . . . A scientific paper should be vulnerable to criticism; it should give its secrets away freely, unabashedly. And the best way to do that is, not to organise it with the aim of releasing oxytocin in the mind of the reader, but by clearly identifying your premises and your conclusions and the logic that connects them. You are not trying to bring your reader to a narrative climax. You are trying to be upfront about where your argument will collapse under the weight of whatever evidence the reader may bring to the conversation. Science, after all, is not so much about what Coleridge called “the suspension of disbelief” as what Merton called “organised skepticism”.

In our article from a few years ago, Basboll and I wrote about how we as scientists learn from stories. In discourse about science communication, stories are typically presented as a way for scientists to frame, explain, and promote their already-formed ideas; in our article, Basboll and I looked from a different direction, considering how it is that scientists can get useful information from stories. We concluded that stories are a form of model checking, that a good story expresses true information that contradicts some existing model of the world.

Basboll’s above exchange with Clemens is interesting in a different way: Clemens is saying that stories are an effective way to communicate because they are compelling and memorable. Basboll replies that science shouldn’t always be compelling: so much of scientific work is mistakes, false starts, blind alleys, etc., so you want the vulnerabilities of any scientific argument to be clear.

The resolution, I suppose, is to use stories—but not in a way that hides the potential weaknesses of a scientific argument. Instead, harness the power of storytelling to make it easier for readers to spot the flaws.

The point is that there are two dimensions to scientific communication:

1. The medium of expression. Storytelling can be more effective than a dry sequence of hypothesis, data, results, conclusion.

2. The goal of communication. Instead of presenting a wrapped package of perfection, our explanation should have lots of accessible points: readers should be able to pull the strings so the arguments can unravel, if that is possible.

P.S. More on this from Basboll here.

How post-hoc power calculation is like a shit sandwich

Damn. This story makes me so frustrated I can’t even laugh. I can only cry.

Here’s the background. A few months ago, Aleksi Reito (who sent me the adorable picture above) pointed me to a short article by Yanik Bababekov, Sahael Stapleton, Jessica Mueller, Zhi Fong, and David Chang in Annals of Surgery, “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science,” which contained some reasonable ideas but also made a common and important statistical mistake.

I was bothered to see this mistake in an influential publication. Instead of blogging it, this time I decided to write a letter to the journal, which they pretty much published as is.

My letter went like this:

An article recently published in the Annals of Surgery states: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%---with the given sample size and effect size observed in that study”. This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power . . . The problem is well known in the statistical and medical literatures . . . That said, I agree with much of the content of [Bababekov et al.] . . . I appreciate the concerns of [Bababekov et al.] and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just a problem with their recommendation to calculate power using observed effect sizes.

I was surgically precise, focusing on the specific technical error in their paper and separating this from their other recommendations.

And the letter was published, with no hassle! Not at all like my frustrating experience with the American Sociological Review.

So I thought the story was over.

But then my blissful slumber was interrupted when I received another email from Reito, pointing to a response in that same journal by Bababekov and Chang to my letter and others. Bababekov and Chang write:

We are greatly appreciative of the commentaries regarding our recent editorial . . .

So far, so good! But then:

We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. . . . We also respectfully disagree that knowing the power after the fact is not useful in surgical science.

No! My problem is not that their recommended post-hoc power calculations are “mathematically redundant”; my problem is that their recommended calculations will give wrong answers because they are based on extremely noisy estimates of effect size. To put it in statistical terms, their recommended method has bad frequency properties.

I completely agree with the authors that “knowing the power after the fact” can be useful, both in designing future studies and in interpreting existing results. John Carlin and I discuss this in our paper. But the authors’ recommended procedure of taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power. Not the same thing at all.

Here’s an example. Suppose you have 200 patients: 100 treated and 100 control, and post-operative survival is 94 for the treated group and 90 for the controls. Then the raw estimated treatment effect is 0.04 with standard error sqrt(0.94*0.06/100 + 0.90*0.10/100) = 0.04. The estimate is just one s.e. away from zero, hence not statistically significant. And the crudely estimated post-hoc power, using the normal distribution, is approximately 16% (the probability of observing an estimate at least 2 standard errors away from zero, conditional on the true parameter value being 1 standard error away from zero). But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to power being somewhere between 5% (if the true population effect size happened to be zero) and roughly 85% (if the true effect size were three standard errors from zero). That’s what I call noisy.
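The arithmetic in that example can be checked with a few lines of Python. This is a rough sketch under the same normal approximation; the function name `two_sided_power` and the default cutoff of 2 standard errors (matching the crude calculation above, not the conventional 1.96) are my choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_power(true_effect_in_se, threshold=2.0):
    """P(|estimate| > threshold s.e.) when estimate ~ Normal(true effect, 1 s.e.).

    Everything is measured in units of the standard error.
    """
    z = NormalDist()
    upper_tail = 1 - z.cdf(threshold - true_effect_in_se)
    lower_tail = z.cdf(-threshold - true_effect_in_se)
    return upper_tail + lower_tail


# Numbers from the example: 94/100 survive in the treated group, 90/100 in control.
p_treated, p_control, n = 0.94, 0.90, 100
estimate = p_treated - p_control
se = sqrt(p_treated * (1 - p_treated) / n + p_control * (1 - p_control) / n)
print(f"estimate = {estimate:.3f}, s.e. = {se:.3f}")

# Power if the true effect is 0, 1, or 3 standard errors from zero.
for k in (0.0, 1.0, 3.0):
    print(f"true effect = {k:.0f} s.e. -> power = {two_sided_power(k):.3f}")
```

With the 2-s.e. cutoff this gives power of about 0.05, 0.16, and 0.84 at zero, one, and three standard errors; passing `threshold=1.96` moves those to roughly 0.05, 0.17, and 0.85. Either way, the range is enormous.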

Here’s an analogy that might help. Suppose someone offers me a shit sandwich. I’m not gonna want to eat it. My problem is not that it’s a sandwich, it’s that it’s filled with shit. Give me a sandwich with something edible inside; then we can talk.

I’m not saying that the approach that Carlin and I recommend—performing design analysis using substantively-based effect size estimates—is trivial to implement. As Bababekov and Chang write in their letter, “it would be difficult to adapt previously reported effect sizes to comparative research involving a surgical innovation that has never been tested.”

Fair enough. It’s not easy, and it requires assumptions. But that’s the way it works: if you want to make a statement about power of a study, you need to make some assumption about effect size. Make your assumption clearly, and go from there. Bababekov and Chang write: “As such, if we want to encourage the reporting of power, then we are obliged to use observed effect size in a post hoc fashion.” No, no, and no. You are not obliged to use a super-noisy estimate. You were allowed to use scientific judgment when performing that power analysis you wrote for your grant proposal, before doing the study, and you’re allowed to use scientific judgment when doing your design analysis, after doing the study.
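To make “make your assumption clearly, and go from there” concrete, here’s a minimal sketch of a design analysis that takes an assumed effect size and a standard error and reports the power, the chance that a significant estimate has the wrong sign, and the typical exaggeration of significant estimates. It’s written loosely in the spirit of the approach described above, not copied from any published code; the function name `design_analysis`, the assumed effect of 0.02, the simulation size, and the seed are all hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, fmean


def design_analysis(assumed_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, wrong-sign rate, exaggeration ratio) under a normal model.

    assumed_effect must be positive and should come from outside
    information, not from the noisy estimate in the study itself.
    """
    if assumed_effect <= 0:
        raise ValueError("this sketch assumes a positive assumed_effect")
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)          # two-sided critical value
    lam = assumed_effect / se                # assumed effect in s.e. units
    p_sig_positive = 1 - z.cdf(crit - lam)   # significant, correct sign
    p_sig_negative = z.cdf(-crit - lam)      # significant, wrong sign
    power = p_sig_positive + p_sig_negative
    wrong_sign_rate = p_sig_negative / power

    # Exaggeration ratio by simulation: average |estimate| among the
    # significant replications, divided by the assumed effect.
    rng = random.Random(seed)
    draws = [rng.gauss(assumed_effect, se) for _ in range(n_sims)]
    significant = [abs(d) for d in draws if abs(d) > crit * se]
    exaggeration = fmean(significant) / assumed_effect
    return power, wrong_sign_rate, exaggeration


# Hypothetical input: an assumed true effect of 0.02 with the s.e. of 0.04
# from the example above.
power, wrong_sign_rate, exaggeration = design_analysis(0.02, 0.04)
print(f"power = {power:.3f}, wrong sign = {wrong_sign_rate:.3f}, "
      f"exaggeration = {exaggeration:.1f}x")
```

Under that made-up assumption the power comes out near 8%, and close to 9% of significant results would have the wrong sign. The output is only as good as the assumed effect, which is exactly why that assumption should be stated openly.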

The whole thing is so frustrating.

Look. I can’t get mad at the authors of this article. They’re doing their best, and they have some good points to make. They’re completely right that authors and researchers should not “misinterpret P > 0.05 to mean comparison groups are equivalent or ‘not different.’” This is an important point that’s not well understood; indeed my colleagues and I recently wrote a whole paper on the topic, actually in the context of a surgical example. Statistics is hard. The authors of this paper are surgeons, not statisticians. I’m a statistician and I don’t know anything about surgery; no reason to expect these two surgeons to know anything about statistics. But, it’s still frustrating.

P.S. After writing the above post a few months ago, I submitted it (without some features such as the “shit sandwich” line) as a letter to the editor of the journal. To its credit, the journal is publishing the letter. So that’s good.

New blog hosting!

Hi all. We’ve been having some problems with the blog caching, so that people were seeing day-old versions of the posts and comments. We moved to a new host and a new address, and all should be better.

Still a couple glitches, though. Right now it doesn’t seem to be possible to comment. We hope to get that fixed soon (unfortunately it’s Friday evening and I don’t know if anyone’s gonna look at it over the weekend), will let you know when comments work again. Regularly scheduled posts will continue to appear.

Comments work too now!

Babe Didrikson Zaharias (2) vs. Adam Schiff; Sid Caesar advances

And our noontime competition continues . . .

We had some good arguments on both sides yesterday.

Jonathan writes:

In my experience, comedians are great when they’re on-stage and morose and unappealing off-stage. Sullivan, on the other hand, was morose and unappealing on-stage, and witty and charming off-stage, or so I’ve heard. This comes down, then, to deciding whether the speaker treats the seminar as a stage or not. I don’t think Sullivan would, because it’s not a “rilly big shew.”

That’s some fancy counterintuitive reasoning: Go with Sullivan because he won’t take it seriously so his pleasant off-stage personality will show up.

On the other hand, Zbicyclist goes with the quip:

Your Show of Shows -> Your Seminar of Seminars.

Render unto Caesar.

I like it. Sid advances.

For our next contest, things get more interesting. In one corner, the greatest female athlete of all time, an all-sport trailblazer. In the other, the chairman of the United States House Permanent Select Committee on Intelligence, who’s been in the news lately for his investigation of Russian involvement in the U.S. election. He knows all sorts of secrets.

If the seminar’s in the statistics department, Babe, no question. For the political science department, it would have to be Adam. But this is a university-wide seminar (inspired by this Latour-fest, remember?), so I think they both have a shot.

The post Babe Didrikson Zaharias (2) vs. Adam Schiff; Sid Caesar advances appeared first on Statistical Modeling, Causal Inference, and Social Science.

Ed Sullivan (3) vs. Sid Caesar; DJ Jazzy Jeff advances

Yesterday’s battle (Philip Roth vs. DJ Jazzy Jeff) was pretty low-key. It seems that this blog isn’t packed with fans of ethnic literature or hip-hop. Nobody in comments even picked up on my use of the line, “Does anyone know these people? Do they exist or are they spooks?” Isaac gave a good argument in favor of Roth: “Given how often Uncle Phil threw DJ Jazzy Jeff out of the house, it seems like he should win here,” but I’ll have to give it to Jazz, based on Jrc’s comment: “From what I hear, Roth was only like the 14th coolest Jew at Weequahic High School (which, by my math, makes him about the 28th coolest kid there). And we all know DJ Jazzy Jeff was the second coolest kid at Bel-Air Academy.” Good point.

Our next contest features two legendary TV variety show hosts who, at the very least, can tell first-hand stories about Elvis Presley, the Beatles, Mel Brooks, Woody Allen, and many others. Should be fun.

The full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!