2018: What really happened?

We’re always discussing election results on three levels: their direct political consequences, their implications for future politics, and what we can infer about public opinion.

In 2018 the Democrats broadened their geographic base, as we can see in this graph from Yair Ghitza:

Party balancing

At the national level, what happened is what we expected to happen two weeks ago, two months ago, and two years ago: the Democrats bounced back. Their average district vote for the House of Representatives increased by enough to give them clear control of the chamber, even in the face of difficulties of geography and partisan districting.

This was party balancing, which we talked about a few months ago: At the time of the election, the Republicans controlled the executive branch, both houses of congress, and the judiciary, so it made sense that swing voters were going to swing toward the Democrats. Ironically, one reason the Democrats did not regain the Senate in 2018 is . . . party balancing in 2016! Most people thought Hillary Clinton would win the presidency, so lots of people voted Republican for congress to balance that.

The swing in votes toward the Democrats was large (in the context of political polarization). As Nate Cohn wrote, the change in seats was impressive, given that there weren’t very many swing districts for the Democrats to aim for.

Meanwhile, as expected, the Senate remained in Republican control. Some close races went 51-49 rather than 49-51, which doesn’t tell us much about public opinion but is politically consequential.

Where did it happen?

The next question is geographic. Nationally, voters swung toward the Democrats. I was curious where this happened, so I did some googling and found this map by Josh Holder, Cath Levett, Daniel Levitt, and Peter Andringa:

This map omits districts that were uncontested in one election or the other, so I suspect it understates the swing, but it gives the general idea.

Here’s another way to look at the swings.

Yair made a graph of the vote swing from 2016 to 2018, for each election contested in both years, plotted against the Democratic share of the two-party vote in 2016.

The result was pretty stunning—so much that I put the graph at the top of this post. So please scroll up and take a look again, then scroll back down here to keep reading.

Here’s the key takeaway. The Democrats’ biggest gains were in districts where the Republicans were dominant.

In fact, if you look at the graph carefully (and you also remember that we’re excluding uncontested elections, so we’re missing part of the story), you see the following:
– In strong Republican districts (D’s receiving less than 40% of the vote in 2016), Democrats gained almost everywhere, with an average gain of, ummm, it looks like about 8 percentage points.
– In swing districts (D’s receiving 40-60% of the vote in 2016), D’s improved, but only by about 4 percentage points on average. A 4% swing in the vote is a lot, actually! It’s just not 8%.
– In districts where D’s were already dominating, the results were, on average, similar to what happened in 2016.

I don’t know how much this was a national strategy and how much it just happened, but let me point out two things:

1. For the goal of winning the election, it would have been to the Democrats’ advantage to concentrate their gains in the zone where they’d received between 40 and 55% of the vote in the previous election. Conversely, these are the places where the Republicans would’ve wanted to focus their efforts too.

2. Speaking more generally, the Democrats have had a problem, both at the congressional and presidential levels, of “wasted votes”: winning certain districts with huge majorities and losing a lot of the closer districts. Thus, part of Democratic strategy has been to broaden their geographic base. The above scatterplot suggests that the 2018 election was a step in the right direction for them in this regard.

Not just a statistical artifact

When Yair sent me that plot, I had a statistical question: Could it be “regression to the mean”? We might expect, absent any election-specific information, that the D’s would improve in districts where they’d done poorly, and they’d decline in districts where they’d done well. So maybe I’ve just been overinterpreting a pattern that tells us nothing interesting at all?
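
To see why this worry is natural, here’s a minimal simulation sketch (entirely made-up data, not Yair’s): give each district a stable partisan lean plus independent election-to-election noise, and plot the resulting “swing” against the earlier result. Regression to the mean shows up as a downward slope even though nothing systematic has changed.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_districts = 400

# Each district has a stable Democratic lean; each election adds independent noise.
lean = rng.uniform(0.25, 0.75, n_districts)
dem_2016 = lean + rng.normal(0, 0.04, n_districts)
dem_2018 = lean + rng.normal(0, 0.04, n_districts)
swing = dem_2018 - dem_2016

plt.scatter(dem_2016, swing, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Democratic share of two-party vote, 2016 (simulated)")
plt.ylabel("Swing, 2016 to 2018 (simulated)")
plt.title("Noise alone produces a downward-sloping swing plot")
plt.show()
```

In this noise-only world, districts that happened to look bad for the Democrats in 2016 tend to bounce back in 2018 purely by chance, which is why the pattern in Yair’s plot needs the check that follows.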

To address this possible problem, Yair made two more graphs, repeating the above scatterplot, but showing the 2014-to-2016 shift vs. the 2014 results, and the 2012-to-2014 shift vs. the 2012 results. Here’s what he found:

So the answer is: No, it’s not regression to the mean, it’s not a statistical artifact. The shift from 2016 to 2018—the Democrats gaining strength in Republican strongholds—is real. And it can have implications for statewide and presidential elections as well. This is also consistent with results we saw in various special elections during the past two years.

The current narrative is wrong

As Yair puts it:

Current narrative: Dems did better in suburban/urban, Reps did better in rural, continuing trend from 2012-2016. I [Yair] am seeing the opposite.

This isn’t increasing polarization/sorting. This also isn’t mean-reversion. D areas stayed D, R areas jumped ship to a large degree. A lot of these are the rural areas that went the other way from 2012-2016.

Not just in CD races, also in Gov and Senate races . . . Massive, 20+ point shift in margin in Trump counties. Remember this is with really high turnout. . . . This offsets the huge shift from 2012-2016, often in the famous “Obama-Trump” counties.

Yair adds:

The reason people are missing this story right now: focusing on who won/lost means they’re looking at places that went from just below 50 to just above 50. Obviously that’s more important for who actually governs. But this is the big public opinion shift.

I suspect that part of this could be strategy from both parties.

– On one side, the Democrats knew they had a problem in big swathes of the country and they made a special effort to run strong campaigns everywhere: part of this was good sense given their good showing in special elections, and part of it was an investment in their future, to lay out a strong Democratic brand in areas where they’ll need to be competitive in future statewide and presidential elections.

– On the other side, the Republicans had their backs to the wall, so they focused their efforts on the seats they needed to hold to have any chance of maintaining their House majority.

From that standpoint, the swings above do not completely represent natural public opinion swings from national campaigns. But they are meaningful: they’re real votes, and they’re in places where the Democrats need to gain votes in the future.

There are also some policy implications: If Democratic challengers are more competitive in previously solid Republican districts, enough so that the Republican occupants of these seats are more afraid of losing centrist votes in future general elections than losing votes on the right in future primaries, this could motivate these Republicans to vote more moderately in congress. I don’t know, but it seems possible.

I sent the above to Yair, and he added the following comments:

1. One important nuance is that this is conditional on competitiveness. In the House, some have suggested that Clinton/Trump 2016 is a better baseline than House 2016 vote, because that was a competitive race and many of these places weren’t competitive at all in 2016. When you use President 2016 as a baseline, it’s true that it looks more like a uniform swing. I don’t really buy this, though: (a) as has been widely reported, the Democrats’ explicit strategy was to broaden the map and try to run competitive candidates everywhere, because you never know which seats you’re gonna pick up. (b) Regardless of competitiveness, the fact remains that way more people in R areas voted for D’s this time. After nearly a decade of increasing polarization, many people saying there aren’t swing voters (which was always ridiculous), etc., this remains an important shift.

2. Related—in statewide races, the story is more complicated. Here are maps and plots of changes by county in the statewide races:

Much of the gain in R places was in non-competitive races, and some have commented that a lot of these places are always more moderate in statewide races. It’s not totally clear though, and this becomes a chicken-and-egg problem. A lot of the swing came in the Midwest, for example, in races that weren’t considered super-competitive in the end. But why not? Two years ago you might have thought that Sherrod Brown or Debbie Stabenow’s races in OH/MI could be up for grabs. They ended up being “not competitive” but maybe the causal arrow went: R defectors –> polling data shows a big lead –> people pull out. Frankly I don’t know the dynamics of all of these races well enough to say. But I do think it’s interesting that in some of the highest profile races, we did NOT see R defectors in the same way. I’m sure some will feel this means that, when push comes to shove and Rs put a lot of effort into a competitive campaign, we won’t necessarily see the same things.

3. Ceiling effects/wave dynamics—some people pointed out that you won’t see movement in Dem places in a Dem year because there’s nowhere to go. I don’t buy this. Reversion to the mean can be the expectation every time, and it hasn’t happened in a long time. In other words, I’m not sure where the ceiling is. Before the election, a scenario where the Democrats gained ground in urban areas and lost some in rural could have seemed perfectly plausible. Indeed that’s still the story being written to a large extent. I also do wonder whether familiarity with these statistical concepts (reversion to the mean, ceiling effects) makes smart people underestimate what happened on the ground—many Republicans in Republican areas voting for a Democrat for the first time in a long time. This is an important story as it goes against trends in increasing sorting/polarization.


Melanie Mitchell says, “As someone who has worked in A.I. for decades, I’ve witnessed the failure of similar predictions of imminent human-level A.I., and I’m certain these latest forecasts will fall short as well.”

Melanie Mitchell‘s piece, Artificial Intelligence Hits the Barrier of Meaning (NY Times behind limited paywall), is spot-on regarding the hype surrounding the current A.I. boom. It’s soon to come out in book length from FSG, so I suspect I’ll hear about it again in the New Yorker.

Like Professor Mitchell, I started my Ph.D. at the tail end of the first A.I. revolution. Remember, the one based on rule-based expert systems? I went to Edinburgh to study linguistics and natural language processing because it was strong in A.I., computer science theory, linguistics, and cognitive science.

On which natural language tasks can computers outperform or match humans? Search is good, because computers are fast and it’s a task at which humans aren’t so hot. That includes things like speech-based call routing in heterogeneous call centers (something I worked on at Bell Labs).

Then there’s spell checking. That’s fantastic. It leverages simple statistics about word frequency and typos/brainos and is way better than most humans at spelling. The same algorithms are used for speech recognition and RNA-seq alignment to the genome. These all sprang out of Claude Shannon’s 1948 paper, “A Mathematical Theory of Communication”, which has over 100K citations. It introduced, among other things, n-gram language models at the character and word level (still used for speech recognition and classification today with different estimators). As far as I know that paper contained the first posterior predictive checks—generating examples from the trained language models and comparing them to real language. David MacKay’s info theory book (the only ML book I actually like) is a great introduction to this material, and even BDA3 added a spell-checking example. But it’s hardly A.I. in the big “I” sense of “A.I.”.
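
As a toy illustration of the kind of n-gram language model Shannon described (a minimal sketch, not the production systems mentioned above; the corpus.txt filename is just a placeholder for whatever plain-text file you train on), here is a character-level bigram model you can fit and then sample from, which is essentially the posterior predictive check of generating text and seeing how language-like it looks:

```python
from collections import Counter, defaultdict
import random

def train_bigram(text):
    """Count character-to-character transitions in the training text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start="t", n=200, seed=0):
    """Sample n characters from the fitted bigram model."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

# "corpus.txt" is a placeholder: any plain-text file will do.
corpus = open("corpus.txt", encoding="utf-8").read().lower()
model = train_bigram(corpus)
print(generate(model, start="t"))
```

A bigram model produces gibberish with roughly the right letter statistics; higher-order models look progressively more language-like, which was Shannon’s original demonstration.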

Speech recognition has made tremendous strides (I worked on it at Bell Labs in the late 90s, then at SpeechWorks in the early 00s), but its performance is still so far short of human levels as to make the difference more qualitative than quantitative, a point Mitchell makes in her essay. It would no more fool you into thinking it was human than an animatronic Disney character bolted to the floor. Unlike games like chess or Go, it’s going to be hard to do better than people at language, though it would certainly be possible. But it would be hard to do it the same way they built, say, Deep Blue, the IBM chess-playing hardware that evaluated so many gazillions of board positions per turn with very clever heuristics to prune search. That didn’t play chess like a human, and if the better-than-human language were like that, humans wouldn’t understand it. IBM Watson (the natural-language Jeopardy-playing computer) was closer to behaving like a human with its chain of associative reasoning—to me, that’s the closest we’ve gotten to something I’d call “A.I.”. It’s a shame IBM has oversold it since then.

Human-level general-purpose A.I. is going to be an incredibly tough nut to crack. I don’t see any reason it’s an insurmountable goal, but it’s not going to happen in a decade without a major breakthrough. Better classifiers just aren’t enough. People are very clever, insanely good at subtle chains of associative reasoning (though not so great at logic) and learning from limited examples (Andrew’s sister Susan Gelman, a professor at Michigan, studies concept learning by example). We’re also very contextually aware and focused, which allows us to go deep, but can cause us to miss the forest for the trees.


Postdocs and Research fellows for combining probabilistic programming, simulators and interactive AI

Here’s a great opportunity for those interested in probabilistic programming and workflows for Bayesian data analysis:

We (including me, Aki) are looking for outstanding postdoctoral researchers and research fellows to work on an exciting new project at the crossroads of probabilistic programming, simulator-based inference, and user interfaces. You will have an opportunity to work with top research groups in the Finnish Center for Artificial Intelligence, at both Aalto University and the University of Helsinki, and to cooperate with several industry partners.

The topics for which we are recruiting are

  • Machine learning for simulator-based inference
  • Intelligent user interfaces and techniques for interacting with AI
  • Interactive workflow support for probabilistic programming based modeling

Find the full descriptions here


Cornell prof (but not the pizzagate guy!) has one quick trick to getting 1700 peer reviewed publications on your CV

From the university webpage:

Robert J. Sternberg is Professor of Human Development in the College of Human Ecology at Cornell University. . . . Sternberg is the author of over 1700 refereed publications. . . .

How did he compile over 1700 refereed publications? Nick Brown tells the story:

I [Brown] was recently contacted by Brendan O’Connor, a graduate student at the University of Leicester, who had noticed that some of the text in Dr. Sternberg’s many articles and chapters appeared to be almost identical. . . .

Exhibit 1 . . . this 2010 article by Dr. Sternberg was basically a mashup of this article of his from the same year and this book chapter of his from 2002. One of the very few meaningful differences in the chunks that were recycled between the two 2010 articles is that the term “school psychology” is used in the mashup article to replace “cognitive education” from the other; this may perhaps not be unrelated to the fact that the former was published in School Psychology International (SPI) and the latter in the Journal of Cognitive Education and Psychology (JCEP). If you want to see just how much of the SPI article was recycled from the other two sources, have a look at this. Yellow highlighted text is copied verbatim from the 2002 chapter, green from the JCEP article. You can see that about 95% of the text is in one or the other colour . . .

Brown remarks:

Curiously, despite Dr. Sternberg’s considerable appetite for self-citation (there are 26 citations of his own chapters or articles, plus 1 of a chapter in a book that he edited, in the JCEP article; 25 plus 5 in the SPI article), neither of the 2010 articles cites the other, even as “in press” or “manuscript under review”; nor does either of them cite the 2002 book chapter. If previously published work is so good that you want to copy big chunks from it, why would you not also cite it?

Hmmmmm . . . I have an idea! Sternberg wants to increase his citation count. So he cites himself all the time. But he doesn’t want people to know that he publishes essentially the same paper over and over again. So in those cases, he doesn’t cite himself. Cute, huh?

Brown continues:

Exhibit 2

Inspired by Brendan’s discovery, I [Brown] decided to see if I could find any more examples. I downloaded Dr. Sternberg’s CV and selected a couple of articles at random, then spent a few minutes googling some sentences that looked like the kind of generic observations that an author in search of making “efficient” use of his time might want to re-use. On about the third attempt, after less than ten minutes of looking, I found a pair of articles, from 2003 and 2004, by Dr. Sternberg and Dr. Elena Grigorenko, with considerable overlaps in their text. About 60% of the text in the later article (which is about the general school student population) has been recycled from the earlier one (which is about gifted children) . . .

Neither of these articles cites the other, even as “in press” or “manuscript in preparation”.

And there’s more:

Exhibit 3

I [Brown] wondered whether some of the text that was shared between the above pair of articles might have been used in other publications as well. It didn’t take long(*) to find Dr. Sternberg’s contribution (chapter 6) to this 2012 book, in which the vast majority of the text (around 85%, I estimate) has been assembled almost entirely from previous publications: chapter 11 of this 1990 book by Dr. Sternberg (blue), this 1998 chapter by Dr. Janet Davidson and Dr. Sternberg (green), the above-mentioned 2003 article by Dr. Sternberg and Dr. Grigorenko (yellow), and chapter 10 of this 2010 book by Dr. Sternberg, Dr. Linda Jarvin, and Dr. Grigorenko (pink). . . .

Once again, despite the fact that this chapter cites 59 of Dr. Sternberg’s own publications and another 10 chapters by other people in books that he (co-)edited, none of those citations are to the four works that were the source of all the highlighted text in the above illustration.

Now, sometimes one finds book chapters that are based on previous work. In such cases, it is the usual practice to include a note to that effect. And indeed, two chapters (numbered 26 and 27) in that 2012 book edited by Dr. Dawn Flanagan and Dr. Patti Harrison contain an acknowledgement along the lines of “This chapter is adapted from [source]. Copyright 20xx by [copyright holder]. Adapted by permission.” But there is no such disclosure in chapter 6.

Exhibit 4

It appears that Dr. Sternberg has assembled a chapter almost entirely from previous work on more than one occasion. Here’s a recent example of a chapter made principally from his earlier publications. . . .

This chapter cites 50 of Dr. Sternberg’s own publications and another 7 chapters by others in books that he (co-)edited. . . .

However, none of the citations of that book indicate that any of the text taken from it is being correctly quoted, with quote marks (or appropriate indentation) and a page number. The four other books from which the highlighted text was taken were not cited. No disclosure that this chapter has been adapted from previously published material appears in the chapter, or anywhere else in the 2017 book . . .
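
As an aside, for anyone wondering how overlap percentages like the ones Brown reports might be estimated, here is a rough sketch using Python’s standard library (my own illustration; I have no idea what tools Brown or O’Connor actually used, and the filenames are hypothetical):

```python
import difflib

def verbatim_overlap(source_text, target_text, min_block=40):
    """Rough fraction of target_text covered by verbatim runs
    (at least min_block characters) that also appear in source_text."""
    matcher = difflib.SequenceMatcher(None, source_text, target_text, autojunk=False)
    covered = sum(block.size for block in matcher.get_matching_blocks()
                  if block.size >= min_block)
    return covered / max(len(target_text), 1)

# Hypothetical usage, with the two texts saved as plain-text files:
# earlier = open("chapter_2002.txt", encoding="utf-8").read()
# later = open("article_2010.txt", encoding="utf-8").read()
# print(f"{verbatim_overlap(earlier, later):.0%} of the later text is verbatim from the earlier one")
```

SequenceMatcher is quadratic in the worst case, so this is fine for a pair of articles, but you’d want something smarter for whole books.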

In the context of a long and thoughtful discussion, James Heathers supplies the rules from the American Psychological Association code of ethics:

And here’s Cornell’s policy:

OK, that’s the policy for Cornell students. Apparently not the policy for faculty.

One more thing

Bobbie Spellman, former editor of the journal Perspectives on Psychological Science, is confident “beyond a reasonable doubt” that Sternberg was not telling the truth when he said that “all papers in Perspectives go out for peer review, including his own introductions and discussions.” Unless, as Spellman puts it, “you believe that ‘peer review’ means asking some folks to read it and then deciding whether or not to take their advice before you approve publication of it.”

So, there you have it. The man is obsessed with citing his own work—except on the occasions when he does a cut-and-paste job, in which case he is suddenly shy about mentioning his other publications. And, as editor, he reportedly says he sends out everything for peer review, but then doesn’t.

P.S. From his (very long) C.V.:

Sternberg, R. J. (2015). Epilogue: Why is ethical behavior challenging? A model of ethical reasoning. In R. J. Sternberg & S. T. Fiske (Eds.), Ethical challenges in the behavioral and brain sciences: Case studies and commentaries (pp. 218-226). New York: Cambridge University Press.

This guy should join up with Bruno Frey and Brad Bushman: the 3 of them would form a very productive Department of Cut and Paste. Department chair? Ed Wegman, of course.


“We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis”

Brendan Nyhan and Thomas Zeitzoff write:

The results do not provide clear support for the lack-of control hypothesis. Self-reported feelings of low and high control are positively associated with conspiracy belief in observational data (model 1; p<.05 and p<.01, respectively). We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis. Moreover, our experimental treatment effect estimate for our low-control manipulation is null relative to both the high-control condition (the preregistered hypothesis test) as well as the baseline condition (a RQ) in both the combined (table 2) and individual item results (table B7). Finally, we find no evidence that the association with self-reported feelings of control in model 1 of table 2 or the effect of the control treatments in model 2 are moderated by anti-Western or anti-Jewish attitudes (results available on request). Our expectations are thus not supported.

It is good to see researchers openly express their uncertainty and be clear about the limitations of their data.


“Simulations are not scalable but theory is scalable”

Eren Metin Elçi writes:

I just watched this video on the value of theory in applied fields (like statistics); it really resonated with my previous research experiences in statistical physics and on the interplay between randomised perfect sampling algorithms and Markov chain mixing, as well as with my current perspective on the status quo of deep learning. . . .

So essentially in this post I give more evidence for [the] statements “simulations are not scalable but theory is scalable” and “theory scales” from different disciplines. . . .

The theory of finite size scaling in statistical physics: I devoted quite a significant amount of my PhD and post-doc research to finite size scaling, where I applied and checked the theory of finite size scaling for critical phenomena. In a nutshell, the theory of finite size scaling allows us to study the behaviour and infer properties of physical systems in the thermodynamic limit (close to phase transitions) through simulating sequences of finite model systems. This is required, since our current computational methods are far from being, and probably will never be, able to simulate real physical systems. . . .
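
For readers who haven’t seen it, the textbook finite-size-scaling ansatz (my gloss, not part of Elçi’s note) for an observable A measured on a system of linear size L near a continuous transition is

\[
A(t, L) \approx L^{\kappa/\nu} \, f\!\left(t \, L^{1/\nu}\right),
\qquad t = \frac{T - T_c}{T_c},
\]

where \(\nu\) is the correlation-length exponent, \(\kappa\) is the exponent governing A, and f is a universal scaling function. Fitting data from a sequence of finite sizes so that they collapse onto the single curve f is what lets you infer properties of the infinite system from finite simulations.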

Here comes a question I have been thinking about for a while . . . is there a (universal) theory that can quantify how deep learning models behave on larger problem instances, based on results from sequences of smaller problem instances? As an example, how do we have to adapt, say, a convolutional neural network architecture and its hyperparameters to sequences of larger (unexplored) problem instances (e.g. increasing the resolution of colour fundus images for the diagnosis of diabetic retinopathy, see “Convolutional Neural Networks for Diabetic Retinopathy” [4]) in order to guarantee a fixed precision over the whole sequence of problem instances, without the need for ad-hoc and manual adjustments to the architecture and hyperparameters for each new problem instance? A very early approach of a finite size scaling analysis of neural networks (admittedly for a rather simple “architecture”) can be found here [5]. An analogue to this, which just crossed my mind, is the study of Markov chain mixing times . . .

It’s so wonderful to learn about these examples where my work is inspiring young researchers to look at problems in new ways!


Facial feedback: “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.”

Fritz Strack points us to this article, “When Both the Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect,” by Tom Noah, Yaacov Schul, and Ruth Mayo, who write:

According to the facial-feedback hypothesis, the facial activity associated with particular emotional expressions can influence people’s affective experiences. Recently, a replication attempt of this effect in 17 laboratories around the world failed to find any support for the effect. We hypothesize that the reason for the failure of replication is that the replication protocol deviated from that of the original experiment in a critical factor. In all of the replication studies, participants were alerted that they would be monitored by a video camera, whereas the participants in the original study were not monitored, observed, or recorded. . . . we replicated the facial-feedback experiment in 2 conditions: one with a video-camera and one without it. The results revealed a significant facial-feedback effect in the absence of a camera, which was eliminated in the camera’s presence. These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.

We’ve discussed the failed replications of facial feedback before, so it seemed worth following up with this new paper that provides an explanation for the failed replication that preserves the original effect.

Here are my thoughts.

1. The experiments in this new paper are preregistered. I haven’t looked at the preregistration plan, but even if not every step was followed exactly, preregistration does seem like a good step.

2. The main finding is that facial feedback worked in the no-camera condition but not in the camera condition:

3. As you can almost see in the graph, the difference between these results is not itself statistically significant—not at the conventional p=0.05 level for a two-sided test. The result has a p-value of 0.102, which the authors describe as “marginally significant in the expected direction . . . . p=.051, one-tailed . . .” Whatever. It is what it is.
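
As a reminder of the relevant arithmetic (the generic formula, not the paper’s numbers): to compare the two conditions you look directly at the difference between the two estimated effects,

\[
z = \frac{\hat\theta_1 - \hat\theta_2}{\sqrt{\mathrm{se}_1^2 + \mathrm{se}_2^2}},
\]

and that difference can easily fall short of significance even when one of the two estimates clears the 0.05 threshold on its own.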

4. The authors are playing a dangerous game when it comes to statistical power. From one direction, I’m concerned that the studies are way too noisy: it says that their sample size was chosen “based on an estimate of the effect size of Experiment 1 by Strack et al. (1988),” but for the usual reasons we can expect that to be a huge overestimate of effect size, hence the real study has nothing like 80% power. From the other direction, the authors use low power to explain away non-statistically-significant results (“Although the test . . . was greatly underpowered, the preregistered analysis concerning the interaction . . . was marginally significant . . .”).
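
To make the power point concrete, here’s a minimal simulation sketch with made-up numbers (not the paper’s actual design or data): if the sample size is chosen to give 80% power for a large assumed effect, and the true effect is half that size, the realized power drops sharply.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(true_effect, n_per_group, n_sims=5000, alpha=0.05):
    """Share of simulated two-sample t-tests reaching p < alpha,
    for a standardized true effect (sd = 1 in both groups)."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_effect, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

n = 64  # gives roughly 80% power if the true standardized effect is 0.5
print(power(0.50, n))  # about 0.8 under the optimistic assumption
print(power(0.25, n))  # roughly 0.3 if the real effect is half as large
```

This is the familiar problem of powering a study off a published, and therefore likely inflated, effect-size estimate.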

5. I’m concerned that the study is too noisy, and I’d prefer a within-person experiment.

6. In their discussion section, the authors write:

Psychology is a cumulative science. As such, no single study can provide the ultimate, final word on any hypothesis or phenomenon. As researchers, we should strive to replicate and/or explicate, and any one study should be considered one step in a long path. In this spirit, let us discuss several possible ways to explain the role that the presence of a camera can have on the facial-feedback effect.

That’s all reasonable. I think the authors should also consider the hypothesis that what they’re seeing is more noise. Their theory could be correct, but another possibility is that they’re chasing another dead end. This sort of thing can happen when you stare really hard at noisy data.

7. The authors write, “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.” I have no idea, but if this is true, it would definitely be good to know.

8. The treatments are for people to hold a pen in their lips or their teeth in some specified ways. It’s not clear to me why any effects of these treatments (assuming the effects end up being reproducible) should be attributed to facial feedback rather than some other aspect of the treatment, such as priming or implicit association. I’m not saying there isn’t facial feedback going on; I just have no idea. I agree with the authors that their results are consistent with the facial-feedback model.

P.S. Strack also points us to this further discussion by E. J. Wagenmakers and Quentin Gronau, which I largely find reasonable, but I disagree with their statement regarding “the urgent need to preregister one’s hypotheses carefully and comprehensively, and then religiously stick to the plan.” Preregistration is fine, and I agree with their statement that generating fake data is a good way to test it out (one can also preregister using alternative data sets, as here), but I hardly see it as “urgent.” It’s just one part of the picture.
