The causal hype ratchet

Noah Haber informs us of a research article, “Causal language and strength of inference in academic and media articles shared in social media (CLAIMS): A systematic review,” that he wrote with Emily Smith, Ellen Moscoe, Kathryn Andrews, Robin Audy, Winnie Bell, Alana Brennan, Alexander Breskin, Jeremy Kane, Mahesh Karra, Elizabeth McClure, and Elizabeth Suarez, and writes:

The study picked up the 50 most shared academic articles and their associated media articles about any exposure vs. some health outcome (e.g., “Chocolate is linked to Alzheimer’s,” “Going to the ER on a weekend is associated with higher mortality,” etc.). We recruited a panel of 21 voluntary scientific reviewers from 6 institutions and multiple fields of study to review these articles, using novel systematic review methods developed for this study.

We found that only 6% of studies exhibited strong causal inference, but that 20% of academic authors in this sample used language strongly implying causality. The most shared media articles about these studies overstated this evidence even further, and were likely to inaccurately describe the study and its implications.

This study picks up on a huge number of issues salient in science today, from publication-related biases, to issues in scientific reporting, all the way down to social media. While this study can’t identify the degree to which any specific factor is responsible, we can identify that by the time we are likely to see health science, it is extremely misleading.

A public-language summary of the study is available here.

I’ve not read the article in detail, but I thought it might interest some of you so I’m sharing it here. Their conclusion is in accord with my subjective experiences, that exaggerated claims slip in at every stage of the reporting process. Also, I don’t think we should only blame journalists for exaggerated claims in news articles and social media. Researchers often seem all too willing to spread the hype themselves.


Exploring model fit by looking at a histogram of a posterior simulation draw of a set of parameters in a hierarchical model

Opher Donchin writes in with a question:

We’ve been finding it useful in the lab recently to look at the histogram of samples from the parameter combined across all subjects. We think, but we’re not sure, that this reflects the distribution of that parameter when marginalized across subjects and can be a useful visualization. It can be easier to interpret than the hyperparameters from which subjects are sampled and it is available in situations where the hyperparameters are not explicitly represented in the model.

I haven’t seen this being used much, and so I’m not confident that it is a reasonable thing to consider. I’m also not sure of my interpretation.

My reply:

Yes, I think this can make a lot of sense! We discuss an example of this technique on pages 155-157 of BDA3; see the following figures.

First we display a histogram of a draw from the posterior distribution of two sets of parameters in our model. Each histogram is a melange of parameters from 30 participants in the study.

The histograms do not look right; there is a conflict between these inferences and the prior distributions.

So we altered the model. Here are the corresponding histograms of the parameters under the new model:

These histograms seem like a good fit to the assumed prior (population) distributions in the new model.

The example comes from this 1998 article with Michel Meulders, Iven Van Mechelen, and Paul De Boeck.
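
If you want to try this kind of check yourself, here is a minimal sketch in Python. To be clear, it is not the BDA3 example: the number of draws, the 30 participants, and the normal population model are invented for illustration, and the simulated arrays stand in for whatever your sampler actually returns. The mechanics are the point: take a single posterior draw, histogram the participant-level parameters from that draw, and overlay the population distribution implied by the hyperparameters from the same draw.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in for real MCMC output: pretend the sampler returned 4000 posterior
# draws of 30 participant-level parameters theta_j, plus the hyperparameters
# (mu, tau) of an assumed Normal(mu, tau) population distribution.
# (With this synthetic stand-in the check passes by construction; with real
# output it may not, which is exactly what the check is for.)
n_draws, n_subjects = 4000, 30
mu_draws = rng.normal(1.0, 0.1, size=n_draws)
tau_draws = np.abs(rng.normal(0.5, 0.05, size=n_draws))
theta_draws = rng.normal(mu_draws[:, None], tau_draws[:, None],
                         size=(n_draws, n_subjects))

# The check: pick a single posterior draw, histogram the 30 theta_j's from
# that draw, and overlay the population density implied by the hyperparameters
# from the same draw.  Repeat for a few draws to see the variability.
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
for ax, s in zip(axes, rng.choice(n_draws, size=4, replace=False)):
    ax.hist(theta_draws[s], bins=10, density=True,
            color="lightgray", edgecolor="black")
    grid = np.linspace(theta_draws[s].min() - 1, theta_draws[s].max() + 1, 200)
    density = np.exp(-0.5 * ((grid - mu_draws[s]) / tau_draws[s]) ** 2) \
              / (tau_draws[s] * np.sqrt(2 * np.pi))
    ax.plot(grid, density)  # population distribution from the same draw
    ax.set_title(f"posterior draw {s}")
plt.tight_layout()
plt.show()
```

If the histogram from a single draw looks systematically different from the overlaid population curve (heavier tails, skewness, a clump of outliers), that is the kind of conflict between the inferences and the assumed prior distribution that prompted the change of model above.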


When “nudge” doesn’t work: Medication Reminders to Outcomes After Myocardial Infarction

Gur Huberman points to this news article by Aaron Carroll, “Don’t Nudge Me: The Limits of Behavioral Economics in Medicine,” which reports on a recent study by Kevin Volpp et al. that set out “to determine whether a system of medication reminders using financial incentives and social support delays subsequent vascular events in patients following AMI compared with usual care”—and found no effect:

A compound intervention integrating wireless pill bottles, lottery-based incentives, and social support did not significantly improve medication adherence or vascular readmission outcomes for AMI survivors.

That said, there were some observed differences between the two groups, most notably:

Mean (SD) medication adherence did not differ between control (0.42 [0.39]) and intervention (0.46 [0.39]) (difference, 0.04; 95% CI, −0.01 to 0.09; P = .10).

An increase in adherence from 42% to 46% ain’t nothing, but, yes, a null effect is also within the margin of error. And, in any case, 46% adherence is not so impressive.
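
To see why both readings are consistent, here is a quick back-of-the-envelope check in Python. It uses only the numbers in the quote above; this is arithmetic on the published summary, not a reanalysis of the trial, and the result will differ slightly from the paper's because of rounding and covariate adjustment.

```python
from scipy.stats import norm

# Numbers from the quoted result: the adherence difference and its 95% CI.
diff, ci_lower, ci_upper = 0.04, -0.01, 0.09

se = (ci_upper - ci_lower) / (2 * 1.96)  # standard error implied by the interval
z = diff / se                            # how many standard errors from zero
p = 2 * norm.sf(abs(z))                  # two-sided p-value, normal approximation

print(f"implied se = {se:.3f}, z = {z:.2f}, p = {p:.2f}")
```

The difference sits about a standard error and a half from zero, roughly matching the reported P = .10. That is all "within the margin of error" means here: the data are consistent with a small positive effect and also with no effect at all.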

Here’s Carroll:

A thorough review published in The New England Journal of Medicine about a decade ago estimated that up to two-thirds of medication-related hospital admissions in the United States were because of noncompliance . . . To address the issue, researchers have been trying various strategies . . . So far, there hasn’t been much progress. . . . A more recent Cochrane review concluded that “current methods of improving medication adherence for chronic health problems are mostly complex and not very effective.” . . .

He then describes the Volpp et al. study quoted above:

Researchers randomly assigned more than 1,500 people to one of two groups. All had recently had heart attacks. One group received the usual care. The other received special electronic pill bottles that monitored patients’ use of medication. . . .

Also:

Those patients who took their drugs were entered into a lottery in which they had a 20 percent chance to receive $5 and a 1 percent chance to win $50 every day for a year.

That’s not all. The lottery group members could also sign up to have a friend or family member automatically be notified if they didn’t take their pills so that they could receive social support. They were given access to special social work resources. There was even a staff engagement adviser whose specific duty was providing close monitoring and feedback, and who would remind patients about the importance of adherence.

But, Carroll writes:

The time to first hospitalization for a cardiovascular problem or death was the same between the two groups. The time to any hospitalization and the total number of hospitalizations were the same. So were the medical costs. Even medication adherence — the process measure that might influence these outcomes — was no different between the two groups.

This is not correct. There were, in fact, differences. But, yes, the differences were not statistically significant and it looks like differences of that size could’ve occurred by chance alone. So we can say that the treatment had no clear or large apparent effects.

Carroll also writes:

Maybe financial incentives, and behavioral economics in general, work better in public health than in more direct health care.

I have no idea why he is saying this. Also it’s not clear to me how he distinguishes “public health” from “direct health care.” He mentions weight loss and smoking cessation but these seem to blur the boundary, as they’re public health issues that are often addressed by health care providers.

Anyway, my point here is not to criticize Carroll. It’s an interesting topic. My quick thought on why nudges seem so ineffective here is that people must have good reasons for not complying—or they must think they have good reasons. After all, complying would seem to be a good idea, and it’s close to effortless, no? So if the baseline rate of compliance is really only 40%, maybe it would take a lot to convince those other 60% to change their behaviors.

It’s similar to the difficulty of losing weight or quitting smoking. It’s not that it’s so inherently hard to lose weight or to quit smoking; it’s that people who can easily lose weight or quit smoking have already done so, and it’s the tough cases that remain. Similarly, the people who are easy to convince to comply . . . they’re already complying with the treatment. The noncompliers are a tougher nut to crack.


Comparing racism from different eras: If only Tucker Carlson had been around in the 1950s he could’ve been a New York Intellectual.

TV commentator Tucker Carlson raised a stir in 2018 by saying that immigration makes the United States “poorer, and dirtier, and more divided,” which reminded me of this rant from literary critic Alfred Kazin in 1957:

[Screenshot of Alfred Kazin’s 1957 diary entry]

Kazin put it in his diary and Carlson broadcast it on TV, so not quite the same thing.

But this juxtaposition made me think of Keith Ellis’s comment that “there’s much less difference between conservatives and progressives than most people think. Maybe one or two generations of majority opinion, at most.”

When people situate themselves on political issues, I wonder how much of this is on the absolute scale and how much is relative to current policies or the sense of the prevailing opinion. Is Tucker Carlson more racist than Alfred Kazin? Does this question even make sense? Maybe it’s like comparing baseball players from different eras, e.g. Mike Trout vs. Babe Ruth as hitters. Or, since we’re on the topic of racism, Ty Cobb vs. John Rocker.


Why do sociologists (and bloggers) focus on the negative? 5 possible explanations. (A post in the style of Fabio Rojas)

Fabio Rojas asks why the academic field of sociology seems so focused on the negative. As he puts it, why doesn’t the semester begin with the statement, “Hi, everyone, this is soc 101, the scientific study of society. In this class, I’ll tell you about how American society is moving in some great directions as well as some lingering problems”?

Rojas writes:

If sociology is truly a broad social science, and not just the study of “social problems,” then we might encourage more research into the undeniably positive improvements in human well being.

This suggestion interests me, in part because on this blog we are often negative. We sometimes write about cool new methods or findings in statistical modeling, causal inference, and social science, but we also spend a lot of time on the negative. And it’s not just us; it’s my impression that blogs in general have a lot of negativity, in the same way that movie reviews are often negative. Even if a reviewer likes a movie, he or she will often take some space to point out possible areas of improvement. And many of the most-remembered reviews are slams.

Rather than getting into a discussion of whether blogs, or academic sociology, or movie reviews, should be more positive or negative, let’s get into the more interesting question of Why.

Why is negativity such a standard response? Let me try to answer in Rojas style:

1. Division of labor. Within social science, sociology’s “job” is to confront us with the bad news, to push us to study inconvenient truths. If you want to hear good news, you can go listen to the economists. Similarly, blogs took the “job” of criticizing the mainstream media (and, later, the scientific establishment); it was a niche that needed filling. If you want to be a sociologist or blogger and focus on the good things, that’s fine, but you’ll be atypical. Explanation 1 suggests that sociologists (and bloggers, and movie reviewers) have adapted to their niches in the intellectual ecosystem, and that each field has the choice of continuing to specialize or to broaden by trying to occupy some of the “positivity” space occupied by other institutions.

2. Efficient allocation of resources. Where can we do the most good? Reporting positive news is fine, but we can do more good by focusing on areas of improvement. I think this is somewhat true, but not always. Yes, it’s good to point out where people can do better, but we can also do good by understanding how good things happen. This is related to the division-of-labor idea above, or it could be considered an example of comparative advantage.

3. Status. Sociology doesn’t have the prestige of economics (more generally, social science doesn’t have the prestige of the natural sciences); blogs have only a fraction of the audience of the mass media (and we get paid even less for blogging than they get paid for their writing); and movie reviewers, of course, are nothing but parasites on the movie industry. So maybe we are negative for emotional reasons—to kick back at our social superiors—or for strategic reasons, to justify our existence. Either way, these are actions of insecure people in the middle, trying to tear down the social structure and replace it with a new one where they’re at the top. This is kind of harsh and it can’t fully be true—how, for example, would it explain that even the sociologists who are tenured professors at top universities still (presumably) focus on the bad news, or that even star movie reviewers can be negative—but maybe it’s part of the way that roles and expectations are established and maintained.

4. Urgency. Psychiatrists work with generally-healthy people as well as the severely mentally ill. But caring for the sickest is the most urgent: these are people who are living miserable lives, or who pose danger to themselves and others. Similarly (if on a lesser scale of importance), we as social scientists might feel that progress will continue on its own, while there’s no time to wait to fix serious social ills. Similarly, as a blogger, I might not bother saying much about a news article that was well reported, because the article itself did a good job of sending its message. But it might seem more urgent to correct an error. Again, this is not always good reasoning—it could be that understanding a positive trend and keeping it going is more urgent than alerting people to a problem—but I think this may be one reason for a seeming focus on negativity. As Auden put it,

To-morrow, perhaps the future. The research on fatigue
And the movements of packers; the gradual exploring of all the
Octaves of radiation;
To-morrow the enlarging of consciousness by diet and breathing.

To-morrow the rediscovery of romantic love,
the photographing of ravens; all the fun under
Liberty’s masterful shadow;
To-morrow the hour of the pageant-master and the musician,

The beautiful roar of the chorus under the dome;
To-morrow the exchanging of tips on the breeding of terriers,
The eager election of chairmen
By the sudden forest of hands. But to-day the struggle.

5. Man bites dog. Failures are just more interesting to write about, and to read about, than successes. We’d rather hear the story of “secrets and lies in a Silicon Valley startup” than hear the boring story of a medical device built by experienced engineers and sold at a reasonable price. Hence the popularity within social science (not just sociology!) of stories of the form, Everything looks like X but not Y; the popularity among bloggers of Emperor’s New Clothes narratives; and the popularity among movie reviewers of, This big movie isn’t all that. You will occasionally get it the other way—This seemingly bad thing is really good—but it’s generally in the nature of contrarian takes to be negative, because they’re reacting to some previous positive message coming from public relations and the news media.

Finally, some potential explanations that I don’t think really work:

Laziness. Maybe it’s less effort to pick out things to complain about than to point out good news. I don’t think so. When it comes to society, as Rojas notes in his post, there are lots of positive trends to point out. Similarly, science is full of interesting papers—open up just about any journal and look for the best, most interesting ideas—and there are lots of good movies too.

Rewards. You get more credit, pay, and glory for being negative than positive. Again, I don’t think so. Sure, there are the occasional examples such as H. L. Mencken, but I think the smoother path to career success is to say positive things. Pauline Kael, for example, had some memorable pans but I’d say her characteristic stance was enthusiasm. For every Thomas Frank there are three Malcolm Gladwells (or so I say based on my unscientific guess), and it’s the Gladwells who get more of the fame and fortune.

Personality. Sociologists, bloggers, and reviewers are, by and large, malcontents. They grumble about things cos that’s what they do, and whiny people are more likely to gravitate to these activities. OK, maybe so, but this doesn’t really explain why negativity is concentrated in these fields and media rather than others. The “personality” explanation just takes us back to our first explanation, “division of labor.”

And, yes, I see the irony that this post, which is all about why sociologists and bloggers are so negative, has been sparked by a negative remark made by a sociologist on a blog. And I’m sure you will have some negative things to say in the comments. After all, the only people more negative than bloggers are blog commenters!


Surprise-hacking: “the narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists”

Teppo Felin sends along this article with Mia Felin, Joachim Krueger, and Jan Koenderink on “surprise-hacking,” and writes:

We essentially see surprise-hacking as the upstream, theoretical cousin of p-hacking. Though, surprise-hacking can’t be resolved with replication, more data or preregistration. We use perception and priming research to make these points (linking to Kahneman and priming, Simons and Chabris’s famous gorilla study and its interpretation, etc).

We think surprise-hacking implicates theoretical issues that haven’t meaningfully been touched on – at least in the limited literatures that we are aware of (mostly in cog sci, econ, psych). Though, there are probably related literatures out there (which you are very likely to know) – so I’m curious if you are aware of papers in other domains that deal with this or related issues?

I think the point that Felin et al. are making is that results obtained under conditions of surprise might not generalize to normal conditions. The surprise in the experiment is typically thought of as a mechanism for isolating some phenomenon—part of the design of the experiment—but arguably it is one of the conditions of the experiment as well. Thus, the conclusion of a study conducted under surprise should not be, “People show behavior X,” but rather, “People show behavior X under a condition of surprise.”

Regarding Felin’s question to me: I am not aware of any discussion of this issue in the political science literature, but maybe there’s something out there, or perhaps something related? All I can think of right now is experiments on public opinion and voting, where there is some discussion of the relevance of isolated experiments to real-world behavior when people are subject to many influences.

I’ll conclude with a line from Felin et al.’s paper:

The narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists.


I’m reminded of the two modes of reasoning in pop-microeconomics: (1) People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist, or (2) People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

They get you coming and going, and the common thread is that they know best. The message is that we are all foolish fools and we need the experts’ expertise for life-hacks that will change our lives.

If we step back a bit further, we can associate this with a general approach to social science, or science in general, which is to focus on “puzzles” or anomalies to our existing theories. From a Popperian/Lakatosian perspective, it makes sense to gnaw on puzzles and to study the counterintuitive. The point, though, is that the blindness and illusion is as much a property of researchers—after all, the point is to investigate phenomena that don’t fit with our scientific models of the world—as of the people being studied. It’s not so much that people are predictably irrational, but that existing scientific theories are wrong in some predictable ways.


“My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion…”

Youyou Wu writes:

I’m a postdoc studying scientific reproducibility. I have a machine learning question that I desperately need your help with. My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion…

I’m trying to predict whether a study can be successfully replicated (DV), from the texts in the original published article. Our hypothesis is that language contains useful signals in distinguishing reproducible findings from irreproducible ones. The nuances might be invisible to human eyes, but can be detected by machine algorithms.

The protocol is illustrated in the following diagram to demonstrate the flow of cross-validation. We conducted a repeated three-fold cross-validation on the data.

STEP 1) Train a doc2vec model on the training data (2/3 of the data) to convert raw texts into vectors representing language features (this algorithm is non-deterministic: the model and its outputs can differ even with the same inputs and parameters)
STEP 2) Infer vectors using the doc2vec model for both training and test sets
STEP 3) Train a logistic regression using the training set
STEP 4) Apply the logistic regression to the test set, generate a predicted probability of success

Because doc2vec is not deterministic, and we have a small training sample, we came up with two choices of strategies:

(1) All studies were first divided into three subsamples A, B, and C. Steps 1 through 4 were done once with sample A as the test set and a combined sample of B and C as the training set, generating one predicted probability for each study in sample A. To generate probabilities for the entire sample, Steps 1 through 4 were repeated two more times, setting sample B or C as the test set respectively. At this point, we had one predicted probability for each study. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same three-fold cross-validation. A new probability was generated for each study this time. The whole procedure was iterated 100 times, so each study had 100 different probabilities. We averaged the probabilities and compared the average probabilities with the ground truth to generate a single AUC score.

(2) All studies were first divided into three subsamples A, B, and C. Steps 1 through 4 were first repeated 100 times with sample A as the test set and a combined sample of B and C as the training set, generating 100 predicted probabilities for each study in sample A. As I said, these 100 probabilities are different because doc2vec isn’t deterministic. We took the average of these probabilities and treated that as our final estimate for the studies. To generate average probabilities for the entire sample, each group of 100 runs was repeated two more times, setting sample B or C as the test set respectively. An AUC was calculated upon completion, between the ground truth and the average probabilities. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same 3×100 runs of modeling, generating a new AUC. The whole procedure was iterated on 100 different shuffles, and an AUC score was calculated each time. We ended up with a distribution of 100 AUC scores.

I personally thought strategy two is better because it separates variation in accuracy due to sampling from the non-determinism of doc2vec. My advisor thought strategy one is better because it’s less computationally intensive, produces better results, and doesn’t have obvious flaws.

My first thought is to move away from the idea of declaring a study as being “successfully replicated.” Better to acknowledge the continuity of the results from any study.

Getting to the details of your question on cross-validation: Jeez, this really is complicated. I keep rereading your email over and over again and getting confused each time. So I’ll throw this one out to the commenters. I hope someone can give a useful suggestion . . .

OK, I do have one idea, and that’s to evaluate your two procedures (1) and (2) using fake-data simulation: Start with a known universe, simulate fake data from that universe, then apply procedures (1) and (2) and see if they give much different answers. Loop the entire procedure and see what happens, comparing your cross-validation results to the underlying truth which in this case is assumed known. Fake-data simulation is the brute-force approach to this problem, and perhaps it’s a useful baseline to help understand your problem.
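
To make that suggestion a bit more concrete, here is a rough sketch of such a fake-data simulation in Python. Everything in it is a placeholder: the simulated "universe," the noisy re-embedding that stands in for doc2vec's non-determinism, the sample sizes, and the reduced numbers of shuffles and inner repeats. The structure is what matters: steps 1 through 4 wrapped in a function, strategy (1) averaging probabilities across shuffles into one AUC, and strategy (2) producing a distribution of AUCs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# --- A known "universe": latent text features and replication outcomes ---
n_studies, n_features = 150, 5
beta = rng.normal(size=n_features)
X_true = rng.normal(size=(n_studies, n_features))
y = rng.binomial(1, 1 / (1 + np.exp(-X_true @ beta)))

def embed(X, noise_sd=1.0):
    """Stand-in for doc2vec: the true features plus fresh noise on every call,
    mimicking the non-determinism of retraining the embedding."""
    return X + rng.normal(scale=noise_sd, size=X.shape)

def steps_1_to_4(train_idx, test_idx):
    """Steps 1-4: 'train' an embedding, embed train and test sets, fit a
    logistic regression on the training set, return test-set probabilities."""
    Z_train, Z_test = embed(X_true[train_idx]), embed(X_true[test_idx])
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y[train_idx])
    return clf.predict_proba(Z_test)[:, 1]

def three_fold_probs(perm, n_inner=1):
    """One three-fold pass over a random partition; each study's probability
    is the average of n_inner repeated runs of steps 1-4."""
    folds = np.array_split(perm, 3)
    probs = np.empty(n_studies)
    for k in range(3):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
        runs = [steps_1_to_4(train_idx, test_idx) for _ in range(n_inner)]
        probs[test_idx] = np.mean(runs, axis=0)
    return probs

n_shuffles = 20  # 100 in the description above; reduced to keep the sketch quick

# Strategy (1): average the probabilities across shuffles, then one AUC.
probs_1 = np.mean([three_fold_probs(rng.permutation(n_studies))
                   for _ in range(n_shuffles)], axis=0)
auc_1 = roc_auc_score(y, probs_1)

# Strategy (2): average over inner repeats within each shuffle, compute one
# AUC per shuffle, and end up with a distribution of AUC scores.
aucs_2 = [roc_auc_score(y, three_fold_probs(rng.permutation(n_studies), n_inner=10))
          for _ in range(n_shuffles)]

print(f"strategy 1: AUC = {auc_1:.3f}")
print(f"strategy 2: AUC = {np.mean(aucs_2):.3f} +/- {np.std(aucs_2):.3f}")
```

Because the replication outcome is generated from a known model here, you can check directly whether the two strategies rank studies about equally well, and how much shuffle-to-shuffle variability strategy (1)'s single AUC hides compared with the distribution that strategy (2) reports.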


A couple of thoughts regarding the hot hand fallacy fallacy

For many years we all believed the hot hand was a fallacy. It turns out we were all wrong. Fine. Such reversals happen.

Anyway, now that we know the score, we can reflect on some of the cognitive biases that led us to stick with the “hot hand fallacy” story for so long.

Jason Collins writes:

Apart from the fact that this statistical bias slipped past everyone’s attention for close to thirty years, I [Collins] find this result extraordinarily interesting for another reason. We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.
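
For readers who haven't seen the bias Collins is referring to (the small-sample selection bias that Miller and Sanjurjo identified), it is easy to demonstrate with a few lines of simulation. The numbers here are made up for illustration: a 50% shooter and short 10-shot sequences, chosen so the bias is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_shots, n_seqs = 0.5, 10, 100_000  # hypothetical: 50% shooter, 10-shot games

props = []
for _ in range(n_seqs):
    shots = rng.random(n_shots) < p
    after_hit = shots[1:][shots[:-1]]  # outcomes immediately following a hit
    if after_hit.size:                 # skip sequences with nothing to condition on
        props.append(after_hit.mean())

print(f"true P(hit) = {p}")
print(f"average within-sequence P(hit | previous hit) = {np.mean(props):.3f}")
```

The within-sequence proportion of hits following a hit averages out noticeably below the true 50 percent, even though every shot is an independent coin flip. So a real shooter whose after-hit percentage merely matches his overall percentage is, by this measure, doing better than independence would predict, which is the core of the Miller and Sanjurjo argument.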

Also I was thinking a bit more about the hot hand, in particular a flaw in the underlying logic of Gilovich et al. (and also me, before Miller and Sanjurjo convinced me about the hot hand): The null model is that each player j has a probability p_j of making a given shot, and that p_j is constant for the player (considering only shots of some particular difficulty level). But where does p_j come from? Obviously players improve with practice, with game experience, with coaching, etc. So p_j isn’t really a constant. But if “p” varies among players, and “p” varies over the time scale of years or months for individual players, why shouldn’t “p” vary over shorter time scales too? In what sense is “constant probability” a sensible null model at all?

I can see that “constant probability for any given player during a one-year period” is a better model than “p varies wildly from 0.2 to 0.8 for any player during the game.” But that’s a different story. The more I think about the “there is no hot hand” model, the more I don’t like it as any sort of default.

In any case, it’s good to revisit our thinking about these theories in light of new arguments and new evidence.


Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise.

I have a sad story for you today.

Jason Collins tells it:

In The (Honest) Truth About Dishonesty, Dan Ariely describes an experiment to determine how much people cheat . . . The question then becomes how to reduce cheating. Ariely describes one idea:

We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever.

Sounds pretty impressive! But these things all sound impressive when described at some distance from the data.

Anyway, Collins continues:

This experiment has now been subject to a multi-lab replication by Verschuere and friends. The abstract of the paper:

. . . Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory . . . moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. . . .

And what happened? It’s in the graph above (from Verschuere et al., via Collins). The average estimated effect was tiny, it was not conventionally “statistically significant” (that is, the 95% interval included zero), and it “was numerically in the opposite direction of the original study.”

As is typically the case, I’m not gonna stand here and say I think the treatment had no effect. Rather, I’m guessing it has an effect which is sometimes positive and sometimes negative; it will depend on person and situation. There doesn’t seem to be any large and consistent effect, that’s for sure. Which maybe shouldn’t surprise us. After all, if the original finding was truly a surprise, then we should be able to return to our original state of mind, when we did not expect this very small intervention to have such a large and consistent effect.

I promised you a sad story. But, so far, this is just one more story of a hyped claim that didn’t stand up to the rigors of science. And I can’t hold it against the researchers that they hyped it: if the claim had held up, it would’ve been an interesting and perhaps important finding, well worth hyping.

No, the sad part comes next.

Collins reports:

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.

That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.

You can read the response and judge for yourself. I think Collins’s report is accurate, and that’s what made me sad. These people care enough about this topic to conduct a study, write it up in a research article and then in a book—but they don’t seem to care enough to seriously entertain the possibility they were mistaken. It saddens me. Really, what’s the point of doing all this work if you’re not going to be open to learning?

And there’s no need to think anything done in the first study was unethical at the time. Remember Clarke’s Law.

Another way of putting it is: Ariely’s book is called “The Honest Truth . . .” I assume Ariely was honest when writing this book; that is, he was expressing sincerely-held views. But honesty (and even transparency) are not enough. Honesty and transparency supply the conditions under which we can do good science, but we still need to perform good measurements and study consistent effects. The above-discussed study failed in part because of the old, old problem that they were using a between-person design to study within-person effects; see here and here. (See also this discussion from Thomas Lumley on a related issue.)

P.S. Collins links to the original article by Mazar, Amir, and Ariely. I guess that if I’d read it in 2008 when it appeared, I’d’ve believed all its claims too. A quick scan shows no obvious problems with the data or analyses. But there can be lots of forking paths and unwittingly opportunistic behavior in data processing and analysis; recall the 50 Shades of Gray paper (in which the researchers performed their own replication and learned that their original finding was not real) and its funhouse parody 64 Shades of Gray paper, whose authors appeared to take their data-driven hypothesizing all too seriously. The point is: it can look good, but don’t trust yourself; do the damn replication.

P.P.S. This link also includes some discussions, including this from Scott Rick and George Loewenstein:

In our opinion, the main limitation of Mazar, Amir, and Ariely’s article is not in the perspective it presents but rather in what it leaves out. Although it is important to understand the psychology of rationalization, the other factor that Mazar, Amir, and Ariely recognize but then largely ignore—namely, the motivation to behave dishonestly—is arguably the more important side of the dishonesty equation. . . .

A closer examination of many of the acts of dishonesty in the real world reveals a striking pattern: Many, if not most, appear to be motivated by the desire to avoid (or recoup) losses rather than the simple desire for gain. . . .

The feeling of being in a hole not only originates from nonshareable unethical behavior but also can arise, more prosaically, from overly ambitious goals . . . Academia is a domain in which reference points are particularly likely to be defined in terms of the attainments of others. Academia is becoming increasingly competitive . . . With standards ratcheting upward, there is a kind of “arms race” in which academics at all levels must produce more to achieve the same career gains. . . .

An unfortunate implication of hypermotivation is that as competition within a domain increases, dishonesty also tends to increase in response. Goodstein (1996) feared as much over a decade ago:

. . . What had always previously been a purely intellectual competition has now become an intense competition for scarce resources. This change, which is permanent and irreversible, is likely to have an undesirable effect in the long run on ethical behavior among scientists. Instances of scientific fraud are almost sure to become more common.

Rick and Loewenstein were ahead of their time to be talking about all that, back in 2008. Also this:

The economist Andrei Shleifer (2004) explicitly argues against our perspective in an article titled “Does Competition Destroy Ethical Behavior?” Although he endorses the premise that competitive situations are more likely to elicit unethical behavior, and indeed offers several examples other than those provided here, he argues against a psychological perspective and instead attempts to show that “conduct described as unethical and blamed on ‘greed’ is sometimes a consequence of market competition” . . .

Shleifer (2004) concludes optimistically, arguing that competition will lead to economic growth and that wealth tends to promote high ethical standards. . . .

Wait—Andrei Shleifer—wasn’t he involved in some scandal? Oh yeah:

During the early 1990s, Andrei Shleifer headed a Harvard project under the auspices of the Harvard Institute for International Development (HIID) that invested U.S. government funds in the development of Russia’s economy. Shleifer was also a direct advisor to Anatoly Chubais, then vice-premier of Russia . . . In 1997, the U.S. Agency for International Development (USAID) canceled most of its funding for the Harvard project after investigations showed that top HIID officials Andrei Shleifer and Jonathan Hay had used their positions and insider information to profit from investments in the Russian securities markets. . . . In August 2005, Harvard University, Shleifer and the Department of Justice reached an agreement under which the university paid $26.5 million to settle the five-year-old lawsuit. Shleifer was also responsible for paying $2 million worth of damages, though he did not admit any wrongdoing.

In the above quote, Shleifer refers to “conduct described as unethical” and puts “greed” in scare quotes. No way Shleifer could’ve been motivated by greed, right? After all, he was already rich, and rich people are never greedy, or so I’ve heard.

Anyway, that last bit is off topic; still, it’s interesting to see all these connections. Cheating’s an interesting topic, even though (or especially because) it doesn’t seem that it can be turned on and off using simple behavioral interventions.


Time series of Democratic/Republican vote share in House elections

Yair prepared this graph of average district vote (imputing uncontested seats at 75%/25%; see here for further discussion of this issue) for each House election year since 1976:

Decades of Democratic dominance persisted through 1992; since then the two parties have been about even.

As has been widely reported, a mixture of geographic factors and gerrymandering has given Republicans the edge in House seats in recent years (most notably in 2012, when they retained control even after losing the national vote), but if you look at aggregate votes it’s been a pretty even split.

The above graph also shows that the swing in 2018 was pretty big: not as large as the historic swings in 1994 and 2010, but about the same as the Democratic gains in 2006 and larger than any other swing in the past forty years.
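
For concreteness, the kind of summary plotted above can be computed along these lines. This is a hedged sketch, not Yair’s actual code: the file name, the column names, and the rule for flagging unopposed districts are all assumptions.

```python
import pandas as pd

# Assumed layout (hypothetical file and column names): one row per
# district-year with raw Democratic and Republican vote totals.
df = pd.read_csv("house_district_votes.csv")  # year, district, dem_votes, rep_votes

df["dem_share"] = df["dem_votes"] / (df["dem_votes"] + df["rep_votes"])

# Impute districts where one party ran unopposed at 75%/25% rather than
# 100%/0%, so walkover districts don't distort the average.
df.loc[df["rep_votes"] == 0, "dem_share"] = 0.75
df.loc[df["dem_votes"] == 0, "dem_share"] = 0.25

# Average district vote (each district weighted equally), by election year,
# and the election-to-election swing discussed above.
avg_district_vote = df.groupby("year")["dem_share"].mean()
swing = avg_district_vote.diff()
print(avg_district_vote.tail(), swing.tail(), sep="\n\n")
```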

See here and here for more on what happened in 2018.
