Stephen Wolfram explains neural nets

It’s easy to laugh at Stephen Wolfram, and I don’t like some of his business practices, but he’s an excellent writer and is full of interesting ideas. This long introduction to neural network prediction algorithms is an example. I have no idea if Wolfram wrote this book chapter himself or if he hired one of his paid theorem-provers to do it—I guess it’s probably some sort of collaboration—but it doesn’t really matter. It all looks really cool.

The post Stephen Wolfram explains neural nets appeared first on Statistical Modeling, Causal Inference, and Social Science.

“And when you did you weren’t much use, you didn’t even know what a peptide was”

Last year we discussed the story of an article, “Variation in the β-endorphin, oxytocin, and dopamine receptor genes is associated with different dimensions of human sociality,” published in PNAS that, notoriously, misidentified what a peptide was, among other problems.

Recently I learned of a letter published in PNAS by Patrick Jern, Karin Verweij, Fiona Barlow, and Brendan Zietsch, with the no-fooling-around title, “Reported associations between receptor genes and human sociality are explained by methodological errors and do not replicate.”

And here’s the response by one of the authors, Robin Dunbar, entitled not “Sorry, we got it wrong” but rather “On asking the right questions.”

Too bad they couldn’t simply admit they made an error, stating clearly and without equivocation that their original conclusions were not substantiated. On the plus side, they weren’t as rude as these authors.

P.S. The other thing in that post was that I suggested to PNAS that they change their slogan from “PNAS publishes only the highest quality scientific research” to “PNAS aims to publish only the highest quality scientific research.” And they did it! So cool.


Multilevel models for multiple comparisons! Varying treatment effects!

Mark White writes:

I have a question regarding using multilevel models for multiple comparisons, per your 2012 paper and many blog posts. I am in a situation where I do randomized experiments, and I have a lot of additional demographic information about people, as well. For the moment, let us just assume that all of these are categorical demographic variables. I want to know not only whether there is an effect of the treatment over the control, but also for which groups there is an effect (positive or negative). I never get too granular, but I do look at the intersection of two variables (e.g., Black men, younger married people, Republican women) as well as within a single variable (e.g., women, Republicans, married people).

The issue I’m running into is that I want to look at the effects for all of these groups, but I don’t want to get mired down by Type I error and go chasing noise. (I know you reject the Type I error paradigm because a null of precisely zero is a straw-man argument, but clients and other stakeholders still want to be sure we aren’t reading too much into something that is not there.)

In the machine learning literature, there is a growing interest in causal inference and now a whole topic called “heterogeneous treatment effects.” In the general linear model world in which I was taught as a psychologist, this could also just be called “looking for interactions.” Many of these methods are promising, but I’m finding them difficult to implement in my scenario (I wrote a question here and posed a tailored question about one package to the package creators directly here).

Turning back to multilevel models, it seems like I could do this in that framework. Basically, I just create a non-nested/crossed/whatever you’d like to call it model where people are nested in k groups, where k refers to how many demographic variables I have. I simulated data and fit a model here:

The questions I have for you are the questions I pose at the bottom of that R script at the GitHub code snippet:

1. Is this a reasonable approach to examine “heterogeneous treatment effects” without getting bogged down by Type I error and multiple comparison problems?

2. If it is, how can I get confidence intervals from the fitted model object using glmer? You all do so in the 2012 paper, I believe.

3. More importantly, how can I look at the intersection between two groups? The code I sent in that GitHub snippet looks at effects for men, women, Blacks, Whites, millennials, etc. But I coded in an effect for Black men specifically. How could I use that fitted model object to examine the effect for Black men, White women, millennials with kids, etc.? And how would I calculate standard errors for these?

4. Would all of these things be easier to do in Stan? What would that Stan model look like? Then I wouldn’t have to figure out how to calculate standard errors for everything; I could just sample from the posterior.

My reply:

We’ve been talking about varying treatment effects for a long time. (“Heterogeneous” is jargon for “varying,” I think.)

From 2004: Treatment effects in before-after data.

From 2008: Estimating incumbency advantage and its variation, as an example of a before/after study.

From 2015: The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

From 2015: Hierarchical models for causal effects.

From 2015: The connection between varying treatment effects and the well-known optimism of published research findings.

From 2017: Let’s accept the idea that treatment effects vary—not as something special but just as a matter of course.

I definitely think hierarchical modeling is the way to go here. Think of it as a regression model, in which you’re modeling (predicting) treatment effects given pre-treatment predictors, so the treatment could be more effective for men than for women, or for young people than for old people, etc. You’ll end up with lots of predictors in this regression, and multilevel modeling is a way to control or regularize their coefficients.

In short, the key virtue of multilevel modeling (or some other regularization approach) here is that it allows you to include more predictors in your regression. Without regularization, your estimates would become too noisy, then you’d have to fit a cruder model, not allowing you to study the variation that you care about.

The other thing is, yeah, forget type 1 error rates and all the rest. Abandon the idea that the goal of the statistical analysis is to get some sort of certainty. Instead, accept posterior ambiguity: don’t try to learn more from the data than you really can.

I’ll start with some models in lme4 (or rstanarm) notation. Suppose you have a treatment z and pre-treatment predictors x1 and x2. Then here are some models:

y ~ z + x1 + x2 # constant treatment effect
y ~ z + x1*z + x2*z # treatment can vary by x1 and x2
y ~ z + x1*x2*z # also include interaction of x1 and x2

If you have predictors x3 and x4 with multiple levels:

y ~ z + x1 + x2 + (1 | x3) + (1 | x4) # constant treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3) + (1 + z | x4) # varying treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3*x4) # includes an interaction
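The shrinkage that these varying-effect models buy can be illustrated without any modeling machinery at all. Here is a minimal, stdlib-only Python sketch of the classic partial-pooling formula: each subgroup's raw treatment-effect estimate is pulled toward the overall mean, and noisier (smaller-n) groups are pulled harder. All group names, sample sizes, and effect sizes below are invented for illustration; this is not the lme4/rstanarm machinery, just the idea behind it.

```python
import random
import statistics

random.seed(1)

# Hypothetical subgroups, each with a true treatment effect near 0.3
# and a noisy estimate from a subsample of a different size.
groups = ["women", "men", "Black", "White", "millennials", "Republicans"]
tau = 0.15                                  # sd of true effects across groups
true_effects = {g: random.gauss(0.3, tau) for g in groups}
ns = {g: random.choice([30, 60, 120, 500]) for g in groups}
sigma = 1.0                                 # residual sd of the outcome

raw, pooled = {}, {}
for g in groups:
    se = sigma / ns[g] ** 0.5               # std. error of the raw estimate
    raw[g] = random.gauss(true_effects[g], se)

grand_mean = statistics.mean(raw.values())
for g in groups:
    se2 = (sigma / ns[g] ** 0.5) ** 2
    w = tau**2 / (tau**2 + se2)             # pooling factor: 1 = no pooling
    pooled[g] = w * raw[g] + (1 - w) * grand_mean

for g in groups:
    print(f"{g:12s} n={ns[g]:4d} raw={raw[g]:+.3f} pooled={pooled[g]:+.3f}")
```

The small-n groups move furthest toward the grand mean, which is exactly the point above: regularization is what lets you keep many subgroup effects in the model without the estimates becoming too noisy.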

One thing we’re still struggling with is that there are all these possible models. Really we’d like to start and end with the full model, something like this, with all the interactions:

y ~ (1 + x1*x2*z | x3*x4)

But these models can be hard to handle. I think we need stronger priors, stronger than the current defaults in rstanarm. So for now I’d build up from the simple model, including interactions as appropriate.

In any case, you can get posterior uncertainties for whatever you want from stan_glmer() in rstanarm; simulations of all the parameters are directly accessible from the fitted object.

You can also aggregate however you want. It’s mathematically the same as Mister P; you’re just working with treatment effects rather than averages.
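The aggregation step amounts to a population-weighted average of subgroup effects, just as in poststratification. A tiny sketch (all subgroup effects and counts invented for illustration; in practice you would apply this to each posterior draw, not just a point estimate):

```python
# Hypothetical subgroup treatment effects and census-style counts.
effects = {"Black men": 0.40, "Black women": 0.25,
           "White men": 0.10, "White women": 0.15}
counts = {"Black men": 800, "Black women": 900,
          "White men": 3100, "White women": 3200}

total = sum(counts.values())
overall = sum(effects[c] * counts[c] for c in effects) / total
print(f"population-average treatment effect: {overall:.3f}")
```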


$ vs. votes

Carlos Cruz writes:

Here’s an economics joke. Two economists are walking along when they happen to end up in front of a Tesla showroom. One economist points to a shiny new car and says, “I want that!” The other economist replies, “You’re lying.”

The premise of this joke is that if the one economist had truly wanted the car then he would have walked into the showroom and bought it. The reason that he didn’t do so is because he evidently has more important things to spend his money on.

The basic economic problem is…

Society’s wants: unlimited
Society’s resources: limited

How people divide their limited dollars accurately reflects how they truly want society’s limited resources to be divided. This is the basic premise of the market.

From the perspective of biology, costly signals are credible signals.

The same is true from at least one perspective in psychology…

“If a woman told us that she loved flowers, and we saw that she forgot to water them, we would not believe in her “love” for flowers. Love is the active concern for the life and the growth of that which we love. Where this active concern is lacking, there is no love.” – Erich Fromm, The Art of Loving

Unlike spending, voting is a cheap signal, so it’s extremely curious that surveys primarily rely on voting instead of spending. There are precious few exceptions to this rule. Here are the ones that I know of…

1. Donating was used to determine whether men or women are better tippers.

2. Donating was used to determine which prominent skeptic to prank.

3. Donating was used to determine which theme to use for the libertarian convention…

$6,327.00 — I’m That Libertarian!
$5,200.00 — Building Bridges, Not Walls
$1,620.00 — Pro Choice on Everything
$1,377.77 — Empowering the Individual
$395.00 — The Power of Principle
$150.00 — Future of Freedom
$135.00 — Life, Liberty and the Pursuit of Happiness
$105.00 — Rise of the Libertarians
$75.00 — Free Lives Matter
$42.00 — Be Me, Be Free
$17.76 — Make Taxation Theft Again
$15.42 — Taxation is Theft
$15.00 — Jazzed About Liberty
$15.00 — All of Your Freedoms, All of the Time
$5.00 — Am I Being Detained!
$5.00 — Liberty Here and Now

Do you know of any others?

How different would the results have been if voting had been used instead of spending? For example, would the theme “Taxation is Theft” have been ranked higher or lower?

Voting and spending are very different things, so they must create very different hierarchies. If you search Google for “hierarchy”… the hierarchy of the results is determined by voting. Each link to a page counts as a vote for it. Google got this idea from how scholarly papers are ranked by citations. Each citation counts as a vote. All the videos on YouTube are also primarily ranked by voting. The only reason that so many people are talking about Jordan Peterson these days is because his video about pronouns received so many votes. His fans, at no cost to themselves, propelled him to prominence. Thanks to his lofty pedestal he now earns around $50,000/month on Patreon.

At the grocery store, on the other hand, we really don’t simply vote for our favorite products. Instead, we use our money to help rank them. The hierarchy of food is determined by spending. Same with clothes, cars and computers.

Voting and spending are used to rank many different things… yet their relative effectiveness has not been formally tested. I wish that I could effectively articulate the absurdity of this situation. What makes it especially absurd is that it wouldn’t be very costly to conduct a decent experiment. It’s not like it would be necessary to spend $5 billion to build a particle accelerator.

Imagine if a bunch of college students rank some books. Here are some potential books…

The Origin Of Species
Harry Potter and the Sorcerer’s Stone
The Handmaid’s Tale
A Tale of Two Cities
50 Shades of Grey
The Bible
War and Peace
A Theory of Justice
The Cat in the Hat
The Wealth of Nations
The Hunger Games

One group of students would use voting to rank them, while another would use spending. To be clear, the spenders wouldn’t be buying the books. They would simply be using their money to reveal the amount of love they have for each book. All the money they spent would help crowdfund the experiment.

How differently would voting and spending rank these books? Which hierarchy would be better? Which hierarchy would be closer to your own?

My guess is that voting elevates trash while spending elevates treasure. All the biggest improvements to the mainstream are naturally going to come from the margins. So, basically, spending is far better than voting at facilitating the most beneficial social evolution.

Imagine if the same experiment was conducted with beer. Would the students be willing to spend more money to rank beer than books? Colleges could be graded accordingly!

A group of people will have collectively tried a greater variety of beer than any single member of the group. When members compare their own beer preferences to the group’s beer preferences, naturally they will notice the disparities. Then the goal is to identify whether the individual or the group is mistaken. In some cases it will be the individual, in other cases it will be the group. The information that individuals share with the group will largely be for the purpose of eliminating errors.

Spending is the best way for detrimental disparities to be discovered and dispatched. Using another alliteration… earmarking is the examination that will most efficiently eliminate errors.

Here’s Karl Popper…

“If I am standing quietly, without making any movement, then (according to the physiologists) my muscles are constantly at work, contracting and relaxing in an almost random fashion, but controlled, without my being aware of it, by error-elimination so that every little deviation from my posture is almost at once corrected. So I am kept standing, quietly, by more or less the same method by which an automatic pilot keeps an aircraft steadily on its course.” — Karl Popper, Of Clouds and Clocks

Here’s Adam Smith…

“It is thus that the private interests and passions of individuals naturally dispose them to turn their stocks towards the employments which in ordinary cases are most advantageous to the society. But if from this natural preference they should turn too much of it towards those employments, the fall of profit in them and the rise of it in all others immediately dispose them to alter this faulty distribution. Without any intervention of law, therefore, the private interests and passions of men naturally lead them to divide and distribute the stock of every society among all the different employments carried on in it as nearly as possible in the proportion which is most agreeable to the interest of the whole society.” — Adam Smith, Wealth of Nations

As far as I know, even though this passage is perfectly relevant to Popper’s point, he never referred to it. I’m guessing that he wasn’t even aware of it.

We all have very limited perspectives so it’s way too easy to overlook important things. However, our perspectives aren’t equally limited, which is why it’s so beneficial to know the group’s perspective. Are voting and spending equally effective at revealing the group’s perspective? I’m guessing that spending is far more effective… but I could be wrong.

My reply:

There are two questions here, the measurement question and the policy question.

1. The measurement question: How best to measure people’s likes, goals, desires, etc.? Should you ask people what they want, or should you ask them to spend money to convey what they want? This can be considered as a problem in survey measurement, and the best method of measurement will depend on the context. As you note, if the question is, How much are you willing to spend on a car?, it makes more sense to see what people actually spend than to ask them what they would do. If the question is, What book do you prefer?, then the best approach, I think, would not be to ask them, nor would it be to ask them to spend money, but rather to see what books they actually read. If the question is, What’s a good slogan?, then I don’t see the point of asking people to spend money, as I don’t see how this relates at all to the larger goals.

2. The policy question: I’ve heard this said before, that voting is a cheap signal and so should not be taken seriously. Oddly enough, the people who make this argument often also make the argument that voting is irrational so people shouldn’t do it. But, of course, voting is only irrational to the extent that it’s not cheap. I’m not particularly sympathetic to the “voting is irrational” argument. Regarding the argument that voting is a cheap signal which we should not use to make decisions: I guess the question here is, what is the alternative? Voting by dollars has the obvious problem that people with more dollars get more votes.


“Economic predictions with big data” using partial pooling

Tom Daula points us to this post, “Economic Predictions with Big Data: The Illusion of Sparsity,” by Domenico Giannone, Michele Lenza, and Giorgio Primiceri, and writes:

The paper wants to distinguish between variable selection (sparse models) and shrinkage/regularization (dense models) for forecasting with Big Data. “We then conduct Bayesian inference on these two crucial parameters—model size and the degree of shrinkage.” This is similar to your recent posts on the two-way interaction of machine learning and Bayesian inference, as well as to your multiple comparisons paper. The conclusion is the data indicate variable selection is bad, a zero coefficient with zero variance is too strong. My intuition is that the results are not surprising since the data favoring exactly 0 is so unlikely, but I assume the paper fleshes out the nuance (or explains why that intuition is wrong).

Here is the abstract:

We compare sparse and dense representations of predictive models in macroeconomics, microeconomics, and finance. To deal with a large number of possible predictors, we specify a prior that allows for both variable selection and shrinkage. The posterior distribution does not typically concentrate on a single sparse or dense model, but on a wide set of models. A clearer pattern of sparsity can only emerge when models of very low dimension are strongly favored a priori.

I [Daula] haven’t read the paper yet, but noticed the priors while skimming.

(p.4) “The priors for the low dimensional parameters φ and σ^2 are rather standard, and designed to be uninformative.” The coefficient vector φ has a flat prior, which you’ve shown is not uninformative, and the prior on σ^2 is inversely proportional to σ^2 (no idea where that comes from, but it’s nothing like what you recommend in the Stan documentation).

The overall setup seems reasonable, but I’m curious how you would set it up if you had your druthers.

My quick response is that I’m sympathetic to the argument of Giannone et al., as it’s similar to something I wrote a few years ago, Whither the “bet on sparsity principle” in a nonsparse world? Regarding more specific questions of modeling: Yes, I think they should be able to do better than uniform or other purportedly noninformative priors. When it comes to methods for variable selection and partial pooling, I guess I’d recommend the regularized horseshoe from Juho Piironen and Aki Vehtari.
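The sparse-vs-dense point can be seen in a toy simulation. Under a dense truth, where every coefficient is nonzero but small, hard selection (zeroing out estimates that don't clear a significance threshold) loses badly to plain linear shrinkage. This is only an illustrative sketch with made-up numbers, not the Giannone et al. analysis or the regularized horseshoe:

```python
import random

random.seed(7)

p, sigma = 1000, 1.0
# Dense truth: every coefficient is nonzero but small.
beta = [random.gauss(0, 0.3) for _ in range(p)]
est = [b + random.gauss(0, sigma) for b in beta]  # noisy estimates, se = 1

# Hard selection: keep only estimates beyond ~2 standard errors.
selected = [e if abs(e) > 2 * sigma else 0.0 for e in est]
# Linear (ridge-style) shrinkage toward zero, factor tau^2 / (tau^2 + se^2).
w = 0.3**2 / (0.3**2 + sigma**2)
shrunk = [w * e for e in est]

def mse(hat):
    return sum((h - b) ** 2 for h, b in zip(hat, beta)) / p

print(f"selection MSE: {mse(selected):.3f}")
print(f"shrinkage MSE: {mse(shrunk):.3f}")
```

The selected estimates that survive the threshold are exactly the ones with the largest noise, so selection ends up with a much larger error than uniform shrinkage here; this is the "bet on sparsity in a nonsparse world" failure in miniature.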


These 3 problems destroy many clinical trials (in context of some papers on problems with non-inferiority trials, or problems with clinical trials in general)

Paul Alper points to this news article in Health News Review, which says:

A news release or story that proclaims a new treatment is “just as effective” or “comparable to” or “as good as” an existing therapy might spring from a non-inferiority trial.

Technically speaking, these studies are designed to test whether an intervention is “not acceptably worse” in terms of its effectiveness than what’s currently used. . . .

These trials have proliferated as drug and device makers find it harder to improve upon existing treatments. So instead, they devise products they hope work just as well but with an extra benefit, such as more convenient dosing, lower cost, or fewer side effects.

If a company can show its product is just as effective as the current standard treatment but with an added perk, it might gain a marketing edge.

Sounds like no problem so far: Why not have some drug that performs as well as its competitor but is better in some secondary way?

But the article continues:

Problem is, the studies used to generate that edge often aren’t considered trustworthy.

Generally speaking, non-inferiority trials are considered less credible than a more common trial design, the superiority trial, which determines whether one treatment outperforms another treatment or a placebo. That’s because non-inferiority trials are often based on murky assumptions that could favor the new product being tested.

Rarely do non-inferiority trials conclude that a new treatment is not non-inferior . . . That scarcity of negative findings “raises the provocative questions of whether industry-sponsored non-inferiority trials offer any value—aside from capturing market share,” wrote Vinay Prasad, MD, in an editorial in the Journal of Internal Medicine entitled “Non-Inferiority Trials in Medicine: Practice Changing or a Self-Fulfilling Prophecy?”

In a separate concern, ethical issues have been raised about whether some non-inferiority trials should be conducted at all, because they might expose patients to potentially worse treatments in order to advance a commercial goal.

From an article, “Non-inferiority trials: are they inferior? A systematic review of reporting in major medical journals,” by Sunita Rehal et al.:

Reporting and conduct of non-inferiority trials is inconsistent and does not follow the recommendations in available statistical guidelines, which are not wholly consistent themselves.

There’s a lot of discussion of “type 1 error rate,” which I don’t care about. True effects, or population differences, are never zero.

The general point is that non-inferiority trials, like clinical trials in general, can be gamed, and they are gamed.

The way it looks to me is that non-inferiority trials do have a lot of problems, and that these are problems that “regular” clinical trials have also. The problems include:
1. A statistical framework that is focused on the uninteresting question of zero true effect and zero systematic error,
2. A desire and an expectation to come up with certain conclusions from noisy data,
3. Incentives to cheat.

Regarding point 2: it’s worse than you might think. It’s not just that “statistical significance” is typically taken as tantamount to a certain claim that a treatment is effective. It’s also that non-significance is commonly taken as a certain claim that a treatment has no effect (see for example our discussion of stents). Since every result is either statistically significant or not, this gives you automatic certainty, no matter what the data are!
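The false certainty in point 2 is easy to simulate. With a real but modest effect and a noisy trial, a large share of replications come out "not significant," yet under the dichotomous reading each of those would be announced as "no effect." A stdlib-only sketch, with all trial parameters invented for illustration:

```python
import random

random.seed(42)

true_effect = 0.2   # a real, nonzero benefit (in sd units)
n_per_arm = 50
reps = 2000

significant = 0
for _ in range(reps):
    treat = [random.gauss(true_effect, 1) for _ in range(n_per_arm)]
    ctrl = [random.gauss(0, 1) for _ in range(n_per_arm)]
    diff = sum(treat) / n_per_arm - sum(ctrl) / n_per_arm
    se = (2 / n_per_arm) ** 0.5      # known-variance z-test for simplicity
    if abs(diff / se) > 1.96:
        significant += 1

power = significant / reps
print(f"share of trials declared 'significant': {power:.2f}")
# Every remaining trial would be declared 'no effect' under the
# significant/non-significant dichotomy -- even though the effect is real.
```

Here most replications of the same true effect land on the "no effect" side of the line, so the dichotomy manufactures certainty in both directions from the same noisy data.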

P.S. Full disclosure: I’ve had business relationships with Novartis, Astrazeneca and other drug companies.


The evolution of pace in popular movies

James Cutting writes:

Movies have changed dramatically over the last 100 years. Several of these changes in popular English-language filmmaking practice are reflected in patterns of film style as distributed over the length of movies. In particular, arrangements of shot durations, motion, and luminance have altered and come to reflect aspects of the narrative form. Narrative form, on the other hand, appears to have been relatively unchanged over that time and is often characterized as having four more or less equal duration parts, sometimes called acts – setup, complication, development, and climax. The altered patterns in film style found here affect a movie’s pace: increasing shot durations and decreasing motion in the setup, darkening across the complication and development followed by brightening across the climax, decreasing shot durations and increasing motion during the first part of the climax followed by increasing shot durations and decreasing motion at the end of the climax. . . .

I’m fascinated by the topic, and I love the idea of people studying storytelling in a systematic way.

People might also be interested in this other paper by Cutting, Narrative theory and the dynamics of popular movies:

Using a corpus analysis I explore a physical narratology of popular movies—narrational structure and how it impacts us—to promote a theory of popular movie form. I show that movies can be divided into 4 acts—setup, complication, development, and climax—with two optional subunits of prolog and epilog, and a few turning points and plot points. . . . In general, movie narratives have roughly the same structure as narratives in any other domain—plays, novels, manga, folktales, even oral histories—but with particular runtime constraints, cadences, and constructions that are unique to the medium.

Here’s one of the patterns he found:

P.S. Excellent surname for someone who studies the construction of films.


Hey! There are mathematicians out there who’ve never read Proofs and Refutations. Whassup with that??

I ran into a colleague the other day who’d never read Proofs and Refutations (full title: Proofs and Refutations: The Logic of Mathematical Discovery). He’d never even heard of it!


“She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.”

Robert Wiblin writes:

If we have a study on the impact of a social program in a particular place and time, how confident can we be that we’ll get a similar result if we study the same program again somewhere else?

Dr Eva Vivalt . . . compiled a huge database of impact evaluations in global development – including 15,024 estimates from 635 papers across 20 types of intervention – to help answer this question.

Her finding: not confident at all.

The typical study result differs from the average effect found in similar studies so far by almost 100%. That is to say, if all existing studies of an education program find that it improves test scores by 0.5 standard deviations, the next result is as likely to be negative or greater than 1 standard deviation as it is to be between 0 and 1 standard deviations.

She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.

Wiblin continues:

For researchers hoping to figure out what works and then take those programs global, these failures of generalizability and ‘external validity’ should be disconcerting.

Is ‘evidence-based development’ writing a cheque its methodology can’t cash?

Should we invest more in collecting evidence to try to get reliable results?

Or, as some critics say, is interest in impact evaluation distracting us from more important issues, like national economic reforms that can’t be tested in randomised controlled trials?

Wiblin also points to this article by Mary Ann Bates and Rachel Glennerster who argue that “rigorous impact evaluations tell us a lot about the world, not just the particular contexts in which they are conducted” and write:

If researchers and policy makers continue to view results of impact evaluations as a black box and fail to focus on mechanisms, the movement toward evidence-based policy making will fall far short of its potential for improving people’s lives.

I agree with this quote from Bates and Glennerster, and I think the whole push-a-button, take-a-pill, black-box attitude toward causal inference has been a disastrous mistake. I feel particularly bad about this, given that econometrics and statistics textbooks, including my own, have been pushing this view for decades.

Stepping back a bit, I agree with Vivalt that, if we want to get a sense of what policies to enact, it can be a mistake to try to be making these decisions based on the results of little experiments. There’s nothing wrong with trying to learn from demonstration studies (as here), but generally I think realism is more important than randomization. And, when effects are highly variable and measurements are noisy, you can’t learn much even from clean experiments.


A Bayesian take on ballot order effects

Dale Lehman sends along a paper, “The ballot order effect is huge: Evidence from Texas,” by Darren Grant, which begins:

Texas primary and runoff elections provide an ideal test of the ballot order hypothesis, because ballot order is randomized within each county and there are many counties and contests to analyze. Doing so for all statewide offices contested in the 2014 Democratic and Republican primaries and runoffs yields precise estimates of the ballot order effect across twenty-four different contests. Except for a few high-profile, high-information races, the ballot order effect is large, especially in down-ballot races and judicial positions. In these, going from last to first on the ballot raises a candidate’s vote share by nearly ten percentage points.

Lehman writes:

This is related to a number of themes you have repeatedly blogged about. While I have not checked the methodology in detail, among the points I think are relevant are:

– If the order listed on the ballot can have such a large impact, how does that jibe with statements you have made concerning unbelievably large claims of influences on voting behavior (shark attacks, etc.)?

– If a Bayesian analysis were being conducted, what would the prior look like? It seems that the issue has been researched before, but this study seems to find substantially larger effects – but based on an arguably better sample and better methodology. Should we raise the burden of proof for this study based on prior results?

– The study uses null hypothesis significance testing throughout. While I generally agree with your criticisms of that methodology, it is not clear to me why/how it is inappropriate in the attached study. In other words, it seems to me that the null of no ballot order effect might be a reasonable way to proceed.

My reply:

It definitely makes sense that ballot-order effects are larger in minor races. Indeed, last year I assessed a claim by political scientist Jon Krosnick that Trump’s 2016 election was determined by ballot order effects in Michigan, Wisconsin, and Florida. For ballot order to have made the difference, it would have had to cause a 1.2% swing in Florida. My best guess is that the effect would not have been that large. However, in the absence of ballot order effects, I think the election would’ve been much closer, so it’s fair enough to include ballot order among the several effects that, put together, made the difference in that close race.

To return to the Grant paper under discussion: Just at a technical level, there are some problems (for example: “Because ballot order is randomly determined, the most natural approach for analyzing the data would appear to be analysis of variance, with ballot order being the “treatment.” However, because vote shares are proportions, the assumptions for an analysis of variance are not met.”). But Grant ended up running a regression, which would be the standard analysis anyway, so we can set aside that digression. In addition, I think null hypothesis testing is irrelevant here, because nobody doubts there is some nonzero ballot order effect. The only question is how large is it in different elections; there’s no interest in testing its pure existence.

What I really want to see are some scatterplots. Without any sense of the data, I have to take the analysis on trust. And I was starting to get confused about some of the details here. With a randomized treatment effect, I’d think you’d start by just regressing vote share for each candidate on ballot order, displaying those coefficients, then going from there. Maybe a hierarchical model or whatever. Actually what they did was “A separate regression is estimated for each candidate but one; we adopt the convention of omitting the candidate who received the fewest votes.”—I can’t figure out why they didn’t do it with all the candidates. Also this: “This system is estimated with seemingly unrelated regression, to account for those inter-candidate correlations in popularity that are not captured by our controls”—maybe this is a good idea, I have no idea. And this: “the weight equals one half the base-10 logarithm of the number of voters”: this might work, but why the “one half”? That won’t affect the weights.
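The simple starting analysis, vote share regressed on randomized ballot position, takes only a few lines. Here is a sketch with simulated counties; the county count echoes Texas, but the effect size and all other numbers are invented, not Grant's data:

```python
import random

random.seed(3)

# Simulate 254 counties (Texas-sized), 4 candidates, randomized ballot order.
# Invented truth: moving one position down the ballot costs 2 points.
n_counties, n_cands = 254, 4
order_effect = -2.0

rows = []  # (ballot_position, vote_share) for one focal candidate
for _ in range(n_counties):
    order = random.sample(range(n_cands), n_cands)  # random ballot ordering
    pos = order.index(0)                            # focal candidate's slot
    share = 30 + order_effect * pos + random.gauss(0, 3)
    rows.append((pos, share))

# Ordinary least squares slope, computed by hand.
n = len(rows)
mx = sum(p for p, _ in rows) / n
my = sum(s for _, s in rows) / n
slope = (sum((p - mx) * (s - my) for p, s in rows)
         / sum((p - mx) ** 2 for p, _ in rows))
print(f"estimated ballot-order effect: {slope:.2f} points per position")
```

Because the ordering is randomized within each county, this plain regression recovers the (invented) effect without any of the seemingly-unrelated-regression or weighting machinery; that is the sense in which the simple analysis is the natural place to start.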

Anyway, my point here is not to bash the paper. I don’t really know enough to say whether a ballot order effect of 10 percentage points is plausible or not. As I said, it’s hard for me to evaluate a claim when the analysis is complicated and I can’t see the data. I expect that others who are interested in pursuing the topic will be able to try out different model specifications and explore further.
