Thinking about “Abandon statistical significance,” p-values, etc.

We had some good discussion the other day following up on the article, “Retire Statistical Significance,” by Valentin Amrhein, Sander Greenland, and Blake McShane.

I have a lot to say, and it’s hard to put it all together, in part because my collaborators and I have said much of it already, in various forms.

For now I thought I’d start by listing my different thoughts in a short post while I figure out how best to organize all of this.


There’s also the problem that these discussions can easily transform into debates. After proposing an idea and seeing objections, it’s natural to then want to respond to those objections, then the responders respond, etc., and the original goals are lost.

So, before going on, some goals:

– Better statistical analyses. Learning from data in a particular study.

– Improving the flow of science. More prominence to reproducible findings, less time wasted chasing noise.

– Improving scientific practice. Changing incentives to motivate good science and demotivate junk science.

Null hypothesis testing, p-values, and statistical significance represent one approach toward attaining the above goals. I don’t think this approach works so well anymore (whether it did in the past is another question), but the point is to keep these goals in mind.

Some topics to address

1. Is this all a waste of time?

The first question to ask is, why am I writing about this at all? Paul Meehl said it all fifty years ago, and people have been rediscovering the problems with statistical-significance reasoning every decade since, for example this still-readable paper from 1985, The Religion of Statistics as Practiced in Medical Journals, by David Salsburg, which Richard Juster sent me the other day. And, even accepting the argument that the battle is still worth fighting, why don’t I just leave this in the capable hands of Amrhein, Greenland, McShane, and various others who are evidently willing to put in the effort?

The short answer is I think I have something extra to contribute. So far, my colleagues and I have come up with some new methods and new conceptualizations—I’m thinking of type M and type S errors, the garden of forking paths, the backpack fallacy, the secret weapon, “the difference between . . .,” the use of multilevel models to resolve the multiple comparisons problem, etc. We haven’t just been standing on the street corner the past twenty years, screaming “Down with p-values!” We’ve been reframing the problem in interesting and useful ways.

How did we make these contributions? Not out of nowhere, but as a byproduct of working on applied problems, trying to work things out from first principles, and, yes, reading blog comments and answering questions from randos on the internet. When John Carlin and I write an article like this or this, for example, we’re not just expressing our views clearly and spreading the good word. We’re also figuring out much of it as we go along. So, when I see misunderstanding about statistics and try to clean it up, I’m learning too.

2. Paradigmatic examples

It could be a good idea to list the different sorts of examples that are used in these discussions. Here are a few that keep coming up:
– The clinical trial comparing a new drug to the standard treatment.

– “Psychological Science” or “PNAS”-style headline-grabbing unreplicable noise mining.

– Gene-association studies.

– Regressions for causal inference from observational data.

– Studies with multiple outcomes.

– Descriptive studies such as in Red State Blue State.

I think we can come up with more of these. My point here is that different methods can work for different examples, so I think it makes sense to put a bunch of these cases in one place so the argument doesn’t jump around so much. We can also include some examples where p-values and statistical significance don’t seem to come up at all. For instance, MRP to estimate state-level opinion from national surveys: nobody’s out there testing which states are statistically significantly different from others. Another example is item-response or ideal-point modeling in psychometrics or political science: again, these are typically framed as problems of estimation, not testing.

3. Statistics and computer science as social sciences

We’re used to statistical methods being controversial, with leading statisticians throwing polemics at each other regarding issues that are both theoretically fundamental and also core practical concerns. The fighting’s been going on, in different ways, for about a hundred years!

But here’s a question. Why is it that statistics is so controversial? The math is just math, no controversy there. And the issues aren’t political, at least not in a left-right sense. Statistical controversies don’t link up in any natural way to political disputes about business and labor, or racism, or war, or whatever.

In its deep and persistent controversies, statistics looks less like the hard sciences and more like the social sciences. Which, again, seems strange to me, given that statistics is a form of engineering, or applied math.

Maybe the appropriate point of comparison here is not economics or sociology, which have deep conflicts based on human values, but rather computer science. Computer scientists can get pretty worked up about technical issues which to me seem unresolvable: the best way to structure a programming language, for example. I don’t like to label these disputes as “religious wars,” but the point is that the level of passion often seems pretty high, in comparison to the dry nature of the subject matter.

I’m not saying that passion is wrong! Existing statistical methods have done their part to slow down medical research: lives are at stake. Still, stepping back, the passion in statistical debates about p-values seems a bit more distanced from the ultimate human object of concern, compared to, say, the passion in debates about economic redistribution or racism.

To return to the point about statistics and computer science: these two fields are fundamentally about how they are used. A statistical method or a computer ultimately connects to a human: someone has to decide what to do. So they both are social sciences, in a way that physics, chemistry, or biology are not, or not as much.

4. Different levels of argument

The direct argument in favor of the use of statistical significance and p-values is that it’s desirable to use statistical procedures with so-called type 1 error control. I don’t buy that argument because I think that selecting on statistical significance yields noisy conclusions. To continue the discussion further, I think it makes sense to consider particular examples, or classes of examples (see item 2 above). They talk about error control, I talk about noise, but both these concepts are abstractions, and ultimately it has to come down to reality.
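To make the contrast between “error control” and “noise” concrete, here’s a minimal simulation sketch (the effect size and standard error are illustrative numbers, not taken from any study discussed here). In a low-power setting, selecting on p < 0.05 exaggerates the magnitude of the estimate (type M error) and carries a real chance of getting the sign wrong (type S error):

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # small true effect (illustrative)
se = 0.5            # large standard error: a noisy, low-power study
n_sims = 100_000

est = rng.normal(true_effect, se, n_sims)   # simulated unbiased estimates
signif = np.abs(est / se) > 1.96            # the "p < 0.05" filter

power = signif.mean()
exaggeration = np.abs(est[signif]).mean() / true_effect   # type M error
wrong_sign = (np.sign(est[signif]) != np.sign(true_effect)).mean()  # type S

print(f"power ~ {power:.2f}")
print(f"expected exaggeration factor ~ {exaggeration:.1f}")
print(f"Pr(wrong sign | significant) ~ {wrong_sign:.2f}")
```

With these illustrative numbers, the statistically significant estimates overstate the true effect by roughly an order of magnitude, which is the sense in which selecting on significance yields noisy conclusions.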

There are also indirect arguments. For example: 100 million p-value users can’t be wrong. Or: Abandoning statistical significance might be a great idea, but nobody will do it. I’d prefer to have the discussion at the more direct level of what’s a better procedure to use, with the understanding that it might take a while for better options to become common practice.

5. “Statistical significance” as a lexicographic decision rule

This is discussed in detail in my article with Blake McShane, David Gal, Christian Robert, and Jennifer Tackett:

[In much of current scientific practice], statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap 2010; Bem 2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise . . . We propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

6. Confirmationist and falsificationist paradigms of science

I wrote about this a few years ago:

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

It is my impression that in the vast majority of cases, “statistical significance” is used in a confirmationist way. To put it another way: the problem is not just with the p-value, it’s with the mistaken idea that falsifying a straw-man null hypothesis is evidence in favor of someone’s pet theory.

7. But what if we need to make an up-or-down decision?

This comes up a lot. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically-fed, free-range chicken. This might be a good idea for any of us individually or with small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Regarding the economics, the point that we made in section 4.4 of our paper is that decisions are not currently made in an automatic way. Papers are reviewed by hand, one at a time.

As Peter Dorman puts it:

The most important determinants of the dispositive power of statistical evidence should be its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.

To put it another way, there are two issues here: (a) the potential benefits of an automatic screening or decision rule, and (b) using a p-value (null-hypothesis tail area probability) for such a rule. We argue against using screening rules (or at least for using them much less often). But in the cases where screening rules are desired, we see no reason to use p-values for this.

8. What should we do instead?

To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.

As Eric Loken and I put it:

Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.

For a couple more examples, consider the two studies discussed in section 2 of this article. For both of them, nothing is gained and much is lost by passing results through the statistical significance filter.

Again, the use of standard errors and uncertainty intervals is not just significance testing in another form. The point is to use these uncertainties as a way of contextualizing estimates, not to declare things as real or not.

The next step is to recognize multiplicity in your problem. Consider this paper, which contains many analyses but not a single p-value or even a confidence interval. We are able to assess uncertainty by displaying results from multiple polls. Yes, it is possible to have data with no structure at all—a simple comparison with no replications—and for these I’d just display averages and uncertainties. But this is rare, as such simple comparisons are typically part of a stream of results in a larger research project.

One can and should continue with multilevel models and other statistical methods that allow more systematic partial pooling of information from different sources, but the secret weapon is a good start.
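For concreteness, here’s a minimal sketch of the secret-weapon idea, with simulated polls standing in for real data (all the numbers below are made up): estimate the same quantity separately in each dataset and display the estimates with their standard errors side by side, rather than running a test on any one of them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Secret weapon: the same simple estimate, computed separately for each
# of several (here, simulated) polls, displayed with +/- 1 standard error.
true_support = [0.52, 0.53, 0.51, 0.55, 0.54]  # made-up per-poll values
results = []
for year, p in zip(range(2000, 2005), true_support):
    n = 800                          # respondents per poll (illustrative)
    est = rng.binomial(n, p) / n     # sample proportion
    se = np.sqrt(est * (1 - est) / n)
    results.append((year, est, se))
    print(f"{year}: {est:.3f} +/- {se:.3f}")
```

Laid out this way, the variation in the estimates across polls itself conveys the uncertainty, with no thresholding anywhere.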


My current plan is to write this all up as a long article, Unpacking the Statistical Significance Debate and the Replication Crisis, and put it on arXiv. That could reach people who don’t feel like engaging with blogs.

In the meantime, I’d appreciate your comments and suggestions.

Impact of published research on behavior and avoidable fatalities

In a paper entitled, “Impact of published research on behavior and avoidable fatalities,” Addison Kramer, Alexandra Kirk, Faizaan Easton, and Bertram Hester write:

There has long been speculation of an “informational backfire effect,” whereby the publication of questionable scientific claims can lead to behavioral changes that are counterproductive in the aggregate. Concerns of informational backfire have been raised in many fields that feature an intersection of research and policy, including education, medicine, and nutrition—but it has been difficult to study this effect empirically because of confounding of the act of publication with the effects of the research ideas in question through other pathways. In the present paper we estimate the informational backfire effect using a unique identification strategy based on the timing of publication of high-profile articles in well-regarded scientific journals. Using measures of academic citation, traditional media mentions, and social media penetration, we show, first, that published claims backed by questionable research practices receive statistically significantly wider exposure, and, second, that this exposure leads to large and statistically significant aggregate behavioral changes, as measured by a regression discontinuity analysis. The importance of this finding can be seen using a case study in the domain of alcohol consumption, where we demonstrate that publication of research papers claiming a safe daily dose is linked to increased drinking and higher rates of drunk driving injuries and fatalities, with the largest proportional increases occurring in states with the highest levels of exposure to news media science and health reporting.

I don’t know how much to believe all this, as there are the usual difficulties of studying small effects using aggregate data—the needle-in-a-haystack problem—and I’d like to see the raw data. But in any case I wanted to share this with you, as it relates to various discussions we’ve had such as here, for example. Also this relates to general questions we’ve had regarding the larger effects of scientific research on our thoughts and behaviors.

Here’s an idea for not getting tripped up with default priors . . .

I put this in the Prior Choice Recommendations wiki awhile ago:

“The prior can often only be understood in the context of the likelihood”:

Here’s an idea for not getting tripped up with default priors: For each parameter (or other quantity of interest, “qoi”), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.” Then the user can go back and check that the default prior makes sense for this particular example.
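Here’s a minimal sketch of what that check might look like in code (the function name and the fitted-model numbers below are hypothetical, just to show the rule):

```python
def check_default_priors(prior_sd, posterior_sd, threshold=0.1):
    """Flag parameters whose default prior may be informative.

    Following the rule above: if the posterior sd is more than
    `threshold` times the prior sd, the data have not overwhelmed the
    prior, so the default prior deserves a second look.
    """
    notes = {}
    for name in prior_sd:
        informative = posterior_sd[name] > threshold * prior_sd[name]
        notes[name] = informative
        if informative:
            print(f"The prior distribution for {name} is informative.")
    return notes

# Hypothetical fitted-model summaries (not output from a real model):
notes = check_default_priors(
    prior_sd={"alpha": 10.0, "beta": 2.5},
    posterior_sd={"alpha": 0.3, "beta": 0.9},
)
```

Here the note would fire for beta (0.9 > 0.1 × 2.5) but not for alpha (0.3 < 0.1 × 10).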

I’ve not incorporated this particular method into my workflow, but I like the idea and I’d like to study it further. I think this idea, or something like it, could be important.

David Weakliem on the U.S. electoral college

The sociologist and public opinion researcher has a series of excellent posts here, here, and here on the electoral college. Here’s the start:

The Electoral College has been in the news recently. I [Weakliem] am going to write a post about public opinion on the Electoral College vs. popular vote, but I was diverted into writing about the arguments offered in favor of it.

An editorial in the National Review says “it prevents New York and California from imposing their will on the rest of the country.” Taken literally, that is ridiculous–those two states combined had about 16% of the popular vote in 2016. But presumably the general idea is that the Electoral College makes it harder for a small number of large states to provide a victory. . . . In 2016, 52% of the popular vote came from 10 states: California, Florida, Texas, New York, Pennsylvania, Illinois, Ohio, Michigan, North Carolina, and Georgia (in descending order of number of votes). In the Electoral College, those states combined had 256 electoral votes–in order to win, you would need to add New Jersey (14). Even if you think the difference between ten and eleven states is important, the diversity of the ten biggest states is striking–there’s no way a candidate could win all of them without winning a lot of others.
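Weakliem’s arithmetic is easy to verify; here’s a quick sketch using the electoral-vote apportionment in effect for 2016:

```python
# Electoral votes (2012-2020 apportionment) for the ten states Weakliem
# lists as casting 52% of the 2016 popular vote.
ev = {
    "California": 55, "Florida": 29, "Texas": 38, "New York": 29,
    "Pennsylvania": 20, "Illinois": 20, "Ohio": 18, "Michigan": 16,
    "North Carolina": 15, "Georgia": 16,
}
total = sum(ev.values())
print(total)              # electoral votes from the ten biggest states
print(total + 14 >= 270)  # adding New Jersey's 14 reaches a majority
```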

Good point. Weakliem continues:

The National Review also says that the Electoral College keeps candidates from “retreating to their preferred pockets and running up the score.” That assumes that it’s easier to add to your lead when you already have a lead than when you are close or behind. That may be true in some sports, but in getting votes it seems that things would be more likely to go in the other direction–if you don’t have much support in a place, you have little to lose and a lot to gain. If it made any difference, election by popular vote would probably encourage parties to look outside their “preferred pockets”–e.g., the Republicans might try to compete in California rather than write it off.

I’d not thought of that before, but that sounds right. I guess we’re assuming there’s no large-scale cheating. There could be a concern that one-party-dominant states could cheat in the vote counting, or even more simply by making it harder for voters of one party to vote. Then again, this already happens, so if cheating is a concern, I think the appropriate solution is more transparency in vote counting and in the rules for where people can vote.

Weakliem then talks about public opinion:

There is always more support for abolishing [the electoral college] than keeping it—until 2016, a lot more. . . . The greatest support for abolishing it (80%) was in November 1968, right after the third-party candidacy of George Wallace, which had the goal of preventing an Electoral College majority. The election of 2000 had much less impact on opinions than 2016, maybe because of the general increase in partisanship since 2000.

A lot of recent commentary has treated abolishing the Electoral College as a radical cause, but the public generally likes the idea. . . .


I suspect that most people don’t have strong opinions, and will just follow their party, so that if it becomes a significant topic of debate there will be something close to a 50/50 split.

And then he breaks things down a bit:

The percent in favor of electing the president by popular vote in surveys ending on October 9, 2011 and November 20, 2016:

              2011   2016
Democrats      74%    77%
Independents   70%    60%
Republicans    53%    28%

Weakliem presented these numbers to a decimal place, but that is poor form given that the variation in these numbers is much more than 1 percentage point, so it would be like reporting your weight as 193.4 pounds.

One thing I do appreciate is that Weakliem just presents the Yes proportions. Lots of times, people present both Yes and No rates, which gives you twice as many numbers to wade through, and then comparisons become much more difficult. So good job on the clean display.

Anyway, he continues with some breakdowns by state:

I used the 2011 survey to look for factors affecting state-level support. I considered number of electoral votes, margin of victory, and region. Support for the electoral college was somewhat higher in small states, which is as expected since it gives their voters more weight. There was no evidence that being in a state where the vote was close made any difference . . . Finally, the only regional distinction that appeared to matter was South vs. non-South. That makes some sense, since despite the talk about “coastal enclaves” vs. “heartland,” the South is still the most regionally distinctive part, and southerners may think that the electoral college protects their regional interests . . .

Funny that support for the electoral college isn’t higher in swing states. It’s not that I think swing-state voters are so selfish that they want the electoral college to preserve their power; it’s more the opposite, that I’d think voters in non-swing states would get annoyed that their votes don’t count. But, hey, I guess not: voters are thinking at the national, not the state level.

Lots more to look at here, I’m sure; also this is an instructive example of how much can be learned by looking carefully at available data.

P.S. I’m posting this now rather than with the usual 6-month delay, not because the subject is particularly topical—if anything, I expect it will become more topical as we go forward toward the next election—but because it demonstrates this general point of learning from observational data by looking at interesting comparisons and time trends. I’d like to have this post up, so I can point students to it when they are thinking of projects involving learning from social science data.

An interview with Tina Fernandes Botts

Hey—this is cool!

What happened was, I was scanning this list of Springbrook High School alumni. And I was like, Tina Fernandes? Class of 1982? I know that person. We didn’t know each other well, but I guess we must have been in the same homeroom a few times? All I can remember from back then is that Tina was a nice person and that she was outspoken. So it was fun to see this online interview, by Cliff Sosis, from 2017. Thanks, Cliff!

P.S. As a special bonus, here’s an article about Chuck Driesell. Chuck and I were in the same economics class, along with Yitzhak. Chuck majored in business in college, Yitzhak became an economics professor, and I never took another econ course again. Which I guess explains how I feel so confident when pontificating about economics.

P.P.S. And for another bonus, I came across this page where Ted Alper (class of 1980) answers random questions. It’s practically a blog!

Surgeon promotes fraudulent research that kills people; his employer, a leading hospital, defends him and attacks whistleblowers. Business as usual.

Paul Alper writes:

A couple of times, at my suggestion, you’ve blogged about Paolo Macchiarini.

Here is an update from Susan Perry in which she interviews the director of the Swedish documentary about Macchiarini:

Indeed, Macchiarini made it sound as if his patients had recovered their health when, in fact, the synthetic tracheas he had implanted in their bodies did not work at all. His patients were dying, not thriving.

In 2015, the investigator concluded that Macchiarini had, indeed, committed research fraud. Yet the administrators [at Sweden’s Karolinska Institute] continued to defend their star surgeon — and threatened the whistleblowers with dismissal.

But then there was the fact that the leadership of the hospital and the institute had, instead of listening to the complaints, gone after the whistleblowers and had even complained [about them] to the police.

What was he thinking???

Check out this stunning exchange from the interview:

MinnPost: Did you come to any conclusion about what was motivating [Macchiarini]? It seemed at times at the documentary that he really cared about the patients. He seemed moved by them. And, yet, he then abandons them. He doesn’t follow up with them.

Bosse Lindquist [director of the documentary about this story]: I think that he feels that he deserves success in life and that he ultimately deserves something like a Nobel Prize or something like that. He thinks the world just hasn’t quite seen his excellence yet and that they will eventually. He believes that he’s helping mankind, and I think that he construes reality in such a way that he actually thinks that he was doing good with these patients, but that there were minor problems and stuff that sort of [tripped him up].

This jibes with my impressions in other, nonlethal, examples of research incompetence and research fraud: The researcher believes that he or she is an important person doing important work, and thinks of criticisms of any sort as a bunch of technicalities getting in the way of pathbreaking, potentially life-changing advances. And, of course, once you frame things in this way, a simple utilitarian calculation implies that you’re justified in all sorts of questionable behavior to derail your critics.

All of this is, in some sense, a converse to Clarke’s Law, and it also points to a general danger with utilitarianism—or, to put it another way, it points to the general value of rules and norms.

And what about the whistleblowers?

MP: And what about the whistleblowers? Have they been able to go back to their careers without any professional harm?

BL: No. Two of them have had to change cities and hospitals. Two are still there, but they have been subjected to threats from management and from some of their colleagues who were involved with Macchiarini. They have not received any new grants since this whole thing happened. It’s a crying shame.

MP: That’s quite a terrible outcome, because that may stop other people from stepping forward in similar situations.

BL: Exactly.

MP: Do you feel that everyone who was responsible for ignoring the warnings about Macchiarini has resigned or been fired?

BL: No, no, no. A number of people are still there and have their old jobs and just carry on. Some have been forced to change jobs, to get another job — but in some other function within the hospital or in the government.

And, finally . . .


MP: What has happened to the patients. One was able to successfully have the tube removed, is that correct?

BL: Yeah. One person.

MP: And everybody else has died?

BL: Yes.

The whole thing is no damn joke.

I originally called this “research-lies-allegations-windpipe update update,” but I can’t laugh about this anymore, hence the revised title above.

P.S. Alper writes:

According to the NYT’s Gretchen Reynolds, the Institute is looking into breathing again:

Two dozen healthy young male and female volunteers inhaled 12 different scents from small vials held to their noses. Some of the smells were familiar, like the essence of orange, while others were obscure. The subjects were told to memorize each scent. They went through this process on two occasions. For one, they sat quietly for an hour immediately after the sniffing, with their noses clipped shut to prevent nasal breathing; on the other, they sat for an hour with tape over their mouths to prevent oral breathing.

The men and women were consistently much better at recognizing smells if they breathed through their noses during the quiet hour. Mouth breathing resulted in fuzzier recall and more incorrect answers.

But, no numerical notion of “how much better.” And only “two dozen” subjects? Despite the defrocking of Paolo Macchiarini, the Karolinska Institute is undoubtedly still solvent so it seems strange that it undertakes a study that is more typical of a psychology professor, who has little or no funding, and seeks a publication using his students as convenient subjects. One is reminded of the famous sweaty T-shirt study.

I guess there’s always a market for one-quick-trick-that-will-change-your-life.

Most Americans like big businesses.

Tyler Cowen asks:

Why is there so much suspicion of big business?

Perhaps in part because we cannot do without business, so many people hate or resent business, and they love to criticize it, mock it, and lower its status. Business just bugs them. . . .

The short answer is, No, I don’t think there is so much suspicion of big business in this country. No, I don’t think people love to criticize, mock and lower the status of big business.

This came up a few years ago, and at the time I pulled out data from a 2007 survey showing that just about every big business you could think of was popular, with the only exception being oil companies. Microsoft, Walmart, Citibank, GM, Pfizer: you name it, the survey respondents were overwhelmingly positive.

Nearly two-thirds of respondents say corporate profits are too high, but, “more than seven in ten agree that ‘the strength of this country today is mostly based on the success of American business’ – an opinion that has changed very little over the past 20 years.”

Corporations are more popular with Republicans than with Democrats, but most of the corporations in the survey were popular with a clear majority in either party.

Big business does lots of things for us, and the United States is a proudly capitalist country, so it’s no shocker that most businesses in the survey were very popular.

So maybe the question is, Why did an economist such as Cowen think that people view big business so negatively?

My quick guess is that we notice negative statements more than positive statements. Cowen himself roots for big business, he’s generally on the side of big business, so when he sees any criticism of it, he bristles. He notices the criticism and is bothered by it. When he sees positive statements about big business, that all seems so sensible that perhaps he hardly notices. The negative attitudes are jarring to him and hence more noticeable. Perhaps in the same way that I notice bad presentations of data. An ugly table or graph is to me like fingernails on the blackboard.

Anyway, it’s perfectly reasonable for Cowen to be interested in those people who “hate or resent business, and they love to criticize it, mock it, and lower its status.” We should just remember that, at least from these survey data, it seems that this is a small minority of people.

Why did I write this post?

The bigger point here is that this is an example of something I see a lot, which is a social scientist or pundit coming up with theories to explain some empirical pattern in the world, but it turns out the pattern isn’t actually real. This came up years ago with Red State Blue State, when I noticed journalists coming up with explanations for voting patterns that were not happening (see for example here) and of course it comes up a lot with noise-mining research, whether it be a psychologist coming up with theories to explain ESP, or a sociologist coming up with theories to explain spurious patterns in sex ratios.

It’s fine to explain data; it’s just important to be aware of what’s being explained. In the context of the above-linked Cowen post, it’s fine to answer the question, “If business is so good, why is it so disliked?”—as long as this sentence is completed as follows: “If business is so good, why is it so disliked by a minority of Americans?” Explaining minority positions is important; we should just be clear it’s a minority.

Or of course it’s possible that Cowen has access to other data I haven’t looked at, perhaps more recent surveys that would modify my empirical understanding. That would be fine too.

P.S. The title of this post was originally “Most Americans like big business.” I changed the last word to “businesses” in response to commenters who pointed out that most Americans express negative views about “big business” in general, but they like most individual big businesses that they’re asked about.

Markov chain Monte Carlo doesn’t “explore the posterior”

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!
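The sampler behind the figure above is easy to sketch. Here is a minimal Python version (the function name, starting point, and seed are my own choices for illustration, not from the post) of isotropic random-walk Metropolis targeting the same bivariate normal with unit variances and 0.9 correlation:

```python
import numpy as np

def rw_metropolis(n_iter, scale=0.2, rho=0.9, seed=0):
    """Isotropic random-walk Metropolis targeting a bivariate normal
    with unit variances and correlation rho."""
    rng = np.random.default_rng(seed)
    # Log density of the target, up to an additive constant.
    inv_cov = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    def log_p(theta):
        return -0.5 * theta @ inv_cov @ theta

    draws = np.empty((n_iter, 2))
    theta = np.array([-3.0, 3.0])  # deliberately start outside the typical set
    for m in range(n_iter):
        proposal = theta + scale * rng.standard_normal(2)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < log_p(proposal) - log_p(theta):
            theta = proposal
        draws[m] = theta
    return draws

draws = rw_metropolis(5000)
```

Plotting `draws` after 50, 500, and 5000 iterations reproduces the qualitative story in the caption: first the chain wanders toward the typical set, then it mixes through it.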

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of 2^{20}, or more than a million orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture—the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.
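To make this concrete (my own illustration, not from the post), we can simulate from a 20-dimensional standard normal and count how many distinct orthants even 10,000 draws manage to visit:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 10_000
samples = rng.standard_normal((n, d))

# Encode each draw's orthant as its pattern of coordinate signs,
# then count the distinct patterns visited.
orthants = {tuple(row > 0) for row in samples}
frac = len(orthants) / 2 ** d
```

Even in the best case, 10,000 draws can visit at most 10,000 of the 1,048,576 orthants—under one percent—and the shortfall only gets worse as the dimension grows.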

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

\begin{array}{rcl} \hat{\theta} & = & \mathbb{E}[\theta \mid y] \\[8pt] & = & \int_{\Theta} \theta \times p(\theta \mid y) \, \mbox{d}\theta \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \theta^{(m)},\end{array}

where \theta^{(1)}, \ldots, \theta^{(M)} are draws from the posterior p(\theta \mid y).
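As a toy check (the posterior here is invented for illustration), if we had draws from a normal(2, 1) posterior, the Monte Carlo average of the draws recovers the posterior mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend these are M posterior draws of theta; for illustration,
# take the posterior to be normal with mean 2 and scale 1.
M = 100_000
theta_draws = rng.normal(loc=2.0, scale=1.0, size=M)

# Monte Carlo estimate of the posterior mean E[theta | y]:
theta_hat = theta_draws.mean()
```

The Monte Carlo standard error scales like 1/sqrt(M) (divided by the effective sample size for correlated MCMC draws), so a few decimal places of accuracy is realistic without visiting anywhere near all of the posterior.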

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

p(\tilde{y} \mid y) \ = \ \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

\mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right] \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].
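As a toy illustration (posterior draws invented for the example: independent normal(1, 1) and normal(0, 1) marginals), the event probability is just the average of the indicator over draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for two parameters.
M = 100_000
theta1 = rng.normal(1.0, 1.0, size=M)
theta2 = rng.normal(0.0, 1.0, size=M)

# Pr[theta1 > theta2 | y] as the posterior mean of an indicator function.
prob = np.mean(theta1 > theta2)
```

With these made-up marginals the true value is Phi(1/sqrt(2)), about 0.76, and the Monte Carlo estimate lands right on it—though, as noted below, for small tail probabilities the same number of draws buys you only zeroes.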

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.

Jonathan (another one) does Veronica Geng does Robert Mueller

Frequent commenter Jonathan (another one) writes:

I realize that so many people bitch about the seminar showdown that you might need at least one thank you. This year, I managed to re-read the bulk of Geng, and for that I thank you. I have not yet read any Sattouf, but it clearly has made an impression on you, so it’s on my list.

In thanks, here’s my first brief foray into pseudo-Gengiana. I think I’ve got the tone roughly right, but I’m way short on whimsy; this is what I managed in a sustained fifteen-minute effort. Thanks again.

My fellow Americans:

As you are no doubt aware, I have completed my investigation and report. I write this to inform you of an unfortunate mishap from Friday. Many news outlets have reported that my final report was taken by a security guard from my offices to the Justice Department. That is not true. In an attempt to maintain my obsessive secrecy, that was a dummy report, actually containing the text of an unpublished novel by David Foster Wallace that we found in Michael Cohen’s safe. We couldn’t understand it—maybe Bill Barr will have better luck.

The real one was handed to my intern, Jeff, in an ordinary interoffice envelope, and Jeff was told to drop it off at Justice on his way home. He lives nearby with six other interns. Not knowing what he had, he stopped off at the Friday Trivia Happy Hour at the Death and Taxes Pub, drank a little too much, and left the report there. We’ve gone back to look and nobody can find it.

So why not just print out another one? Or for that matter, why didn’t I just email the first report? As you’ve no doubt gleaned by now, computers and email aren’t my thing. As my successor at the FBI, Mr. Comey, demonstrated, email baffles just about all of us. And I don’t use a computer. So there isn’t another copy of the real report. I’ve got all my notes, though, so I ought to be able to cobble together a new report in a couple of months.

Apologies for the delay,
Robert Mueller

PS: Jeff has been chastised. We haven’t fired him, but in asking him about this he let slip that his parents didn’t pay taxes on the nanny who raised him and they may have strongly implied that he played on a high school curling team to get into college. His parents are going to jail and the nanny’s immigration status is being investigated. This requires a short re-opening of the investigation.

The mention of “Jeff” seems particularly Geng-like to me. Perhaps I’m reminded of “Ed.” Thinking of Geng makes me a bit sad, though, not just for her but because it reminds me of the passage of time. I associate Geng, Bill James, and Spy magazine with the mid-1980s. Ahhh, lost youth!

Yes, I really really really like fake-data simulation, and I can’t stop talking about it.

Rajesh Venkatachalapathy writes:

Recently, I had a conversation with a colleague of mine about the virtues of synthetic data and their role in data analysis. I think I’ve heard a sermon/talk or two where you mention this and also in your blog entries. But having convinced my colleague of this point, I am struggling to find good references on this topic.

I was hoping to get some leads from you.

My reply:

Hi, here are some refs: from 2009, 2011, 2013, also this and this and this from 2017, and this from 2018. I think I’ve missed a few, too.

If you want something in dead-tree style, see Section 8.1 of my book with Jennifer Hill, which came out in 2007.

Or, for some classic examples, there’s Bush and Mosteller with the “stat-dogs” in 1954, and Ripley with his simulated spatial processes from, ummmm, 1987 I think it was? Good stuff, all. We should be doing more of it.
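For readers who want a concrete starting point, here is a minimal fake-data simulation in Python (the model and “true” parameter values are invented for illustration): pick parameters, simulate data from the assumed model, fit, and check that the procedure recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Step 1: choose "true" parameters for a simple linear regression.
true_alpha, true_beta, true_sigma = 1.5, -0.8, 0.5

# Step 2: simulate fake data from the assumed model.
n = 1_000
x = rng.uniform(-2, 2, size=n)
y = true_alpha + true_beta * x + rng.normal(0, true_sigma, size=n)

# Step 3: fit by least squares and compare estimates to the known truth.
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

If the estimates don’t land close to the known true values (relative to their standard errors), something is wrong with the model, the code, or both—which is exactly why this exercise is worth doing before touching real data.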