Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A few months ago I sent the following message to some people:

Dear philosophically-inclined colleagues:

I’d like to organize an online discussion of Deborah Mayo’s new book.

The table of contents and some of the book are here at Google books, also in the attached pdf and in this post by Mayo.

I think that many, if not all, of Mayo’s points in her Excursion 4 are answered by my article with Hennig here.

What I was thinking for this discussion is that if you’re interested you can write something, either a review of Mayo’s book (if you happen to have a copy of it) or a review of the posted material, or just your general thoughts on the topic of statistical inference as severe testing.

I’m hoping to get this all done this month, because it’s all informal and what’s the point of dragging it out, right? So if you’d be interested in writing something on this that you’d be willing to share with the world, please let me know. It should be fun, I hope!

I did this in consultation with Deborah Mayo, and I just sent this email to a few people (so if you were not included, please don’t feel left out! You have a chance to participate right now!), because our goal here was to get the discussion going. The idea was to get some reviews, and this could spark a longer discussion here in the comments section.

And, indeed, we received several responses. And I’ll also point you to my paper with Shalizi on the philosophy of Bayesian statistics, with discussions by Mark Andrews and Thom Baguley, Denny Borsboom and Brian Haig, John Kruschke, Deborah Mayo, Stephen Senn, and Richard D. Morey, Jan-Willem Romeijn and Jeffrey N. Rouder.

Also relevant is this summary by Mayo of some examples from her book.

And now on to the reviews.

Brian Haig

I’ll start with psychology researcher Brian Haig, because he’s a strong supporter of Mayo’s message and his review also serves as an introduction and summary of her ideas. The review itself is a few pages long, so I will quote from it, interspersing some of my own reaction:

Deborah Mayo’s ground-breaking book, Error and the growth of statistical knowledge (1996) . . . presented the first extensive formulation of her error-statistical perspective on statistical inference. Its novelty lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with evidence and inference.

By contrast, Mayo’s just-published book, Statistical inference as severe testing (SIST) (2018), focuses on problems arising from statistical practice (“the statistics wars”), but endeavors to solve them by probing their foundations from the vantage points of philosophy of science, and philosophy of statistics. The “statistics wars” to which Mayo refers concern fundamental debates about the nature and foundations of statistical inference. These wars are longstanding and recurring. Today, they fuel the ongoing concern many sciences have with replication failures, questionable research practices, and the demand for an improvement of research integrity. . . .

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, a package deal comprising effect sizes, confidence intervals, and meta-analysis, is one reform movement that has been heavily promoted in psychological circles (Cumming, 2012; 2014) as a much needed successor to null hypothesis significance testing (NHST) . . .

The new statisticians recommend replacing NHST with their favored statistical methods by asserting that it has several major flaws. Prominent among them are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. . . .

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking: As already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: A parameter estimate is either inside, or outside, its confidence interval.

At this point I’d like to interrupt and say that a confidence interval (or simply an estimate with standard error) can be used to give a sense of inferential uncertainty. There is no reason for dichotomous thinking when confidence intervals, or uncertainty intervals, or standard errors, are used in practice.

Here’s a very simple example from my book with Jennifer:

This graph has a bunch of estimates +/- standard errors, that is, 68% confidence intervals, with no dichotomous thinking in sight. In contrast, testing some hypothesis of no change over time, or no change during some period of time, would make no substantive sense and would just be an invitation to add noise to our interpretation of these data.
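To make the arithmetic behind "estimate +/- standard error" concrete, here is a minimal sketch in Python. It assumes an approximately normal sampling distribution, and the estimate (0.30) and standard error (0.05) are made-up numbers for illustration, not values read off the graph.

```python
from statistics import NormalDist

def interval(estimate, se, k=1.0):
    """Return (low, high, coverage) for estimate +/- k standard errors,
    assuming an approximately normal sampling distribution."""
    z = NormalDist()  # standard normal
    coverage = z.cdf(k) - z.cdf(-k)
    return estimate - k * se, estimate + k * se, coverage

# Hypothetical estimate and standard error, for illustration only.
for k in (1.0, 2.0):
    low, high, cov = interval(0.30, 0.05, k)
    print(f"+/- {k:g} se: [{low:.2f}, {high:.2f}], nominal coverage {cov:.1%}")
```

Under the normal assumption, one standard error on either side gives nominal coverage of about 68%, and two give about 95%. Nothing in the calculation requires declaring a parameter value "in" or "out"; the interval simply summarizes uncertainty.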

OK, to continue with Haig’s review:

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The standard account of confidence intervals adopted by the new statisticians prespecifies a single confidence interval (a strong preference for 0.95 in their case). . . . By contrast, the error-statistician draws inferences about each of the obtained values according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Crucially, the different values will not have the same probative force. . . . Details on the error-statistical conception of confidence intervals can be found in SIST (pp. 189-201), as well as Mayo and Spanos (2011) and Spanos (2014). . . .

SIST makes clear that, with its error-statistical perspective, statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science.

Another interruption from me . . . I just want to plug my paper with Guido Imbens, Why ask why? Forward causal inference and reverse causal questions, in which we argue that Why questions can be interpreted as model checks, or, one might say, hypothesis tests—but tests of hypotheses of interest, not of straw-man null hypotheses. Perhaps there’s some connection between Mayo’s ideas and those of Guido and me on this point.

Haig continues with a discussion of Bayesian methods, including those of my collaborators and myself:

One particularly important modern variant of Bayesian thinking, which receives attention in SIST, is the falsificationist Bayesianism of . . . Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. . . . Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to see how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. . . .

Hey, not quite! I’ve done a lot of collaboration with psychologists; see here and search on “Iven Van Mechelen” and “Francis Tuerlinckx”—but, sure, I recognize that our Bayesian methods, while mainstream in various fields including ecology and political science, are not yet widely used in psychology.

Haig concludes:

From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. . . . A heartening attitude that comes through in SIST is the firm belief that a philosophy of statistics is an important part of statistical thinking. This contrasts markedly with much of statistical theory, and most of statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves practice in useful ways.

I agree, very much.

To paraphrase Bill James, the alternative to good philosophy is not “no philosophy,” it’s “bad philosophy.” I’ve spent too much time seeing Bayesians avoid checking their models out of a philosophical conviction that subjective priors cannot be empirically questioned, and too much time seeing non-Bayesians produce ridiculous estimates that could have been avoided by using available outside information. There’s nothing so practical as good practice, but good philosophy can facilitate both the development and acceptance of better methods.

E. J. Wagenmakers

I’ll follow up with a very short review, or, should I say, reaction-in-place-of-a-review, from psychometrician E. J. Wagenmakers:

I cannot comment on the contents of this book, because doing so would require me to read it, and extensive prior knowledge suggests that I will violently disagree with almost every claim that is being made. Hence I will solely review the book’s title, and state my prediction that the “statistics wars” will not be over until the last Fisherian is strung up by the entrails of the last Neyman-Pearsonite, and all who remain have been happily assimilated by the Bayesian Borg. When exactly this event will transpire I don’t know, but I fear I shall not be around to witness it. In my opinion, the only long-term hope for vague concepts such as the “severity” of a test is to embed them within a rational (i.e., Bayesian) framework, but I suspect that this is not the route that the author wishes to pursue. Perhaps this book is comforting to those who have neither the time nor the desire to learn Bayesian inference, in a similar way that homeopathy provides comfort to patients with a serious medical condition.

You don’t have to agree with E. J. to appreciate his honesty!

Art Owen

Coming from a different perspective is theoretical statistician Art Owen, whose review has some mathematical formulas—nothing too complicated, but not so easy to display in html, so I’ll just link to the pdf and share some excerpts:

There is an emphasis throughout on the importance of severe testing. It has long been known that a test that fails to reject H0 is not very conclusive if it had low power to reject H0. So I wondered whether there was anything more to the severity idea than that. After some searching I found on page 343 a description of how the severity idea differs from the power notion. . . .

I think that it might be useful in explaining a failure to reject H0 as the sample size being too small. . . . it is extremely hard to measure power post hoc because there is too much uncertainty about the effect size. Then, even if you want it, you probably cannot reliably get it. I think severity is likely to be in the same boat. . . .

I believe that the statistical problem from incentives is more severe than choice between Bayesian and frequentist methods or problems with people not learning how to use either kind of method properly. . . . We usually teach and do research assuming a scientific loss function that rewards being right. . . . In practice many people using statistics are advocates. . . . The loss function strongly informs their analysis, be it Bayesian or frequentist. The scientist and advocate both want to minimize their expected loss. They are led to different methods. . . .

I appreciate Owen’s efforts to link Mayo’s words to the equations that we would ultimately need to implement, or evaluate, her ideas in statistics.
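As one possible way to put the power-versus-severity comparison into equations, here is a minimal Python sketch for a one-sided z-test of H0: mu <= mu0 with known sigma. The setup, the function names, and all the numbers are assumptions chosen for illustration. The severity formula used, P(Xbar > observed xbar; mu = mu1) for the claim mu <= mu1 after a non-rejection, is the form commonly given for this textbook case; it is not quoted from page 343 of SIST or from Owen's review.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power(mu1, mu0, sigma, n, alpha=0.025):
    """Probability that the one-sided z-test of H0: mu <= mu0 rejects
    when the true mean is mu1 (sigma known). Depends only on the design."""
    se = sigma / sqrt(n)
    z_crit = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z_crit - (mu1 - mu0) / se)

def severity_upper(xbar, mu1, sigma, n):
    """Assumed severity for the claim mu <= mu1 after observing xbar:
    P(Xbar > xbar; mu = mu1). Depends on the observed data."""
    se = sigma / sqrt(n)
    return 1 - Z.cdf((xbar - mu1) / se)

# Hypothetical numbers: mu0 = 0, sigma = 1, n = 25, observed mean 0.1
# (z = 0.5, so H0 is not rejected at alpha = 0.025).
for mu1 in (0.2, 0.4, 0.6):
    print(mu1, round(power(mu1, 0.0, 1.0, 25), 3),
          round(severity_upper(0.1, mu1, 1.0, 25), 3))
```

In this sketch, power is fixed before the data arrive, while the severity value changes with the observed mean. Both are evaluated at a hypothesized mu1 rather than at an estimated effect size, so the sketch does not by itself settle Owen's concern about post hoc assessments.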

Robert Cousins

Physicist Robert Cousins did not have the time to write a comment on Mayo’s book, but he did point us to this monograph he wrote on the foundations of statistics, which has lots of interesting stuff but is unfortunately a bit out of date when it comes to the philosophy of Bayesian statistics, which he ties in with subjective probability. (For a corrective, see my aforementioned article with Hennig.)

In his email to me, Cousins also addressed issues of statistical and practical significance:

Our [particle physicists’] problems and the way we approach them are quite different from some other fields of science, especially social science. As one example, I think I recall reading that you do not mind adding a parameter to your model, whereas adding (certain) parameters to our models means adding a new force of nature (!) and a Nobel Prize if true. As another example, a number of statistics papers talk about how silly it is to claim a 10^{-4} departure from 0.5 for a binomial parameter (ESP examples, etc), using it as a classic example of the difference between nominal (probably mismeasured) statistical significance and practical significance. In contrast, when I was a grad student, a famous experiment in our field measured a 10^{-4} departure from 0.5 with an uncertainty of 10% of itself, i.e., with an uncertainty of 10^{-5}. (Yes, on the order of 10^{10} Bernoulli trials, counting electrons being scattered left or right.) This led quickly to a Nobel Prize for Steven Weinberg et al., whose model (now “Standard”) had predicted the effect.

I replied:

This interests me in part because I am a former physicist myself. I have done work in physics and in statistics, and I think the principles of statistics that I have applied to social science also apply to physical sciences. Regarding the discussion of Bem’s experiment, what I said was not that an effect of 0.0001 is unimportant, but rather that if you were to really believe Bem’s claims, there could be effects of +0.0001 in some settings, -0.002 in others, etc. If this is interesting, fine: I’m not a psychologist. One of the key mistakes of Bem and others like him is to suppose that, if they happen to have discovered an effect in some scenario, it represents some sort of universal truth; there is no reason to suppose this. Humans differ from each other in a way that elementary particles do not.

And Cousins replied:

Indeed in the binomial experiment I mentioned, controlling unknown systematic effects to the level of 10^{-5}, so that what they were measuring (a constant of nature called the Weinberg angle, now called the weak mixing angle) was what they intended to measure, was a heroic effort by the experimentalists.
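The sample sizes in this exchange can be checked with the usual binomial standard error, sqrt(p(1-p)/n). The following Python sketch covers pure sampling error only; it says nothing about the systematic effects that Cousins describes as the hard part, and the helper names are made up for illustration.

```python
from math import sqrt

def binomial_se(p, n):
    """Standard error of a sample proportion from n Bernoulli trials."""
    return sqrt(p * (1 - p) / n)

def trials_needed(p, target_se):
    """Number of trials at which the proportion's standard error equals target_se."""
    return p * (1 - p) / target_se ** 2

print(binomial_se(0.5, 1e10))    # sampling error with 10^10 trials
print(trials_needed(0.5, 1e-5))  # trials needed for a standard error of 10^-5
```

With p near 0.5, a standard error of 10^{-5} takes a few billion trials, and 10^{10} trials give a standard error of about 5 x 10^{-6}, consistent with the "order of" figure in the quoted email.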

Stan Young

Stan Young, a statistician who’s worked in the pharmaceutical industry, wrote:

I’ve been reading at the Mayo book and also pestering where I think poor statistical practice is going on. Usually the poor practice is by non-professionals and usually it is not intentionally malicious however self-serving. But I think it naive to think that education is all that is needed. Or some grand agreement among professional statisticians will end the problems.

There are science crooks and statistical crooks and there are no cops, or very few.

That is a long way of saying, this problem is not going to be solved in 30 days, or by one paper, or even by one book or by three books! (I’ve read all three.)

I think a more open-ended and longer dialog would be more useful with at least some attention to willful and intentional misuse of statistics.

Chambers C. The Seven Deadly Sins of Psychology. New Jersey: Princeton University Press, 2017.

Harris R. Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions. New York: Basic Books, 2017.

Hubbard R. Corrupt Research. London: Sage Publications, 2015.

Christian Hennig

Hennig, a statistician and my collaborator on the Beyond Subjective and Objective paper, sent in two reviews of Mayo’s book.

Here are his general comments:

What I like about Deborah Mayo’s “Statistical Inference as Severe Testing”

Before I start to list what I like about “Statistical Inference as Severe Testing”, I should say that I don’t agree with everything in the book. In particular, as a constructivist I am skeptical about the use of terms like “objectivity”, “reality” and “truth” in the book, and I think that Mayo’s own approach may not be able to deliver everything that people may come to believe it could, from reading the book (although Mayo could argue that overly high expectations could be avoided by reading carefully).

So now, what do I like about it?

1) I agree with the broad concept of severity and severe testing. In order to have evidence for a claim, it has to be tested in ways that would reject the claim with high probability if it indeed were false. I also think that it makes a lot of sense to start a philosophy of statistics and a critical discussion of statistical methods and reasoning from this requirement. Furthermore, throughout the book Mayo consistently argues from this position, which makes the different “Excursions” fit well together and add up to a consistent whole.

2) I get a lot out of the discussion of the philosophical background of scientific inquiry, of induction, probabilism, falsification and corroboration, and their connection to statistical inference. I think that it makes sense to connect Popper’s philosophy to significance tests in the way Mayo does (without necessarily claiming that this is the only possible way to do it), and I think that her arguments are broadly convincing at least if I take a realist perspective of science (which as a constructivist I can do temporarily while keeping the general reservation that this is about a specific construction of reality which I wouldn’t grant absolute authority).

3) I think that Mayo does by and large a good job listing much of the criticism that has been raised in the literature against significance testing, and she deals with it well. Partly she criticises bad uses of significance testing herself by referring to the severity requirement, but she also defends a well understood use in a more general philosophical framework of testing scientific theories and claims in a piecemeal manner. I find this largely convincing, conceding that there is a lot of detail and that I may find myself in agreement with the occasional objection against the odd one of her arguments.

4) The same holds for her comprehensive discussion of Bayesian/probabilist foundations in Excursion 6. I think that she elaborates issues and inconsistencies in the current use of Bayesian reasoning very well, maybe with the odd exception.

5) I am in full agreement with Mayo’s position that when using probability modelling, it is important to be clear about the meaning of the computed probabilities. Agreement in numbers between different “camps” isn’t worth anything if the numbers mean different things. A problem with some positions that are sold as “pragmatic” these days is that often not enough care is put into interpreting what the results mean, or even deciding in advance what kind of interpretation is desired.

6) As mentioned above, I’m rather skeptical about the concept of objectivity and about an all too realist interpretation of statistical models. I think that in Excursion 4 Mayo manages to explain in a clear manner what her claims of “objectivity” actually mean, and she also appreciates more clearly than before the limits of formal models and their distance to “reality”, including some valuable thoughts on what this means for model checking and arguments from models.

So overall it was a very good experience to read her book, and I think that it is a very valuable addition to the literature on foundations of statistics.

Hennig also sent some specific discussion of one part of the book:

1 Introduction

This text discusses parts of Excursion 4 of Mayo (2018) titled “Objectivity and Auditing”. This starts with the section title “The myth of ‘The myth of objectivity’”. Mayo advertises objectivity in science as central and as achievable.

In contrast, in Gelman and Hennig (2017) we write: “We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes.” I will here outline agreement and disagreement that I have with Mayo’s Excursion 4, and raise some issues that I think require more research and discussion.

2 Pushback and objectivity

The second paragraph of Excursion 4 states in bold letters: “The Key Is Getting Pushback”, and this is the major source of agreement between Mayo’s and my views (*). I call myself a constructivist, and this is about acknowledging the impact of human perception, action, and communication on our world-views, see Hennig (2010). However, it is an almost universal experience that we cannot construct our perceived reality as we wish, because we experience “pushback” from what we perceive as “the world outside”. Science is about allowing us to deal with this pushback in stable ways that are open to consensus. A major ingredient of such science is the “Correspondence (of scientific claims) to observable reality”, and in particular “Clear conditions for reproduction, testing and falsification”, listed as “Virtue 4/4(b)” in Gelman and Hennig (2017). Consequently, there is no disagreement with much of the views and arguments in Excursion 4 (and the rest of the book). I actually believe that there is no contradiction between constructivism understood in this way and Chang’s (2012) “active scientific realism” that asks for action in order to find out about “resistance from reality”, or in other words, experimenting, experiencing and learning from error.

If what is called “objectivity” in Mayo’s book were the generally agreed meaning of the term, I would probably not have a problem with it. However, there is a plethora of meanings of “objectivity” around, and on top of that the term is often used as a sales pitch by scientists in order to lend authority to findings or methods and often even to prevent them from being questioned. Philosophers understand that this is a problem but are mostly eager to claim the term anyway; I have attended conferences on philosophy of science and heard a good number of talks, some better, some worse, with messages of the kind “objectivity as understood by XYZ doesn’t work, but here is my own interpretation that fixes it”. Calling frequentist probabilities “objective” because they refer to the outside world rather than epistemic states, and calling a Bayesian approach “objective” because priors are chosen by general principles rather than personal beliefs are in isolation also legitimate meanings of “objectivity”, but these two and Mayo’s and many others (see also the Appendix of Gelman and Hennig, 2017) differ. The use of “objectivity” in public and scientific discourse is a big muddle, and I don’t think this will change as a consequence of Mayo’s work. I prefer stating what we want to achieve more precisely using less loaded terms, which I think Mayo has achieved well not by calling her approach “objective” but rather by explaining in detail what she means by that.

3 Trust in models?

In the remainder, I will highlight some limitations of Mayo’s “objectivity” that are mainly connected to Tour IV on objectivity, model checking and whether it makes sense to say that “all models are false”. Error control is central for Mayo’s objectivity, and this relies on error probabilities derived from probability models. If we want to rely on these error probabilities, we need to trust the models, and, very appropriately, Mayo devotes Tour IV to this issue. She concedes that all models are false, but states that this is rather trivial, and what is really relevant when we use statistical models for learning from data is rather whether the models are adequate for the problem we want to solve. Furthermore, model assumptions can be tested and it is crucial to do so, which, as follows from what was stated before, does not mean to test whether they are really true but rather whether they are violated in ways that would destroy the adequacy of the model for the problem. So far I can agree. However, I see some difficulties that are not addressed in the book, and mostly not elsewhere either. Here is a list.

3.1. Adaptation of model checking to the problem of interest

As all models are false, it is not too difficult to find model assumptions that are violated but don’t matter, or at least don’t matter in most situations. The standard example would be the use of continuous distributions to approximate distributions of essentially discrete measurements. What does it mean to say that a violation of a model assumption doesn’t matter? This is not so easy to specify, and not much about this can be found in Mayo’s book or in the general literature. Surely it has to depend on what exactly the problem of interest is. A simple example would be to say that we are interested in statements about the mean of a discrete distribution, and then to show that estimation or tests of the mean are very little affected if a certain continuous approximation is used. This is reassuring, and certain other issues could be dealt with in this way, but one can ask harder questions. If we approximate a slightly skew distribution by a (unimodal) symmetric one, are we really interested in the mean, the median, or the mode, which for a symmetric distribution would be the same but for the skew distribution to be approximated would differ? Any frequentist distribution is an idealisation, so do we first need to show that it is fine to approximate a discrete non-distribution by a discrete distribution before worrying whether the discrete distribution can be approximated by a continuous one? (And how could we show that?) And so on.
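The two cases in this paragraph can be illustrated with a small simulation. This is a toy sketch in Python under assumed distributions (a normal for the "continuous" measurements, an exponential for the skewed case); the seed, sample sizes, and parameters are arbitrary choices for illustration and do not come from the book or the review.

```python
import random
from statistics import mean, median

random.seed(1)

# Hypothetical "continuous" measurements and their rounded (discrete) versions.
x = [random.gauss(50.0, 10.0) for _ in range(10_000)]
x_rounded = [round(v) for v in x]
print("mean, continuous vs rounded:", mean(x), mean(x_rounded))

# A right-skewed example: for the exponential distribution with rate 1,
# the mean is 1 and the median is ln(2), about 0.69, so the two targets differ.
y = [random.expovariate(1.0) for _ in range(10_000)]
print("skewed sample, mean vs median:", mean(y), median(y))
```

Rounding barely moves the estimated mean, so for that question the continuous approximation is harmless. In the skewed case, the mean and the median are different quantities, so a symmetric approximation forces a choice about which one the analysis is really after.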

3.2. Severity of model misspecification tests

Following the logic of Mayo (2018), misspecification tests need to be severe in order to fulfill their purpose; otherwise data could pass a misspecification test that would be of little help ruling out problematic model deviations. I’m not sure whether there are any results of this kind, be it in Mayo’s work or elsewhere. I imagine that if the alternative is parametric (for example testing independence against a standard time series model) severity can occasionally be computed easily, but for most model misspecification tests it will be a hard problem.

3.3. Identifiability issues, and ruling out models by other means than testing

Not all statistical models can be distinguished by data. For example, even with arbitrarily large amounts of data only lower bounds of the number of modes can be estimated; an assumption of unimodality can strictly not be tested (Donoho 1988). Worse, only regular but not general patterns of dependence can be distinguished from independence by data; any non-i.i.d. pattern can be explained by either dependence or non-identity of distributions, and telling these apart requires constraints on dependence and non-identity structures that cannot themselves be tested on the data (in the example given in 4.11 of Mayo, 2018, all tests discover specific regular alternatives to the model assumption). Given that this is so, the question arises on which grounds we can rule out irregular patterns (about the simplest and most silly one is “observations depend in such a way that every observation determines the next one to be exactly what it was observed to be”) by other means than data inspection and testing. Such models are probably useless; however, if they were true, they would destroy any attempt to find “true” or even approximately true error probabilities.

3.4. Robustness against what cannot be ruled out

The above implies that certain deviations from the model assumptions cannot be ruled out, and then one can ask: How robust is the substantial conclusion that is drawn from the data against models different from the nominal one, which could not be ruled out by misspecification testing, and how robust are error probabilities? The approaches of standard robust statistics probably have something to contribute in this respect (e.g., Hampel et al., 1986), although their starting point is usually different from “what is left after misspecification testing”. This will depend, as everything, on the formulation of the “problem of interest”, which needs to be defined not only in terms of the nominal parametric model but also in terms of the other models that could not be ruled out.

3.5. The effect of preliminary model checking on model-based inference

Mayo is correctly concerned about biasing effects of model selection on inference. Deciding what model to use based on misspecification tests is some kind of model selection, so it may bias inference that is made in case of passing misspecification tests. One way of stating the problem is to realise that in most cases the assumed model, conditionally on having passed a misspecification test, no longer holds. I have called this the “goodness-of-fit paradox” (Hennig, 2007); the issue has been mentioned elsewhere in the literature. Mayo has argued that this is not a problem, and this is in a well defined sense true (meaning that error probabilities derived from the nominal model are not affected by conditioning on passing a misspecification test) if misspecification tests are indeed “independent of (or orthogonal to) the primary question at hand” (Mayo 2018, p. 319). The problem is that for the vast majority of misspecification tests independence/orthogonality does not hold, at least not precisely. So the actual effect of misspecification testing on model-based inference is a matter that needs to be investigated on a case-by-case basis. Some work of this kind has been done or is currently being done; results are not always positive (an early example is Easterling and Anderson 1978).
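The basic mechanism can be seen in a toy simulation, sketched here in Python. The data really are i.i.d. normal, and the "misspecification test" is a hypothetical stand-in: a sample passes if its absolute sample skewness is below 0.5. The threshold, sample size, and seed are arbitrary assumptions, and this is not the construction used in Hennig (2007).

```python
import random

def sample_skewness(xs):
    """Moment-based sample skewness."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((v - m) ** 2 for v in xs) / n
    m3 = sum((v - m) ** 3 for v in xs) / n
    return m3 / m2 ** 1.5

random.seed(2)
n, reps, threshold = 20, 5000, 0.5

# Skewness of many samples drawn from the nominal model, N(0, 1).
skews = [sample_skewness([random.gauss(0.0, 1.0) for _ in range(n)])
         for _ in range(reps)]

# Keep only the samples that pass the hypothetical check.
passed = [s for s in skews if abs(s) < threshold]
frac_pass = len(passed) / reps

print("fraction passing:", frac_pass)
print("largest |skewness|, all samples:", max(abs(s) for s in skews))
print("largest |skewness|, passed only:", max(abs(s) for s in passed))
```

Under the nominal model a noticeable share of samples has skewness beyond the threshold, but among the samples that pass, none does. The passing samples therefore do not follow the nominal model's distribution. Whether that distortion matters for a given inference depends on how strongly the check is related to the statistic of primary interest, which this sketch does not examine.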

4 Conclusion

The issues listed in Section 3 are in my view important and worthy of investigation. Such investigation has already been done to some extent, but there are many open problems. I believe that some of these can be solved, some are very hard, and some are impossible to solve or may lead to negative results (particularly connected to lack of identifiability). However, I don’t think that these issues invalidate Mayo’s approach and arguments; I expect at least the issues that cannot be solved to affect in one way or another any alternative approach. My case is just that methodology that is “objective” according to Mayo comes with limitations that may be incompatible with some other people’s ideas of what “objectivity” should mean (in which sense it is in good company though), and that the falsity of models has some more cumbersome implications than Mayo’s book could make the reader believe.

(*) There is surely a strong connection between what I call “my” view here with the collaborative position in Gelman and Hennig (2017), but as I write the present text on my own, I will refer to “my” position here and let Andrew Gelman speak for himself.

Chang, H. (2012) Is Water H2O? Evidence, Realism and Pluralism. Dordrecht: Springer.

Donoho, D. (1988) One-Sided Inference about Functionals of a Density. Annals of Statistics 16, 1390-1420.

Easterling, R. G. and Anderson, H.E. (1978) The effect of preliminary normality goodness of fit tests on subsequent inference. Journal of Statistical Computation and Simulation 8, 1-11.

Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society, Series A 180, 967–1033.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust statistics. New York: Wiley.

Hennig, C. (2010) Mathematical models and reality: a constructivist perspective. Foundations of Science 15, 29–48.

Hennig, C. (2007) Falsification of propensity models by statistical tests and the goodness-of-fit paradox. Philosophia Mathematica 15, 166–192.

Mayo, D. G. (2018) Statistical Inference as Severe Testing. Cambridge University Press.

My own reactions

I’m still struggling with the key ideas of Mayo’s book. (Struggling is a good thing here, I think!)

First off, I appreciate that Mayo takes my own philosophical perspective seriously—I’m actually thrilled to be taken seriously, after years of dealing with a professional Bayesian establishment tied to naive (as I see it) philosophies of subjective or objective probabilities, and anti-Bayesians not willing to think seriously about these issues at all—and I don’t think any of these philosophical issues are going to be resolved any time soon. I say this because I’m so aware of the big Cantor-size hole in the corner of my own philosophy of statistical learning.

In statistics—maybe in science more generally—philosophical paradoxes are sometimes resolved by technological advances. Back when I was a student I remember all sorts of agonizing over the philosophical implications of exchangeability, but now that we can routinely fit varying-intercept, varying-slope models with nested and non-nested levels and (we’ve finally realized the importance of) informative priors on hierarchical variance parameters, a lot of the philosophical problems have dissolved; they’ve become surmountable technical problems. (For example: should we consider a group of schools, or states, or hospitals, as “truly exchangeable”? If not, there’s information distinguishing them, and we can include such information as group-level predictors in our multilevel model. Problem solved.)

Rapid technological progress resolves many problems in ways that were never anticipated. (Progress creates new problems too; that’s another story.) I’m not such an expert on deep learning and related methods for inference and prediction—but, again, I think these will change our perspective on statistical philosophy in various ways.

This is all to say that any philosophical perspective is time-bound. On the other hand, I don’t think that Popper/Kuhn/Lakatos will ever be forgotten: this particular trinity of twentieth-century philosophy of science has forever left us in a different place than where we were, a hundred years ago.

To return to Mayo’s larger message: I agree with Hennig that Mayo is correct to place evaluation at the center of statistics.

I’ve thought a lot about this, in many years of teaching statistics to graduate students. In a class for first-year statistics Ph.D. students, you want to get down to the fundamentals.

What’s the most fundamental thing in statistics? Experimental design? No. You can’t really pick your design until you have some sense of how you will analyze the data. (This is the principle of the great Raymond Smullyan: To understand the past, we must first know the future.) So is data analysis the most fundamental thing? Maybe so, but what method of data analysis? Last I heard, there are many schools. Bayesian data analysis, perhaps? Not so clear; what’s the motivation for modeling everything probabilistically? Sure, it’s coherent—but so is some mental patient who thinks he’s Napoleon and acts daily according to that belief. We can back into a more fundamental, or statistical, justification of Bayesian inference and hierarchical modeling by first considering the principle of external validation of predictions, then showing (both empirically and theoretically) that a hierarchical Bayesian approach performs well based on this criterion—and then following up with the Jaynesian point that, when Bayesian inference fails to perform well, this recognition represents additional information that can and should be added to the model. All of this is the theme of the example in section 7 of BDA3—although I have the horrible feeling that students often don’t get the point, as it’s easy to get lost in all the technical details of the inference for the hyperparameters in the model.

Anyway, to continue . . . it still seems to me that the most foundational principles of statistics are frequentist. Not unbiasedness, not p-values, and not type 1 or type 2 errors, but frequency properties nevertheless. Statements about how well your procedure will perform in the future, conditional on some assumptions of stationarity and exchangeability (analogous to the assumption in physics that the laws of nature will be the same in the future as they’ve been in the past—or, if the laws of nature are changing, that they’re not changing very fast! We’re in Cantor’s corner again).

So, I want to separate the principle of frequency evaluation—the idea that frequency evaluation and criticism represents one of the three foundational principles of statistics (with the other two being mathematical modeling and the understanding of variation)—from specific statistical methods, whether they be methods that I like (Bayesian inference, estimates and standard errors, Fourier analysis, lasso, deep learning, etc.) or methods that I suspect have done more harm than good or, at the very least, have been taken too far (hypothesis tests, p-values, so-called exact tests, so-called inverse probability weighting, etc.). We can be frequentists, use mathematical models to solve problems in statistical design and data analysis, and engage in model criticism, without making decisions based on type 1 error probabilities etc.

To say it another way, bringing in the title of the book under discussion: I would not quite say that statistical inference is severe testing, but I do think that severe testing is a crucial part of statistics. I see statistics as an unstable mixture of inference conditional on a model (“normal science”) and model checking (“scientific revolution”). Severe testing is fundamental, in that prospect of revolution is a key contributor to the success of normal science. We lean on our models in large part because they have been, and will continue to be, put to the test. And we choose our statistical methods in large part because, under certain assumptions, they have good frequency properties.

And now on to Mayo’s subtitle. I don’t think her, or my, philosophical perspective will get us “beyond the statistics wars” by itself—but perhaps it will ultimately move us in this direction, if practitioners and theorists alike can move beyond naive confirmationist reasoning toward an embrace of variation and acceptance of uncertainty.

I’ll summarize by expressing agreement with Mayo’s perspective that frequency evaluation is fundamental, while disagreeing with her focus on various crude (from my perspective) ideas such as type 1 errors and p-values. When it comes to statistical philosophy, I’d rather follow Laplace, Jaynes, and Box than Neyman, Wald, and Savage. Phony Bayesmania has bitten the dust.


Let me again thank Haig, Wagenmakers, Owen, Cousins, Young, and Hennig for their discussions. I expect that Mayo will respond to these, and also to any comments that follow in this thread, once she has time to digest it all.

What sort of identification do you get from panel data if effects are long-term? Air pollution and cognition example.

Don MacLeod writes:

Perhaps you know this study which is being taken at face value in all the secondary reports: “Air pollution causes ‘huge’ reduction in intelligence, study reveals.” It’s surely alarming, but the reported effect of air pollution seems implausibly large, so it’s hard to be convinced of it by a correlational study alone, when we can suspect instead that the smarter, more educated folks are more likely to be found in unpolluted conditions for other reasons. They did try to allow for the usual covariates, but there is the usual problem that you never know whether you’ve done enough of that.

Assuming equal statistical support, I suppose the larger an effect, the less likely it is to be due to uncontrolled covariates. But also the larger the effect, the more reasonable it is to demand strongly convincing evidence before accepting it.

From the above-linked news article:

“Polluted air can cause everyone to reduce their level of education by one year, which is huge,” said Xi Chen at Yale School of Public Health in the US, a member of the research team. . . .

The new work, published in the journal Proceedings of the National Academy of Sciences, analysed language and arithmetic tests conducted as part of the China Family Panel Studies on 20,000 people across the nation between 2010 and 2014. The scientists compared the test results with records of nitrogen dioxide and sulphur dioxide pollution.

They found the longer people were exposed to dirty air, the bigger the damage to intelligence, with language ability more harmed than mathematical ability and men more harmed than women. The researchers said this may result from differences in how male and female brains work.

The above claims are indeed bold, but the researchers seem pretty careful:

The study followed the same individuals as air pollution varied from one year to the next, meaning that many other possible causal factors such as genetic differences are automatically accounted for.

The scientists also accounted for the gradual decline in cognition seen as people age and ruled out people being more impatient or uncooperative during tests when pollution was high.

Following the same individuals through the study: that makes a lot of sense.

I hadn’t heard of this study when it came out so I followed the link and read it now.

You can model the effects of air pollution as short-term or long-term. An example of a short-term effect is that air pollution makes it harder to breathe, you get less oxygen in your brain, etc., or maybe you’re just distracted by the discomfort and can’t think so well. An example of a long-term effect is that air pollution damages your brain or other parts of your body in various ways that impact your cognition.

The model includes air pollution levels on the day of measurement and on the past few days or months or years, and also a quadratic monthly time trend from Jan 2010 to Dec 2014. A quadratic time trend, that seems weird, kinda worrying. Are people’s test scores going up and down in that way?

In any case, their regression finds that air pollution levels from the past months or years are a strong predictor of the cognitive test outcome, and today’s air pollution doesn’t add much predictive power after including the historical pollution level.

Some minor things:

Measurement of cognitive performance:

The waves 2010 and 2014 contain the same cognitive ability module, that is, 24 standardized mathematics questions and 34 word-recognition questions. All of these questions are sorted in ascending order of difficulty, and the final test score is defined as the rank of the hardest question that a respondent is able to answer correctly.

Huh? Are you serious? Wouldn’t it be better to use the number of questions answered correctly? Even better would be to fit a simple item-response model, but I’d guess that #correct would capture almost all the relevant information in the data. But to just use the rank of the hardest question answered correctly: that seems inefficient, no?
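To get a feel for how much the scoring rule might matter, here is a minimal simulation sketch. It is not the authors' analysis: it assumes a simple Rasch-type item-response model, made-up item difficulties, and standard normal abilities, and it just compares how well each score tracks the simulated ability.

```python
import math
import random

def simulate_scores(n_people=5000, n_items=24, seed=1):
    """Simulate Rasch-type responses; return (abilities, n_correct, hardest_rank)."""
    rng = random.Random(seed)
    # Hypothetical difficulties, evenly spaced and sorted from easy to hard.
    difficulties = [-3 + 6 * j / (n_items - 1) for j in range(n_items)]
    abilities, n_correct, hardest_rank = [], [], []
    for _ in range(n_people):
        theta = rng.gauss(0, 1)
        answers = [rng.random() < 1 / (1 + math.exp(-(theta - b)))
                   for b in difficulties]
        abilities.append(theta)
        n_correct.append(sum(answers))
        # 1-based rank of the hardest item answered correctly; 0 if none.
        hardest_rank.append(
            max((j + 1 for j, ok in enumerate(answers) if ok), default=0))
    return abilities, n_correct, hardest_rank

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

abilities, n_correct, hardest_rank = simulate_scores()
print("corr(ability, number correct):", round(pearson(abilities, n_correct), 2))
print("corr(ability, hardest rank):  ", round(pearson(abilities, hardest_rank), 2))
```

In this fake-data world a single lucky answer on a hard item can move the hardest-rank score a lot, which is the inefficiency I'm worried about. How big the loss is in the real survey would depend on the actual items and respondents.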

Comparison between the sexes:

The authors claim that air pollution has a larger effect on men than on women (see above quote from the news article). But I suspect this is yet another example of The difference between “significant” and “not significant” is not itself statistically significant. It’s hard to tell. For example, there’s this graph:

The plot on the left shows a lot of consistency across age groups. Too much consistency, I think. I’m guessing that there’s something in the model keeping these estimates similar to each other, i.e. I don’t think they’re five independent results.
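On the comparison between the sexes, the relevant calculation is the standard error of the difference, not the two separate significance statements. Here is a short sketch with hypothetical numbers (they are not taken from the paper), assuming the two estimates are independent:

```python
import math

def compare_estimates(est_a, se_a, est_b, se_b):
    """Return (difference, standard error, z-score) for two independent estimates."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, diff / se_diff

# Hypothetical estimates: one "significant" (25 +/- 10), one not (10 +/- 10).
diff, se_diff, z = compare_estimates(25, 10, 10, 10)
print(f"difference = {diff}, se = {se_diff:.1f}, z = {z:.2f}")
```

With these made-up inputs the first estimate is 2.5 standard errors from zero and the second is only 1, yet the difference between them is about one standard error from zero. If the two estimates share data or model structure, the independence assumption fails and the standard error of the difference would need the covariance term too.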

The authors write:

People may become more impatient or uncooperative when exposed to more polluted air. Therefore, it is possible that the observed negative effect on cognitive performance is due to behavioral change rather than impaired cognition. . . . Changes in the brain chemistry or composition are likely more plausible channels between air pollution and cognition.

I think they’re missing the point here and engaging in a bit of “scientism” or “mind-body dualism” in the following way: Suppose that air pollution irritates people, making it hard for people to concentrate on cognitive tasks. That is a form of impaired cognition. Just cos it’s “behavioral,” doesn’t make it not real.

In any case, putting this all together, what can we say? This seems like a serious analysis, and to start with the authors should make all their data and code available so that others can try fitting their own models. This is an important problem, so it’s good to have as many eyes on the data as possible.

In this particular example, it seems that the key information is coming from:

– People who moved from one place to another, either moving from a high-pollution to a low-pollution area or vice-versa, and then you can see if their test scores went correspondingly up or down. After adjusting for expected cognitive decline by age during this period.

– People who lived in the same place but where there was a negative or positive trend in pollution. Again you can see if these people’s test scores went up or down. Again, after adjusting for expected cognitive decline by age during this period.

– People who didn’t move, comparing these people who lived all along in high- or low-pollution areas, and seeing who had higher test scores. After adjusting for demographic differences between people living in these different cities.

This leaves me with two thoughts:

First, I’d like to see the analyses in these three different groups. One big regression is fine, but in this sort of problem I think it’s important to understand the path from data to conclusions. This is especially an issue given that we might see different results from the three different comparisons listed above.

Second, I am concerned with some incoherence regarding how the effect works. The story in the paper, supported by the regression analysis, seems to be that what matters is long-term exposure. But, if so, I don’t see how the short-term longitudinal analysis in this paper is getting us to that. If effects of air pollution on cognition are long-term, then really this is all a big cross-sectional analysis, which brings up the usual issues of unobserved confounders, selection bias, etc., and the multiple measurements on each person are not really giving us identification at all.

What is the most important real-world data processing tip you’d like to share with others?

This question was in today’s jitts for our communication class. Here are some responses:

Invest the time to learn data manipulation tools well (e.g. tidyverse). Increased familiarity with these tools often leads to greater time savings and less frustration in future.

Hmm it’s never one tip.. I never ever found it useful to begin writing code especially on a greenfield project unless I thought of the steps to the goal. I often still write the code in outline form first and edit before entering in programming steps. Some other tips.
1. Choose the right tool for the right job. Don’t use C++ if you’re going to design a web site.
2. Document code well but don’t overdo it, and leave some unit tests or assertions inside a commented field.
3. Testing code will always show the presence of bugs not their absence (Dijkstra) but that doesn’t mean you should be a slacker.
4. Keep it simple at first, you may have to rewrite the program several times if it’s something new so don’t optimize until you’re satisfied. Finally, If you can control the L1 cache, you can control the world (Sabini).

Just try stuff. Nothing works the first time and you’ll have to throw out your meticulous plan once you actually start working. You’ll find all the hiccups and issues with your data the more time you actually spend in it.

Consider the sampling procedure and the methods (specifics of the questionnaire etc.) of data collection for “real-world” data to avoid any serious biases or flaws.

Quadruple-check your group by statements and joins!!
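One way to act on the join advice is to check key uniqueness before joining and row counts after. This is a minimal sketch using plain lists of dicts; the table and column names are hypothetical, and the same idea carries over to whatever data tool you use.

```python
from collections import Counter

def duplicated_keys(rows, key):
    """Return the key values that appear more than once in a list of dicts."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def left_join(left, right, key):
    """Left-join two lists of dicts; refuse to run if the right-hand key repeats."""
    dupes = duplicated_keys(right, key)
    if dupes:
        raise ValueError(f"right table has duplicated keys: {dupes}")
    lookup = {row[key]: row for row in right}
    joined = [{**row, **lookup.get(row[key], {})} for row in left]
    # A left join on a unique key must keep the row count unchanged.
    assert len(joined) == len(left)
    return joined

# Hypothetical tables for illustration.
people = [{"city": "A", "score": 10}, {"city": "B", "score": 12}]
cities = [{"city": "A", "pop": 100}, {"city": "B", "pop": 200},
          {"city": "B", "pop": 250}]
print(duplicated_keys(cities, "city"))
```

Here the duplicated "B" row is caught before it can silently change what the join returns.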

Cleaning data properly is essential.

Write a script to analyze the data. Don’t do anything “manually”.

Don’t be afraid to confer with others. Even though there’s often an expectation that we all be experts in all things data processing, the fact is that we all have different strengths and weaknesses and it’s always a good idea to benefit from others’ expertise.

For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you run a survey for your research, there will always be that guy who types “b” instead of “B” or “B “ (pushing the computer’s Tab). For this reason, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.
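The subset-and-control-totals advice could look something like the sketch below. The processing step and field names are hypothetical stand-ins; the point is only the pattern of running on a small slice, writing to a new object, and comparing totals before and after.

```python
def convert_units(rows):
    """Hypothetical processing step: convert amounts from cents to dollars."""
    return [{**row, "amount": row["amount"] / 100} for row in rows]

def check_control_totals(before, after, factor):
    """Verify that the row count and the rescaled total survive the step."""
    assert len(before) == len(after), "row count changed"
    total_before = sum(r["amount"] for r in before)
    total_after = sum(r["amount"] for r in after)
    assert abs(total_before * factor - total_after) < 1e-9, "control total changed"

data = [{"id": i, "amount": 100 * i} for i in range(1, 1001)]
subset = data[:20]                      # try the step on a small slice first
processed = convert_units(subset)       # new object; the input is not overwritten
check_control_totals(subset, processed, factor=1 / 100)
print(processed[:3])
```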

While converting data types, for example, extracting integers or converting to dates, always check the agreement between the data before and after conversion. Sometimes when I was converting levels to integers (numerical values somehow are recorded as categorical because of the existence of NA), there are errors and the results are not what I expected (e.g. convert “3712” to “1672”).
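The response above doesn't say which software was involved, but the symptom resembles converting a categorical variable's internal codes rather than its labels (in R, for instance, calling as.integer() on a factor returns the level codes). Here is a small sketch that mimics that mechanism and adds a round-trip check; the values and helper names are hypothetical.

```python
def codes_for(labels):
    """Mimic categorical storage: 1-based position of each label among sorted levels."""
    levels = sorted(set(labels))
    return [levels.index(x) + 1 for x in labels], levels

def safe_to_int(labels):
    """Convert text labels to integers and verify that every value round-trips."""
    values = [int(x) for x in labels]
    mismatches = [(x, v) for x, v in zip(labels, values) if str(v) != x.strip()]
    if mismatches:
        raise ValueError(f"conversion changed values: {mismatches}")
    return values

raw = ["3712", "15", "208", "3712"]     # hypothetical numbers stored as text
codes, levels = codes_for(raw)
print(codes)                            # level codes, not the numbers themselves
print(safe_to_int(raw))                 # the numbers, checked against the input
```

Comparing the converted column against the original text, as safe_to_int does, is the "check the agreement before and after" step in code form.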

Learn dplyr.

Organisation of files and ideas is vital – constantly leave reminders of what you were doing and why you made particular choices either within the file names (indicating perhaps the date in which the code or data was updated) or within comments throughout the code that explain why you made certain decisions.

Thanks, kids!

“How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions” . . . and still stays around even after it’s been retracted

Chuck Jackson points to two items of possible interest:

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, by Richard Harris. Review here by Leonard Freedman.

Retractions do not work very well, by Ken Cor and Gaurav Sood. This post by Tyler Cowen brought this paper to my attention.

Here’s a quote from Freedman’s review:

Harris shows both sides of the reproducibility debate, noting that many eminent members of the research establishment would like to see this new practice of airing the scientific community’s dirty laundry quietly disappear. He describes how, for example, in the aftermath of their 2012 paper demonstrating that only 6 of 53 landmark studies in cancer biology could be reproduced, Glenn Begley and Lee Ellis were immediately attacked by some in the biomedical research aristocracy for their “naïveté,” their “lack of competence” and their “disservice” to the scientific community.

“The biomedical research aristocracy” . . . I like that.

From Cor and Sood’s abstract:

Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—note no concern with the cited article.

I’m reminded of this story: “A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.”

This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine, indeed more appropriate and scientific, for the same reason that we don’t say that Steph Curry is 6’2.84378″ tall.

Emile Bravo and agency

I was reading Tome 4 of the adventures of Jules (see the last item here), and it struck me how much agency the characters had. They seemed to be making their own decisions, saying what they wanted to say, etc.

Just as a contrast, I’m also reading an old John Le Carre book, and here the characters have no agency at all. They’re just doing what is necessary to make the plot run. For Le Carre, that’s fine; the plot’s what it’s all about. So that’s an extreme case.

Anyway, I found the agency of Bravo’s characters refreshing. It’s not something I think about so often when reading, but this time it struck me.

P.S. I wrote about agency a few years ago in the context of Benjamin Kunkel’s book Indecision. I did a quick search and it doesn’t look like Kunkel has written much since. Too bad. But maybe he’s doing a Klam and it will be all right.

Research topic on the geography of partisan prejudice (more generally, county-level estimates using MRP)

1. An estimate of the geography of partisan prejudice

My colleagues David Rothschild and Tobi Konitzer recently published this MRP analysis, “The Geography of Partisan Prejudice: A guide to the most—and least—politically open-minded counties in America,” written up by Amanda Ripley, Rekha Tenjarla, and Angela He.

Ripley et al. write:

In general, the most politically intolerant Americans, according to the analysis, tend to be whiter, more highly educated, older, more urban, and more partisan themselves. This finding aligns in some ways with previous research by the University of Pennsylvania professor Diana Mutz, who has found that white, highly educated people are relatively isolated from political diversity. They don’t routinely talk with people who disagree with them; this isolation makes it easier for them to caricature their ideological opponents. . . . By contrast, many nonwhite Americans routinely encounter political disagreement. They have more diverse social networks, politically speaking, and therefore tend to have more complicated views of the other side, whatever side that may be. . . .

The survey results are summarized by this map:

I’m not a big fan of the discrete color scheme, which creates all sorts of discretization artifacts—but let’s leave that for another time. In future iterations of this project we can work on making the map clearer.

There are some funny things about this map and I’ll get to them in a moment, but first let’s talk about what’s being plotted here.

There are two things that go into the above map: the outcome measure and the predictive model, and it’s all described in this post from David and Tobi.

First, the outcome. They measured partisan prejudice by asking 14 partisan-related questions, from “How would you react if a member of your immediate family married a Democrat?” to “How well does the term ‘Patriotic’ describe Democrats?” to “How do you feel about Democratic voters today?”, asking 7 questions about each of the two parties and then fitting an item-response model to score each respondent who is a Democrat or Republican on how tolerant, or positive, they are about the other party.

Second, the model. They took data from 2000 survey responses and regressed these on individual and neighborhood (census block)-level demographic and geographic predictors to construct a model to implicitly predict “political tolerance” for everyone in the country, and then they poststratified, summing these up over estimated totals for all demographic groups to get estimates for county averages, which is what they plotted.
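The poststratification step itself is just a population-weighted average of the model's cell-level predictions. Here is a bare-bones sketch; the cell definitions, predictions, and counts are all hypothetical, and the real analysis uses many more cells and a fitted multilevel model rather than a hand-typed dictionary.

```python
def poststratify(cell_predictions, cell_counts):
    """Population-weighted average of cell-level predictions for one county."""
    total = sum(cell_counts.values())
    return sum(cell_predictions[c] * n for c, n in cell_counts.items()) / total

# Hypothetical model predictions of "tolerance" by demographic cell.
predictions = {
    ("white", "college"): 0.40,
    ("white", "no college"): 0.55,
    ("nonwhite", "college"): 0.60,
    ("nonwhite", "no college"): 0.65,
}
# Hypothetical counts of partisans in each cell for two counties.
counties = {
    "County X": {("white", "college"): 500, ("white", "no college"): 300,
                 ("nonwhite", "college"): 100, ("nonwhite", "no college"): 100},
    "County Y": {("white", "college"): 100, ("white", "no college"): 400,
                 ("nonwhite", "college"): 100, ("nonwhite", "no college"): 400},
}
for name, counts in counties.items():
    print(name, round(poststratify(predictions, counts), 3))
```

The two counties get different estimates purely because their demographic compositions differ, which is also why errors in the cell counts feed straight into the map.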

Having done the multilevel modeling and poststratification, they could plot all sorts of summaries, for example a map of estimated political tolerance just among whites, or a scatterplot of county-level estimated political tolerance vs. average education at the county level, or whatever. But we’ll focus on the map above.

2. Two concerns with the map and how it’s constructed

People have expressed two concerns about David and Tobi’s estimates.

First, the inferences are strongly model-based. If you’re getting estimates for 3000 counties from 2000 respondents (or even from 20,000 respondents, or 200,000), you’ll need to lean on a model. As a result, the map should not be taken to represent independent data within each county; rather, it’s a summary of a national-level model including individual and neighborhood (census block-level) predictors. As such, we want to think about ways of understanding and evaluating this model.

Second, the map shows some artifacts at state borders, most notably with Florida, South Carolina, New York state, South Dakota, Utah, and Wisconsin, also some suggestive patterns elsewhere such as the borders between Virginia and North Carolina, and Missouri and Arkansas. I’m not sure about all these: as noted above, the discrete color scheme can create apparent patterns from small variation, and there are real differences in political cultures between states (Utah comes to mind). But there are definitely some problems here, problems which David and Tobi attribute to differences between states in the voter files that are used to estimate the total number of partisans (Democrats and Republicans) in each demographic category in each county. If the voter files for neighboring states are coming from different sorts of data, this can introduce apparent differences in the poststratification stage. These counting problems are especially cumbersome because we have to estimate the total number of partisans in each demographic category in each county.

3. Four plans for further research

So, what to do about these concerns? I have four ideas, all of which involve some mix of statistics and political science research, along with good old data munging:

(a) Measurement error model for differences between states in classifications. The voter files have different meanings in different states? Model it, with some state effects that are estimated from the data and using whatever additional information we can find on the measurement and classification process.

(b) Varying intercept model plus spatial correlation as a fix to the state boundary problems. This is kind of a light, klugey version of the above option. We recognize that some state-level fix is needed, and instead of modeling the measurement error or coding differences directly, we throw in a state-level error term, along with a spatial correlation penalty term to enforce similarity across county boundaries (maybe only counting counties that are similar in certain characteristics such as ethnic breakdown and proportion urban/suburban/rural).

(c) Tracking down exactly what happened to create those artifacts at the state boundaries. Before or after doing the modeling to correct the glaring boundary artifacts, it would be good to do some model analysis to work out the “trail of breadcrumbs” explaining exactly how the particular artifacts we see arose, to connect the patterns on the map with what was going on in the data.

(d) Fake-data simulation to understand scenarios where the MRP approach could fail. As noted in point 2 above, there are legitimate concerns about the use of any model-based approach to draw inferences for 3000 counties from 2000 (or even 20,000 or 200,000) respondents. One way to get a sense of potential problems here is to construct some fake-data worlds in which the model-based estimates will fail.
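As a starting point for (d), here is a toy fake-data world. Everything in it is an assumption made for illustration: two demographic cells, hypothetical cell means, and a deliberately stripped-down "model" that estimates only national cell means with no county-level term. It is not the model David and Tobi fit; it just shows one way model-based county estimates can go wrong when counties differ in ways the predictors don't capture.

```python
import math
import random

def simulate_rmse(county_sd, n_counties=300, n_resp=2000, seed=0):
    """RMSE of county estimates from a demographics-only model in a fake-data world."""
    rng = random.Random(seed)
    cell_mean = {0: 0.0, 1: 1.0}        # hypothetical true cell-level means
    share1 = [rng.uniform(0.2, 0.8) for _ in range(n_counties)]
    effect = [rng.gauss(0, county_sd) for _ in range(n_counties)]

    # Survey: respondents come from random counties;
    # outcome = cell mean + county effect + individual noise.
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for _ in range(n_resp):
        c = rng.randrange(n_counties)
        cell = 1 if rng.random() < share1[c] else 0
        y = cell_mean[cell] + effect[c] + rng.gauss(0, 1)
        sums[cell] += y
        counts[cell] += 1
    # "Model": national cell means only, ignoring county entirely.
    fitted = {k: sums[k] / counts[k] for k in (0, 1)}

    sq_err = 0.0
    for c in range(n_counties):
        truth = share1[c] * cell_mean[1] + (1 - share1[c]) * cell_mean[0] + effect[c]
        est = share1[c] * fitted[1] + (1 - share1[c]) * fitted[0]
        sq_err += (est - truth) ** 2
    return math.sqrt(sq_err / n_counties)

for sd in (0.0, 0.25, 0.5):
    print(f"county sd = {sd}: RMSE = {simulate_rmse(sd):.2f}")
```

When the county-level standard deviation is zero, demographics explain everything and the estimates are close to the truth. As unmodeled county variation grows, the error grows with it. A more serious version would fit a multilevel model with county and state terms and ask how much of that variation it can recover from 2000 respondents.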

OK, so four research directions here. My inclination is to start with (b) and (d) because I’m kind of intimidated by the demographic classifications in the voter file, so I’d rather just consider them as a black box and try to fix them indirectly, rather than to model and understand them. Along similar lines, it seems to me that solving (b) and (d) will give us general tools that can be used in many other adjustment problems in sampling and causal inference. That said, (a) is appealing because it’s all about doing things right, and it could have real impact on future studies using the voter file, and (c) would be an example of building bridges between different models in statistical workflow, which is an idea I’ve talked about a lot recently, so I’d like to see that too.

“Heckman curve” update: The data don’t seem to support the claim that human capital investments are most effective when targeted at younger ages.

David Rea and Tony Burton write:

The Heckman Curve describes the rate of return to public investments in human capital for the disadvantaged as rapidly diminishing with age. Investments early in the life course are characterised as providing significantly higher rates of return compared to investments targeted at young people and adults. This paper uses the Washington State Institute for Public Policy dataset of program benefit cost ratios to assess if there is a Heckman Curve relationship between program rates of return and recipient age. The data does not support the claim that social policy programs targeted early in the life course have the largest returns, or that the benefits of adult programs are less than the cost of intervention.

Here’s the conceptual version of the curve, from a paper published by economist Heckman in 2006:

This graph looks pretty authoritative but of course it’s not directly data-based.

As Rea and Burton explain, the curve makes some sense:

Underpinning the Heckman Curve is a comprehensive theory of skills that encompass all forms of human capability including physical and mental health . . .

• skills represent human capabilities that are able to generate outcomes for the individual and society;

• skills are multiple in nature and cover not only intelligence, but also non cognitive skills, and health (Heckman and Corbin, 2016);

• non cognitive skills or behavioural attributes such as conscientiousness, openness to experience, extraversion, agreeableness and emotional stability are particularly influential on a range of outcomes, and many of these are acquired in early childhood;

• early skill formation provides a platform for further subsequent skill accumulation . . .

• families and individuals invest in the costly process of building skills; and

• disadvantaged families do not invest sufficiently in their children because of information problems rather than limited economic resources or capital constraints (Heckman, 2007; Cunha et al., 2010; Heckman and Mosso, 2015).

Early intervention creates higher returns because of a longer payoff period over which to generate returns.

But the evidence is not so clear. Rea and Burton write:

The original papers that introduced the Heckman Curve cited evidence on the relative return of human capital interventions across early childhood education, schooling, programs for at-risk youth, university and active employment and training programs (Heckman, 1999).

I’m concerned about these all being massive overestimates because of the statistical significance filter (see for example section 2.1 here or my earlier post here). The researchers have every motivation to exaggerate the effects of these interventions, and they’re using statistical methods that produce exaggerated estimates. Bad combination.
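To see how the statistical significance filter produces overestimates, here is a quick simulation sketch. The true effect and standard error are hypothetical and chosen to represent a noisy study; nothing here is estimated from the programs discussed above.

```python
import random

def exaggeration_ratio(true_effect, se, n_sims=100_000, seed=0):
    """Simulate estimates ~ N(true_effect, se) and keep only the 'significant' ones.

    Returns (mean |estimate| among significant results / true effect,
             share of simulations that reached significance).
    """
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [est for est in estimates if abs(est) > 1.96 * se]
    mean_abs = sum(abs(est) for est in significant) / len(significant)
    return mean_abs / true_effect, len(significant) / n_sims

# Hypothetical noisy study: true effect equal to one standard error.
ratio, share_significant = exaggeration_ratio(true_effect=1.0, se=1.0)
print(f"share significant: {share_significant:.2f}")
print(f"exaggeration ratio among significant results: {ratio:.1f}")
```

Because an estimate has to be at least 1.96 standard errors from zero to pass the filter, any estimate that passes in this setup is necessarily much larger in magnitude than the true effect. The noisier the study relative to the true effect, the worse the exaggeration.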

Rea and Burton continue:

A more recent review by Heckman and colleagues is contained in an OECD report Fostering and Measuring Skills: Improving Cognitive and Non-Cognitive Skills to Promote Lifetime Success (Kautz et al., 2014). . . . Overall 27 different interventions were reviewed . . . twelve had benefit cost ratios reported . . . Consistent with the Heckman Curve, programs targeted to children under five have an average benefit cost ratio of around 7, while those targeted at older ages have an average benefit cost ratio of just under 2.


This result is however heavily influenced by the inclusion of the Perry Preschool programme and the Abecedarian Project. These studies are somewhat controversial in the wider literature . . . Many researchers argue that the Perry Preschool programme and the Abecedarian Project do not provide a reliable guide to the likely impacts of early childhood education in a modern context . . .

Also the statistical significance filter. A defender of those studies might argue that these biases don’t matter because they could be occurring for all studies, not just early childhood interventions. But these biases can be huge, and in general it’s a mistake to ignore huge biases in the vague hope that they may be canceling out.


The data on programs targeted at older ages do not appear to be entirely consistent with the Heckman Curve. In particular the National Guard Challenge program and the Canadian Self-Sufficiency Project provide examples of interventions targeted at older age groups which have returns that are larger than the cost of funds.

Overall the programs in the OECD report represent only a small sample of the human capital interventions with well measured program returns . . . many rigorously studied and well known interventions are not included.

So Rea and Burton decide to perform a meta-analysis:

In order to assess the Heckman Curve we analyse a large dataset of program benefit cost ratios developed by the Washington State Institute for Public Policy.

Since the 1980s the Washington State Institute for Public Policy has focused on evidence-based policies and programs with the aim of providing state policymakers with advice about how to make best use of taxpayer funds. The Institute’s database covers programs in a wide range of areas including child welfare, mental health, juvenile and adult justice, substance abuse, healthcare, higher education and the labour market. . . .

The August 2017 update provides estimates of the benefit cost ratios for 314 interventions. . . . The programs also span the life course with 10% of the interventions being aimed at children 5 years and under.

And here’s what they find:

Wow, that’s one ugly graph! Can’t you do better than that? I also don’t really know what to do with these numbers. Benefit-cost ratios of 90! That’s the kind of thing you see with, what, a plan to hire more IRS auditors? I guess what I’m saying is that I don’t know which of these dots I can really trust, which is a problem with a lot of meta-analyses (see for example here).

To put it another way: Given what I see in Rea and Burton’s paper, I’m prepared to agree with their claim that the data don’t support the diminishing-returns “Heckman curve”: The graph from that 2006 paper, reproduced at the top of this post, is just a story that’s not backed up by what is known. At the same time, I don’t know how seriously to take the above scatterplot, as many or even most of the dots there could be terrible estimates. I just don’t know.

In their conclusion, Rea and Burton say that their results do not “call into question the more general theory of human capital and skills advanced by Heckman and colleagues.” They express the view that:

Heckman’s insights about the nature of human capital are essentially correct. Early child development is a critical stage of human development, partly because it provides a foundation for the future acquisition of health, cognitive and non-cognitive skills. Moreover the impact of an effective intervention in childhood has a longer period of time over which any benefits can accumulate.

Why, then, do the diminishing returns of interventions not show up in the data? Rea and Burton write:

The importance of early child development and the nature of human capital are not the only factors that influence the rate of return for any particular intervention. Overall the extent to which a social policy investment gives a good rate of return depends on the assumed discount rate, the cost of the intervention, the intervention’s ability to impact on outcomes, the time profile of impacts over the life course, and the value of the impacts.

Some interventions may be low cost which will make even modest impacts cost effective.

The extent of targeting and the deadweight loss of the intervention are also important. Some interventions may be well targeted to those who need the intervention and hence offer a good rate of return. Other interventions may be less well targeted and require investment in those who do not require the intervention. A potential example of this might be interventions aimed at reducing youth offending. While early prevention programs may be effective at reducing offending, they are not necessarily more cost effective than later interventions if they require considerable investment in those who are not at risk.

Another consideration is the proximity of an intervention to the time where there are the largest potential benefits. . . .

Another factor is that the technology or active ingredients of interventions differ, and it is not clear that those targeted to younger ages will always be more effective. . . .

In general there are many circumstances where interventions to deliver ‘cures’ can be as cost effective as ‘prevention’. Many aspects of life have a degree of unpredictability and interventions targeted at those who experience an adverse event (such as healthcare in response to a car accident) can plausibly be as cost effective as prevention efforts.

These are all interesting points.

P.S. I sent Rea some of these comments, and he wrote:

I had previously read your paper ‘The failure of the null hypothesis’, and remember being struck by the para:

The current system of scientific publication encourages the publication of speculative papers making dramatic claims based on small, noisy experiments. Why is this? To start with, the most prestigious general-interest journals—Science, Nature, and PNAS—require papers to be short, and they strongly favor claims of originality and grand importance….

I had thought at the time that this applied to the original Heckman paper in Science.

I think we agree with your point about not being able to draw any positive conclusions from our data. The paper is meant to be more in the spirit of ‘here is an important claim that has been highly influential in public policy, but when we look at what we believe is a carefully constructed dataset, we don’t see any support for the claim’. We probably should frame it more about replication and an invitation for other researchers to try and do something similar using other datasets.

Your point about the underlying data drawing on effect sizes that are likely biased is something we need to reflect in the paper. But in defense of the approach, my assumption is that well conducted meta analysis (which Washington State Institute for Public Policy use to calculate their overall impacts) should moderate the extent of the bias. Searching for unpublished research, and including all robust studies irrespective of the magnitude and significance of the impact, and weighting by each study’s precision, should overcome some of the problems? In their meta analysis, Washington State also reduce a study’s contribution to the overall effect size if there is evidence of a conflict of interest (the researcher was also the program developer).

On the issue of the large effect sizes from the early childhood education experiments (Perry PreSchool and Abecedarian Project), the recent meta analysis of high quality studies by McCoy et al. (2017) was helpful for us.

Generally the later studies have shown smaller impacts (possibly because control groups are now not so deprived of other services). Here is one of their lovely forest plots on grade retention. I am just about to go and see if they did any analysis of publication bias.

Treatment interactions can be hard to estimate from data.

Brendan Nyhan writes:

Per #3 here, just want to make sure you saw the Coppock Leeper Mullinix paper indicating treatment effect heterogeneity is rare.

My reply:

I guess it depends on what is being studied. In the world of evolutionary psychology etc., interactions are typically claimed to be larger than main effects (for example, that claim about fat arms and redistribution). It is possible that in the real world, interactions are not so large.

To step back a moment, I don’t think it’s quite right to say that treatment effect heterogeneity is “rare.” All treatment effects vary. So the question is not, Is there treatment effect heterogeneity?, but rather, How large is treatment effect heterogeneity? In practice, heterogeneity can be hard to estimate, so all we can say is that, whatever variation there is in the treatment effects, we can’t estimate it well from the data alone.

In real life, when people design treatments, they need to figure out all sorts of details. Presumably the details matter. These details are treatment interactions, and they’re typically designed entirely qualitatively, which makes sense given the difficulty of estimating their effects from data.

“The Long-Run Effects of America’s First Paid Maternity Leave Policy”: I need that trail of breadcrumbs.

Tyler Cowen links to a research article by Brenden Timpe, “The Long-Run Effects of America’s First Paid Maternity Leave Policy,” that begins as follows:

This paper provides the first evidence of the effect of a U.S. paid maternity leave policy on the long-run outcomes of children. I exploit variation in access to paid leave that was created by long-standing state differences in short-term disability insurance coverage and the state-level roll-out of laws banning discrimination against pregnant workers in the 1960s and 1970s. While the availability of these benefits sparked a substantial expansion of leave-taking by new mothers, it also came with a cost. The enactment of paid leave led to shifts in labor supply and demand that decreased wages and family income among women of child-bearing age. In addition, the first generation of children born to mothers with access to maternity leave benefits were 1.9 percent less likely to attend college and 3.1 percent less likely to earn a four-year college degree.

I was curious so I clicked through and took a look. It seems that the key comparisons are at the state-year level, with some policy changes happening in different states at different years. So what I’d like to see are some time series for individual states and some scatterplots of state-years. Some other graphs, too, although I’m not quite sure what. The basic idea is that this is an observational study in which the treatment is some policy change, so we’re comparing state-years with and without this treatment; I’d like to see a scatterplot of the outcome vs. some pre-treatment measure, with different symbols for treatment and control cases. As it is, I don’t really know what to make of the results, what with all the processing that has gone on between the data and the estimate.

In general I am skeptical about results such as given in the above abstract because there are so many things that can affect college attendance. Trends can vary by state, and this sort of analysis will simply pick up whatever correlation there might be, between state-level trends and the implementation of policies. There are lots of reasons to think that the states where a given policy would be more or less likely to be implemented, happen to be states where trends in college attendance are higher or lower. This is all kind of vague because I’m not quite sure what is going on in the data—I didn’t notice a list of which states were doing what. My general point is that to understand and trust such an analysis I need a “trail of bread crumbs” connecting data, theory, and conclusions. The theory in the paper, having to do with economic incentives and indirect effects, seemed a bit farfetched to me but not impossible—but it’s not enough for me to just have the theory and the regression table; I really need to understand where in the data the result is coming from. As it is, this just seems like two state-level variables that happen to be correlated. There might be something here; I just can’t say.

P.S. Cowen’s commenters express lots of skepticism about this claim. I see this skepticism as a good sign, a positive aspect of the recent statistical crisis in science that people do not automatically accept this sort of quantitative claim, even when it is endorsed by a trusted intermediary. I suspect that Cowen too is happy that his readers read him critically and don’t believe everything he posts!

What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?

The punch line

“Your readers are my target audience. I really want to convince them that it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.”

The background

It started with an email from Erik van Zwet, who wrote:

In 2013, you wrote about the hidden dangers of non-informative priors:

Finally, the simplest example yet, and my new favorite: we assign a non-informative prior to a continuous parameter theta. We now observe data, y ~ N(theta, 1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 0.84 that theta>0. I don’t believe that 0.84. I think (in general) that it is too high.

I agree – at least if theta is a regression coefficient (other than the intercept) in the context of the life sciences.

In this paper [which has since been published in a journal], I propose that a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator. The posterior is the normal distribution with mean y/2 and standard deviation SE/sqrt(2). So that’s a default Edlin factor of 1/2. I base my proposal on two very different arguments:

1. The uniform (flat) prior is considered by many to be non-informative because of certain invariance properties. However, I argue that those properties break down when we reparameterize in terms of the sign and the magnitude of theta. Now, in my experience, the primary goal of most regression analyses is to study the direction of some association. That is, we are interested primarily in the sign of theta. Under the prior I’m proposing, P(theta > 0 | y) has the standard uniform distribution (Theorem 1 in the paper). In that sense, the prior could be considered to be non-informative for inference about the sign of theta.

2. The fact that we are considering a regression coefficient (other than the intercept) in the context of the life sciences is actually prior information. Now, almost all research in the life sciences is listed in the MEDLINE (PubMed) database. In the absence of any additional prior information, we can consider papers in MEDLINE that have regression coefficients to be exchangeable. I used a sample of 50 MEDLINE papers to estimate the prior and found the normal distribution with mean zero and standard deviation 1.28*SE. The data and my analysis are available here.

The two arguments are very different, so it’s nice that they yield fairly similar results. Since published effects tend to be inflated, I think the 1.28 is somewhat overestimated. So, I end up recommending the N(0,SE^2) as default prior.

I think it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.
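The numbers in this email are easy to check. Here is a minimal sketch, assuming the usual conjugate normal setup (y ~ N(theta, SE^2), with either a flat prior or the proposed N(0, SE^2) prior). The function name `posterior` and the simulation settings are my own choices for illustration, not anything from the paper.

```python
import random
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def posterior(y, se, prior_sd):
    """Conjugate normal update for y ~ N(theta, se^2), theta ~ N(0, prior_sd^2)."""
    shrink = prior_sd**2 / (prior_sd**2 + se**2)   # the Edlin factor
    return shrink * y, sqrt(shrink) * se           # posterior mean, posterior sd

y, se = 1.0, 1.0

# Flat prior: the posterior is N(y, se^2), so P(theta > 0 | y) = Phi(y / se).
p_flat = std_normal.cdf(y / se)

# Proposed default prior N(0, se^2): the posterior is N(y/2, se^2/2).
post_mean, post_sd = posterior(y, se, prior_sd=se)
p_default = std_normal.cdf(post_mean / post_sd)

print(round(p_flat, 2), round(p_default, 2))   # 0.84 0.76

# Simulation check of the uniformity claim: draw theta from the prior,
# then y given theta, and look at the distribution of P(theta > 0 | y).
rng = random.Random(0)
probs = []
for _ in range(50_000):
    theta = rng.gauss(0.0, se)
    y_sim = rng.gauss(theta, se)
    m, s = posterior(y_sim, se, prior_sd=se)
    probs.append(std_normal.cdf(m / s))

# If the probabilities are uniform on (0, 1), these shares should be
# close to 0.25, 0.5 and 0.75.
shares = [sum(p <= q for p in probs) / len(probs) for q in (0.25, 0.5, 0.75)]
print(shares)
```

So the observation y=1 that gave 0.84 under the flat prior gives about 0.76 under the proposed default, and the simulated posterior probabilities of a positive sign look uniform when theta really is drawn from that prior.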

Hmmm . . . one way to think about this idea is to consider where it doesn’t make sense. You write, “a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator.” Let’s consider two cases where this default won’t work:

– The task is to estimate someone’s weight with one measurement on a scale where the measurements have standard deviation 1 pound, and you observe 150 pounds. You’re not going to want to partially pool that all the way to 75 pounds. The point here, I suppose, is that the goal of the measurement is not to estimate the sign of the effect. But we could do the same reasoning where the goal was to estimate the sign. For example, I weigh you, then I weigh you again a year later. I’m interested in seeing if you gained or lost weight. The measurement was 150 pounds last year and 140 pounds this year. The classical estimate of the difference of the two measurements is 10 +/- 1.4. Would I want to partially pool that all the way to 5? Maybe, in that these are just single measurements and your weight can fluctuate. But that can’t be the motivation here, because we could just as well take 100 measurements at one time and 100 measurements a year later, so now maybe your average is, say, 153 pounds last year and 143 pounds this year: an estimated change of 10 +/- 0.14. We certainly wouldn’t want to use a super-precise prior with mean 0 and sd 0.14 here!

– The famous beauty-and-sex-ratio study where the difference in probability of girl birth, comparing children of beautiful and non-beautiful parents, was estimated from some data to be 8 percentage points +/- 3 percentage points. In this case, an Edlin factor of 0.5 is not enough. Pooling down to 4 percentage points is not enough pooling. A better estimate of the difference would be 0 percentage points, or 0.01 percentage points, or something like that.

I guess what I’m getting at is that the balance between prior and data changes as we get more information, so I don’t see how a fixed amount of partial pooling can work.
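Here is a small sketch of that point, using the weight example. With a prior whose sd is fixed in advance, the Edlin factor prior_sd^2 / (prior_sd^2 + se^2) moves toward 1 as the standard error shrinks; with the prior sd set equal to the standard error, it is stuck at 1/2. The fixed prior sd of 5 pounds for a one-year weight change is a made-up number for illustration only.

```python
from math import sqrt

def edlin_factor(prior_sd, se):
    """Multiplier applied to the raw estimate under a N(0, prior_sd^2) prior."""
    return prior_sd**2 / (prior_sd**2 + se**2)

measurement_sd = 1.0    # pounds, as in the scale example
fixed_prior_sd = 5.0    # hypothetical prior sd for a one-year change in weight
raw_change = 10.0       # observed difference in pounds

rows = []
for n in (1, 100):      # measurements taken at each of the two time points
    se_diff = measurement_sd * sqrt(2 / n)
    fixed = edlin_factor(fixed_prior_sd, se_diff)
    scaled = edlin_factor(se_diff, se_diff)   # prior sd set equal to the se
    rows.append((n, se_diff, fixed, scaled))
    print(f"n={n:3d}  se={se_diff:.2f}  "
          f"fixed prior: {fixed * raw_change:.2f}  "
          f"se-scaled prior: {scaled * raw_change:.2f}")
```

Under the fixed prior the estimate stays near 10 and gets closer as the data get more precise; under the se-scaled prior it is 5 pounds either way.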

That said, maybe I’m missing something here. After all, a default can never cover all cases, and the current default of no partial pooling or flat prior has all sorts of problems. So we can think more about this.

P.S. In the months since I wrote the above post, Zwet sent along further thoughts:

Since I emailed you in the fall, I’ve continued thinking about default priors. I have a clearer idea now about what I’m trying to do:

In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular. Second, the literature is not always reliable because of publication bias and such. Third, it is generally unclear what the scope of the meta-analysis should be.

Now, researchers often want to be “objective” or “non-informative”. I believe this can be accomplished by performing a meta-analysis with a very wide scope. One might think that this would lead to very diffuse priors, but that turns out not to be the case! Using a very wide scope to obtain prior information also means that the same meta-analysis can be recycled in many situations.

The problem of publication bias in the literature remains, but there may be ways to handle that. In the paper I sent earlier, I used p-values from univariable regressions that were used to “screen” variables for a multivariable model. I figure that those p-values should be largely unaffected by selection on significance, simply because that selection is still to be done!

More recently, I’ve used a set of “honest” p-values that were generated by the Open Science Collaboration in their big replication project in psychology (Science, 2015). I’ve estimated a prior and then computed type S and M errors. I attach the results together with the (publicly available) data. The results are also here.

Zwet’s new paper is called Default prior for psychological research, and it comes with two data files, here and here.

It’s an appealing idea, and in practice it should be better than the current default Edlin factor of 1 (that is, no partial pooling toward zero at all). And I’ve talked a lot about constructing default priors based on empirical information, so it’s great to see someone actually doing it. Still, I have some reservations about the specific recommendations, for the reasons expressed in my response to Zwet above. Like him, I’m curious about your thoughts on this.

I also wrote something on this in our Prior Choice Recommendations wiki:

Default prior for treatment effects scaled based on the standard error of the estimate

Erik van Zwet suggests an Edlin factor of 1/2. Assuming that the existing or published estimate is unbiased with known standard error, this corresponds to a default prior that is normal with mean 0 and sd equal to the standard error of the data estimate. This can’t be right–for any given experiment, as you add data, the standard error should decline, so this would suggest that the prior depends on sample size. (On the other hand, the prior can often only be understood in the context of the likelihood, so we can’t rule out an improper or data-dependent prior out of hand.)

Anyway, the discussion with Zwet got me thinking. If I see an estimate that’s 1 se from 0, I tend not to take it seriously; I partially pool it toward 0. So if the data estimate is 1 se from 0, then, sure, the normal(0, se) prior seems reasonable as it pools the estimate halfway to 0. But if the data estimate is, say, 4 se’s from zero, I wouldn’t want to pool it halfway: at this point, zero is not so relevant. This suggests something like a t prior. Again, though, the big idea here is to scale the prior based on the standard error of the estimate.

Another way of looking at this prior is as a formalization of what we do when we see estimates of treatment effects. If the estimate is only 1 standard error away from zero, we don’t take it too seriously: sure, we take it as some evidence of a positive effect, but far from conclusive evidence–we partially pool it toward zero. If the estimate is 2 standard errors away from zero, we still think the estimate has a bit of luck to it–just think of the way in which researchers, when their estimate is 2 se’s from zero, (a) get excited and (b) want to stop the experiment right there so as not to lose the magic–hence some partial pooling toward zero is still in order. And if the estimate is 4 se’s from zero, we just tend to take it as is.
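To see what "something like a t prior" would do, here is a rough numerical sketch. Everything is in units of the standard error, so the data model is z ~ N(theta, 1). The normal prior has sd 1, and the t prior has scale 1 with 3 degrees of freedom. The choice of 3 degrees of freedom is an assumption for illustration; the text above does not specify one. The posterior mean is computed by brute-force integration on a grid.

```python
from math import exp, log

def shrinkage_factor(z, log_prior, lo=-40.0, hi=40.0, step=0.01):
    """Posterior mean of theta divided by z, for z ~ N(theta, 1), by grid integration."""
    num = den = 0.0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        theta = lo + i * step
        # Unnormalized posterior density: prior times likelihood.
        w = exp(log_prior(theta) - 0.5 * (z - theta) ** 2)
        num += theta * w
        den += w
    return num / den / z

def log_normal_prior(theta):
    """Log density (up to a constant) of normal(0, 1), i.e. sd equal to 1 se."""
    return -0.5 * theta**2

def log_t_prior(theta, df=3.0):
    """Log density (up to a constant) of a t prior with scale 1 se; df is assumed."""
    return -0.5 * (df + 1.0) * log(1.0 + theta**2 / df)

factors = {}
for z in (1.0, 2.0, 4.0):
    factors[z] = (shrinkage_factor(z, log_normal_prior),
                  shrinkage_factor(z, log_t_prior))
    print(f"z={z:.0f}  normal prior: {factors[z][0]:.2f}  t prior: {factors[z][1]:.2f}")
```

The normal prior pools every estimate halfway to zero no matter how far out it is. The t prior pools about as much for an estimate 1 se from zero and noticeably less for one 4 se's from zero, which is the qualitative behavior described above. The exact numbers depend on the assumed degrees of freedom and scale.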

I sent some of the above to Zwet, who replied:

I [Zwet] proposed that default Edlin factor of 1/2 only when the estimate is less than 3 se’s away from zero (or rather, p<0.001). I used a mixture of two zero-mean normals; one with sd=0.68 and the other with sd=3.94. I’m quite happy with the fit. The shrinkage is a little more than 1/2 when the estimate is close to zero, and disappears gradually for larger estimates. It’s in the data! You can see it when you do a “wide scope” meta-analysis.
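A sketch of how such a mixture prior behaves, with loud caveats: the two sds (0.68 and 3.94) are from Zwet's note, but the mixing weights are not given there, so the equal weights below are purely a placeholder. I'm also assuming the sds are on the z-score scale (effect divided by standard error). With a zero-mean normal mixture prior the posterior mean is available in closed form, so no numerical integration is needed.

```python
from math import exp, pi, sqrt

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))

def mixture_shrinkage(z, sds=(0.68, 3.94), weights=(0.5, 0.5)):
    """Posterior mean / z for z ~ N(theta, 1) with a zero-mean normal mixture prior.

    The weights are a placeholder assumption, not values from the paper.
    """
    # Under component k, the marginal distribution of z is N(0, sd_k^2 + 1).
    evidence = [w * normal_pdf(z, sqrt(sd**2 + 1)) for w, sd in zip(weights, sds)]
    total = sum(evidence)
    post_weights = [e / total for e in evidence]
    # Within component k, the estimate is multiplied by sd_k^2 / (sd_k^2 + 1).
    component_factors = [sd**2 / (sd**2 + 1) for sd in sds]
    return sum(pw * f for pw, f in zip(post_weights, component_factors))

shrink = {z: mixture_shrinkage(z) for z in (0.5, 1.0, 2.0, 3.0, 4.0)}
for z, f in shrink.items():
    print(f"z={z:.1f}  estimate multiplied by {f:.2f}")
```

With these placeholder weights, estimates near zero get multiplied by a bit less than 1/2, and the pooling fades out for estimates several se's from zero. That matches the pattern Zwet describes, but the actual curve depends on the fitted weights.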