Maintenance cost is quadratic in the number of features

Bob Carpenter shares this story illustrating the challenges of software maintenance. Here’s Bob:

This started with the maintenance of upgrading to the new Boost version 1.69, which is this pull request:

https://github.com/stan-dev/math/pull/1082

for this issue:

https://github.com/stan-dev/math/issues/1081

The issue happens first, then the pull request, then the fun of debugging starts.

Today’s story starts with an issue from today [18 Dec 2018] reported by Daniel Lee, the relevant text of which is:

@bgoodri, it looks like the unit tests for integrate_1d is failing. It looks like the new version of Boost has different behavior then what was there before.

This is a new feature (1D integrator) and it already needs maintenance.

This issue popped up when we updated Boost 1.68 to Boost 1.69. Boost is one of only three C++ libraries we depend on, but we use it everywhere (the other two libraries are limited to matrix operations and solving ODEs). Boost has been through about 20 versions since we started the project, at two or three releases a year.

Among other reasons, we have to update Boost because we have to keep in sync with the CRAN package BH (Boost headers) due to CRAN’s maximum package size limitations. We can’t distribute our own version of Boost so as to control when these maintenance events happen, but we’d have to keep updating anyway just to keep up with Boost’s bug fixes and new features, etc.

What does this mean in practical terms? Messages like the one above pop up. I get flagged, as does everyone else following the math lib issues. Someone has to create a GitHub issue, create a GitHub branch, debug the problem on the branch, create a GitHub pull request, get that GitHub pull request to pass tests on all platforms for continuous integration, get the code reviewed, make any updates required by code review and test again, then merge. This is all after the original issue and pull request to update Boost. That was just the maintenance that revealed the bug.

This is not a five minute job.

It’ll take one person-hour minimum with all the GitHub overhead and reviewing. And it’ll take something like a compute-day on our continuous integration servers if it passes the tests (less for failures). Debugging may take anywhere from 10 minutes to a day or maybe two in the extreme.

My point is just that the more things we have like integrate_1d, the more of these things come up. As a result, maintenance cost is quadratic in the number of features.

Bob summarizes:

It works like this:

Let’s suppose a maintenance event comes up every 2 months or so (e.g., a new version of Boost, a reorg of the repo, a new C++ version, etc.). For each maintenance event, the amount of maintenance we have to do is proportional to the number of features we have. If features grow linearly over time, the work per event looks like 1 + 2 + 3 + …, and since we do maintenance at regular intervals, the total amount of time it takes is quadratic.

This is why I’m always so reluctant to add features, especially when they have complicated dependencies.
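Spelled out, Bob’s arithmetic is just the sum of the first n integers. Assuming roughly one new feature per maintenance interval and n maintenance events so far, the cumulative work is

\[
1 + 2 + \cdots + n = \frac{n(n+1)}{2} = O(n^2),
\]

which grows quadratically in n, and hence in the number of features.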

That illusion where you think the other side is united and your side is diverse

Lots of people have written about this illusion of perspective: The people close to you look to be filled with individuality and diversity, while the people way over there in the other corner of the room all look kind of alike.

But widespread knowledge of this illusion does not stop people from succumbing to it. Here’s Michael Tomasky writing in the New York Times about what would happen if America had a proportional-representation voting system:

Let’s just imagine that we had a pure parliamentary system in which we elected our representatives by proportional representation, so that if a minor party’s candidates got 4 percent of the legislative votes, they’d win 4 percent of the seats. What might our party alignment look like?

He identifies six hypothetical parties: the center left, the socialist left, the green left, a party for ethnic and lifestyle minorities, a white nationalist party, and a center-right party. Thus, Tomasky continues:

If I’m right, the Democrats would split into four parties, and the Republicans into two, although the second one would be tiny. In other words: The Trump-era Republican Party already is in essence a parliamentary party. . . .

The Democrats, however, are an unruly bunch. . . . The Democrats will never be a party characterized by parliamentary discipline; unlike the Republicans, their constituencies are too heterogeneous.

When it comes to racial/ethnic diversity, sure, the two parties are much different, with Democrats being much more of a coalition of groups and the Republicans being overwhelmingly white. More generally, though, no, I don’t buy Tomasky’s argument. He’s a liberal Democrat, so from his perspective his side is full of different opinions and argumentation. But I think that a columnist coming from the opposite side of the political spectrum would see it the other way, noticing all the subtleties in the Republican position. Overall, the Democrats and Republicans each receive about 30% of the vote (with typically a slightly higher percentage for the Democrats), with the other 40% voting for other parties or, mostly, not voting at all. I don’t think it makes sense to say that one group of 30% could support four different parties with the other group of 30% only supporting two. Even though I can see how it would look like that from one side.

Gremlin time: “distant future, faraway lands, and remote probabilities”

Chris Wilson writes:

It appears that Richard Tol is still publishing these data, only now fitting a piecewise linear function to the same data-points.
https://academic.oup.com/reep/article/12/1/4/4804315#110883819

Also still looks like counting 0 as positive, “Moreover, the 11 estimates for warming of 2.5°C indicate that researchers disagree on the sign of the net impact: 3 estimates are positive and 8 are negative. Thus it is unclear whether climate change will lead to a net welfare gain or loss.”

This is a statistically mistaken thing for Tol to do, to use a distribution of point estimates to make a statement about what might happen. To put it another way: suppose all 11 estimates were negative. That alone would not mean that it would be clear that climate change would lead to a net welfare loss. Even setting aside that “welfare loss” is not, and can’t be, clearly defined, the 11 estimates can—indeed, should—be correlated.
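To make that point concrete, here’s a minimal R simulation with entirely made-up numbers (not Tol’s data). When the estimates share a common error component (say, because they rely on overlapping data and similar methods), whole sets of them can land on the same side of zero even when the true net impact is essentially nil, so counting positive and negative signs tells us little:

```r
# Made-up illustration, not Tol's data: 11 estimates sharing a common error
# component (overlapping data, similar methods). Even with a true net impact
# of zero, all 11 can come out negative together, so the count of positive
# vs. negative point estimates says little about the sign of the true effect.
set.seed(123)
n_sims <- 10000
n_est <- 11
true_impact <- 0
all_negative <- replicate(n_sims, {
  common_error <- rnorm(1, 0, 1)        # shared across the estimates
  study_error  <- rnorm(n_est, 0, 0.5)  # study-specific noise
  estimates <- true_impact + common_error + study_error
  all(estimates < 0)
})
mean(all_negative)  # roughly 0.2 with these made-up settings
```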

Tol’s statement is also odd if you look at his graph:

As Wilson notes, even if you take that graph at face value (which I don’t think you should, for reasons we’ve discussed before on this blog), what you really have is 1 positive point, several points that are near zero (but one of those points corresponds to a projection of global cooling so it’s not relevant to this discussion), and several more points that are negative. And, as we’ve discussed earlier, all the positivity is being driven by one single point, which is Tol’s own earlier study.

Tol’s paper also says:

This review of estimates in the literature indicates that the impact of climate change on the economy and human welfare is likely to be limited, at least in the twenty-first century. . . . negative impacts will be substantially greater in poorer, hotter, and lower-lying countries . . . climate change would appear to be an important issue primarily for those who are concerned about the distant future, faraway lands, and remote probabilities.

I’m surprised to see this sort of statement in a scientific journal. “Faraway lands”?? Who talks like that? I looked up the journal description and found this:

The Review of Environmental Economics and Policy is the official journal of the Association of Environmental and Resource Economists and the European Association of Environmental and Resource Economists.

So I guess they are offering a specifically European perspective. Europe is mostly kinda cold, so global warming is mostly about faraway lands. Still seems kinda odd to me.

P.S. Check out the x-axis on the above graph. “Centigrade” . . . Wow—I didn’t know that anyone still used that term!

The Arkansas paradox

Palko writes:

I had a recent conversation with a friend back in Arkansas who gives me regular updates on the state and local news. A few days ago he told me about a poll that was getting a fair amount of coverage. (See also here, for example.) The poll showed that a number of progressive social issues, like marriage equality, were for the first time getting majority support in the state. This agrees with a great deal of anecdotal evidence I’ve observed which suggests a strange paradox in the state (and, I suspect, in much of the Bible Belt). We are seeing a simultaneous spike in tolerance and intolerance around the very same issues.

Don’t get me wrong. I’m not saying that Russellville, Arkansas, has become a utopia of inclusiveness, but from a historical standpoint, the acceptance of people who are openly gay or who are in an interracial relationship has never been higher in the area. At the same time, conservative media has achieved critical mass, racist and inflammatory rhetoric is at a 50-year high, and the reactionaries have gained full control of the government for the first time since at least the election of Sen. Fulbright.

Arkansas is getting redder in partisan terms while looking increasingly purple ideologically.

I’m not sure how to think about this one, so I’m bouncing it over to you, the readers.

Stan examples in Harezlak, Ruppert and Wand (2018) Semiparametric Regression with R

I saw earlier drafts of this when it was in preparation and they were great.

Jarek Harezlak, David Ruppert and Matt P. Wand. 2018. Semiparametric Regression with R. UseR! Series. Springer.

I particularly like the careful evaluation of variational approaches. I also very much like that it’s packed with visualizations and largely based on worked examples with real data and backed by working code. Oh, and there are also Stan examples.

Overview

From the authors:

Semiparametric Regression with R introduces the basic concepts of semiparametric regression and is focused on applications and the use of R software. Case studies are taken from environmental, economic, financial, medical and other areas of applications. The book contains more than 50 exercises. The HRW package that accompanies the book contains all of the scripts used in the book, as well as datasets and functions.

There’s a sample chapter linked from the book’s site. It’s the intro chapter with lots of examples.

R code

There’s a thorough site supporting the book with all the R code. R comes with its own warning label on the home page:

All of the examples and exercises in this book [Semiparametric Regression with R] depend on the R computing environment. However, since R is continually changing readers should regularly check the book’s News, Software Updates and Errata web-site.

You’ve got to respect the authors’ pragmatism and forthrightness. I’m pretty sure most of the backward-compatibility problems users experience in R are due to contributed packages, not the language itself.
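If you want to poke around the companion package before committing to the book, here’s a minimal sketch of getting started. The package name HRW comes from the authors’ description above; the calls below are standard R, and the exact layout of the bundled scripts should be checked against the book’s website:

```r
# Install and load the companion package, which per the authors contains
# the book's scripts, datasets, and functions.
install.packages("HRW")
library(HRW)

# List the datasets bundled with the package.
data(package = "HRW")

# Show where the package was installed, so you can browse the bundled files.
system.file(package = "HRW")
```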

Background reading

The new book’s based on an earlier book by an overlapping set of authors:

D. Ruppert, M. P. Wand and R. J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.

Cost and Format

First the good news. You can buy a pdf. I wish more authors and publishers had this as an option. I want to read everything in pdf format on my iPad.

Now the bad news. The pdf is US$89.00 or $29.95 per chapter. The softcover book is US$119.99. The printed book’s a bit less on Amazon at US$109.29 as of today. I wonder who works out the pennies in these prices.

Here’s the Springer page for the book in case you want a pdf.

Sometimes the Columbia library has these Springer books available to download a chapter at a time as pdfs. I’ll have to check about this one when I’ve logged back into the network.

Do regression structures affect research capital? The case of pronoun drop. (also an opportunity to quote Bertrand Russell: This is one of those views which are so absurd that only very learned men could possibly adopt them.)

A linguist pointed me with incredulity to this article by Horst Feldmann, “Do Linguistic Structures Affect Human Capital? The Case of Pronoun Drop,” which begins:

This paper empirically studies the human capital effects of grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. By de‐emphasizing the significance of the individual, such languages may perpetuate ancient values and norms that give primacy to the collective, inducing governments and families to invest relatively little in education because education usually increases the individual’s independence from both the state and the family and may thus reduce the individual’s commitment to these institutions. Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative effects of pronoun‐drop languages. The individual‐level analysis uses data on 114,894 individuals from 75 countries over 1999‐2014. It establishes that speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the effect is substantial, particularly among females.

Another linguist saw this paper and asked if it was a prank.

I don’t think it’s a prank. I think it’s serious.

It would be easy, and indeed reasonable, to just laugh at this one and move on, to file it alongside other cross-country comparisons such as this—but I thought it could be instructive instead to take the paper seriously and see what went wrong.

I’m hoping these steps can be useful to students when trying to understand published research. Or, for that matter, when trying to understand their own regression.

So how can we figure out what’s really going on in this article?

To start with, the claimed effect is within-person (speaking a certain type of language affects your behavior) and within-country (speaking a certain type of language affects national values and norms), but all the data are observational and all the comparisons are between people and between countries. Thus, any causal interpretations are tenuous at best.

So we can start by rewriting the above abstract in descriptive terms. I’ll just repeat the empirical parts, and for convenience I’ll put my changes in bold:

This paper empirically studies the correlation of human capital with grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. . . Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative correlations of pronoun‐drop languages with outcomes of interest after adjusting for various demographic variables. . . . speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the correlation is substantial, particularly among females.

OK, that helps a little.

Now we have to dig in a bit more. First, what’s a pronoun-drop language? Or, more to the point, which languages have pronoun drop and which don’t? I looked through the paper for a list of these languages or a map of where they are spoken. I didn’t see such a list or map, so I went to Wikipedia and found this:

Among major languages, two of which might be called a pro-drop language are Japanese and Korean (featuring pronoun deletion not only for subjects, but for practically all grammatical contexts). Chinese, Slavic languages, and American Sign Language also exhibit frequent pro-drop features. In contrast, non-pro-drop is an areal feature of many northern European languages (see Standard Average European), including French, (standard) German, and English. . . . Most Romance languages (with the notable exception of French) are often categorised as pro-drop too, most of them only in the case of subject pronouns . . . Among the Indo-European and Dravidian languages of India, pro-drop is the general rule . . . Outside of northern Europe, most Niger–Congo languages, Khoisan languages of Southern Africa and Austronesian languages of the Western Pacific, pro-drop is the usual pattern in almost all linguistic regions of the world. . . . In many non-pro-drop Niger–Congo or Austronesian languages, like Igbo, Samoan and Fijian, however, subject pronouns do not occur in the same position as a nominal subject and are obligatory, even when the latter is present. . . .

Hmmmm, now things don’t seem so clear. Much will depend on how the languages are categorized.

The next thing we need, after we have a handle on the data, is a scatterplot. Actually a bunch of scatterplots. A scatterplot for each within-country analysis and a scatterplot for the between-country analysis. Outcome of interest on y-axis, predictor of interest on x-axis. OK, the within-country data will have to be plotted in a different way because the predictor and outcome are discrete, but something can be done there.

The point is, we need to see what’s going on. In the within-country analysis, where do we see this correlation and where do we not see it? In the between-country analysis, what countries are driving the correlation?
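For the between-country version, a minimal R sketch might look something like this. The data frame and its columns are hypothetical stand-ins, not the paper’s data; the point is just to label the countries so we can see who is driving any pattern:

```r
# Hypothetical between-country scatterplot: enrollment rate vs. an indicator
# for whether the dominant language allows pronoun drop. All values here are
# simulated stand-ins for the paper's country-level data.
set.seed(1)
countries <- data.frame(
  code         = sprintf("C%02d", 1:40),   # stand-in country codes
  pronoun_drop = rbinom(40, 1, 0.5),       # 1 = dominant language drops pronouns
  enrollment   = runif(40, 40, 100)        # secondary school enrollment rate (%)
)

# Jitter the binary predictor so labels don't overprint, and label each point
# so we can see which countries sit where.
x_jit <- jitter(countries$pronoun_drop, amount = 0.08)
plot(x_jit, countries$enrollment, type = "n", xaxt = "n",
     xlab = "Dominant language allows pronoun drop",
     ylab = "Secondary school enrollment rate (%)")
axis(1, at = c(0, 1), labels = c("No", "Yes"))
text(x_jit, countries$enrollment, labels = countries$code, cex = 0.7)
```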

Again, the analysis is all descriptive, and that’s fine, but the point is we need to understand what we’re describing.

I have no idea if the causal claims in this paper are true—given what I’ve seen so far, I see no particular reason to believe the claims. But, in any case, if these patterns are interesting—and I have no idea on that either—then they’re worth understanding. The regression won’t give us understanding; it just chews up the data and gives meaningless claims such as “we find that the magnitude of the effect is substantial and slightly larger for women. Specifically, women who speak a pronoun drop language are 9‐11 percentage points less likely to have completed secondary or tertiary education than women who speak a non‐pronoun drop language. For men, the probability is 8‐10 percentage points.” That way lies madness. We—Science—can do better.

P.S. I scrolled down to the end of the paper and found this sentence which begins the final footnote:

Pronoun drop rules are not perfect measures of ancient collectivism.

Ya think? In all seriousness, who could think that pronoun drop rules are any sort of measure of “ancient collectivism” at all? As Bertrand Russell said, this is one of those views which are so absurd that only very learned men could possibly adopt them.

Post-Hoc Power PubPeer Dumpster Fire

We’ve discussed this one before (original, polite response here; later response, after months of frustration, here), but it keeps on coming.

Latest version is this disaster of a paper which got shredded by a zillion commenters on PubPeer. There’s lots of incompetent stuff out there in the literature—that’s the way things go; statistics is hard—but, at some point, when enough people point out your error, I think it’s irresponsible to keep on with it. Ultimately our duty is to science, not to our individual careers.

Olivia Goldhill and Jesse Singal report on the Implicit Association Test

A psychology researcher whom I don’t know writes:

In case you aren’t already aware of it, here is a rather lengthy article pointing out challenges to the Implicit Association Test.

What I found disturbing was this paragraph:

Greenwald explicitly discouraged me from writing this article. ‘Debates about scientific interpretation belong in scientific journals, not popular press,’ he wrote. Banaji, Greenwald, and Nosek all declined to talk on the phone about their work, but answered most of my questions by email.

This attitude seems similar to the one you have pointed out in the past, wherein certain professors (and sometimes editors) seem utterly unwilling to even countenance challenges or be open to debate about their work. This attitude strikes me as very unscientific. Oftentimes “outsiders” can recognize deficiencies where “insiders” cannot, because they come at things with a very different point of view.

My reply: I thought the linked news article, by Olivia Goldhill, was excellent.

I’ve been skeptical of the implicit association test for a long time; see for example this from 2008, long before I’d heard about any replication crisis in psychology or elsewhere.

And I agree that it’s disturbing when people say, “Debates about scientific interpretation belong in scientific journals, not popular press.” Scientists have no problem with their work being uncritically discussed in the popular press, and they have no problem with the popular press speculating on the real-world implications of their important work. So why is the press suddenly shut out when the explorations turn critical? Especially given that Goldhill’s “popular press” article is much more thoughtful and sophisticated than most journal articles I’ve seen on these topics.

My correspondent replied:

Agreed! Imagine if this attitude had prevailed in the 1920s and Walter Lippmann had not been able to criticize the interpretation and use of IQ tests.

And then this, which really says it all:

In the off chance you mention this in your blog, please don’t mention who sent it to you—I don’t want to accidentally embroil myself in any controversy.

P.S. More here in this hard-hitting piece by Jesse Singal, including these bits:

The problem, as I [Singal] showed in a lengthy rundown of the many, many problems with the test published this past January, is that there’s very little evidence to support that claim that the IAT meaningfully predicts anything. In fact, the test is riddled with statistical problems . . .

One striking thing about the process of reporting that article was the extent to which Banaji tried to smear her critics . . . She also accused the test’s critics of having a “pathological focus” on black-white race relations and the black-white IAT for reasons that “will need to be dealt with by them in the presence of their psychotherapists or church leaders.”

This is the definition of a derailing tactic — shift the focus from critiques of the IAT itself, some of which in this case appeared in a flagship social-psych journal, to the ostensible moral and psychological failings of the critiquers.

Yes, I hate that tactic.

Singal continues:

The idea that journalists shouldn’t write about scientific controversies would have been highly questionable even before the replication crisis exploded onto the scene, but it’s hard to fathom why anyone would take this argument seriously in 2017. . . . Greenwald, of course, doesn’t appear to have any problems with positive coverage of the IAT.

And he concludes:

Society desperately needs more open scrutiny of scientific claims, not less, whether in scientific journals, the media, or anywhere else. Especially when it comes to claims that seem to change every two years.

I agree.

A thought on Bayesian workflow: calculating a likelihood ratio for data compared to peak likelihood.

Daniel Lakeland writes:

Ok, so it’s really deep into the comments and I’m guessing there’s a chance you will miss it so I wanted to point at my comments here and here.

In particular, the second one, which suggests something that it might be useful to recommend for Bayesian workflows: calculating a likelihood ratio for data compared to peak likelihood.

I imagine in Stan using generated quantities to calculate say the .001, .01, .1, .25, .5 quantiles of log(L(Data)/Lmax) or something like that and using this as a measure of model misfit on a routine basis. I think it would be useful to know for example that for a given posterior draw from the parameters, the least likely data points are no less than say 10^-4 times as likely as the data value at the mode of the likelihood, and you’d definitely like to know if some percentage of the data is 10^-37 times less likely 😉 that would flag some serious model mis-fitting.

Since the idea deserves elaboration, but the comments on this blog post are sort of played out, what do you think I should do with this idea? Is there a productive place to discuss it or maybe set up some kind of example vignette or something?

I don’t have the energy to think this through so I thought I’d just post it here. My only quick thought is that the whole peak likelihood thing can’t work in general because (a) the likelihood can blow up, and (b) the location of the peak likelihood can be super-noisy at times. So I’d replace “peak likelihood” with something using an informative prior. But maybe there’s something to the rest of this?
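For anyone who wants to play with the idea, here’s a self-contained R sketch of the diagnostic Lakeland describes. He imagines computing it in Stan’s generated quantities; this version does the same arithmetic in R from posterior draws, and the model, data, and draws below are all made up for illustration:

```r
# Toy version of the suggested diagnostic: for each posterior draw, compare
# each data point's likelihood to the likelihood at the most-likely data
# value, then look at low quantiles of that log ratio. Very negative values
# flag data points that are orders of magnitude less likely than the peak,
# i.e., serious model misfit.
set.seed(42)
y <- c(rnorm(50, 0, 1), 8)   # fake data from a normal model, plus one gross outlier

# Stand-in for posterior draws of (mu, sigma) from a fitted normal model.
draws <- data.frame(mu    = rnorm(200, 0, 0.15),
                    sigma = abs(rnorm(200, 1, 0.1)))

log_ratio_q <- t(apply(draws, 1, function(d) {
  log_lik  <- dnorm(y, d["mu"], d["sigma"], log = TRUE)        # log L(y_i)
  log_peak <- dnorm(d["mu"], d["mu"], d["sigma"], log = TRUE)  # log L at the likelihood's mode
  quantile(log_lik - log_peak, probs = c(0.001, 0.01, 0.1, 0.25, 0.5))
}))

# Averaged over draws: the lowest quantiles are hugely negative here because
# of the planted outlier, which is exactly the kind of flag being proposed.
round(colMeans(log_ratio_q), 1)
```

In a real workflow you would replace the fake draws with posterior draws from your fitted model, and, per the caveat above, you might anchor the comparison with an informative prior rather than the raw peak likelihood.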

Brakes

So. I noticed my rear brake wasn’t really doing anything. If I squeezed really hard, I could slow down, but not enough to stop going down a steep hill. No big deal—it’s the front brake that really matters, right?—but just for safety’s sake I went to the bike store one day and they replaced the pads so the brake works again. And, hey—it really works! I hadn’t realized how effective the brake can be when it’s fully operational. Good to know.