This one goes in the Zombies category, for sure.

Paul Alper writes:

I was in my local library and I came across this in Saturday’s WSJ:

The Math Behind Successful Relationships

Nearly 30 years ago, a mathematician and a psychologist teamed up to explore one of life’s enduring mysteries: What makes some marriages happy and some miserable?

The psychologist, John Gottman, wanted to craft a tool to help him better counsel troubled couples. The mathematician, James Murray, specialized in modeling biological processes. . . .

Hey, it’s the bullshit asymmetry principle!

Some background on this, from 2009 and from 2010:

Shooting down B.S. claims about divorce predictions, part 2 (Somewhere, Karl Popper is smiling ruefully)

Last year, we heard about a “maths expert” and Oxford University prof who could predict divorces “with 94 per cent accuracy. . . His calculations were based on 15-minute conversations between couples.”

At the time, I expressed some skepticism because, amid all the news reports, I couldn’t find any description of exactly what they did. Also, as a statistician, I have some sense of the limitations of so-called “mathematical models” (or, worse, “computer models”).

Then today I ran across this article from Laurie Abraham shooting down this research in more detail . . .

The garden of forking paths

Bert Gunter points us to this editorial:

So, researchers using these data to answer questions about the effects of technology [screen time on adolescents] need to make several decisions. Depending on the complexity of the data set, variables can be statistically analysed in trillions of ways. This makes almost any pattern of results possible. As a result, studies have suggested both the existence of and the lack of an association between screen time and well-being, even when analysing the same data set. Naturally, it’s the research that highlights possible dangers that receives the most public attention and helps to set the policy agenda.

It’s the multiverse. Good to see people recognizing this. As always, I think the right way to go is not to apply some sort of multiple comparison correction or screen for statistical significance or preregister or otherwise choose some narrow subset of results to report. Instead, I recommend studying all comparisons of interest using a multilevel model and displaying all these inferences together, accepting that there will be uncertainty in conclusions.
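To make that concrete, here is a minimal sketch in R using rstanarm of what I have in mind. The data frame and variable names (d, wellbeing, screen_time, outcome) are hypothetical placeholders, not the data set discussed in the editorial; the point is just that a single multilevel fit gives you all the outcome-specific comparisons at once, with their uncertainty, instead of one cherry-picked specification.

```r
library(rstanarm)

# Hypothetical long-format data: one row per adolescent x outcome, with
# columns `wellbeing` (standardized score), `screen_time`, and `outcome`
# (which well-being measure the row comes from).
fit <- stan_lmer(wellbeing ~ screen_time + (1 + screen_time | outcome),
                 data = d)

# Display all the partially pooled screen-time slopes together, with their
# uncertainty, rather than screening on statistical significance:
plot(fit, regex_pars = "screen_time")
print(fit, digits = 2)
```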

“This is a case where frequentist methods are simple and mostly work well, and the Bayesian analogs look unpleasant, requiring inference on lots of nuisance parameters that frequentists can bypass.”

Nick Patterson writes:

I am a scientist/data analyst, still working, who has been using Bayesian methods since 1972 (getting on for 50 years). I was initially trained at the British code-breaking establishment GCHQ, by intellectual heirs of Alan Turing.

I’ve been accused of being a Bayesian fanatic, but in fact a good deal of my work is frequentist—either because this is easier and “good enough” or because specifying a decent model seems too hard. Currently I work in genetics, and here is an example of the latter problem. We want to know if there is strong evidence against a null that 2 populations have split off with basically no gene-flow after the split. Already this is a bit hard to handle with Bayesian methods, as we haven’t really got any kind of prior on how much flow might occur. But it is even harder than that. It turns out that the distribution of the relevant statistics depends on the extent of “LD” in the genome, and the detailed structure of LD is uncertain. [LD is jargon for non-independence of close genomic regions.] I use a “block jackknife” which (largely) deals with this issue but is a frequentist technique. This is a case where frequentist methods are simple and mostly work well, and the Bayesian analogs look unpleasant, requiring inference on lots of nuisance parameters that frequentists can bypass.

I’d love to have a dialog here. In general I feel a lot of practical statistical problems/issues are messier than the textbook or academic descriptions imply.

My reply:

I don’t have experience in this genetics problem, but speaking in general terms I think the Bayesian version with lots of so-called nuisance parameters can work fine in Stan. The point is that by modeling these parameters you can do better than methods that don’t model them. Or, to put it another way, how does the non-Bayesian method finesse those “nuisance parameters”? If this is done by integrating them out, then you already have a probability model for these parameters, and you’re already doing some form of Bayesian inference, so then it just comes down to computation. If you’re bypassing the nuisance parameters without modeling them, then I suspect you’re throwing away some information.
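Here is a minimal sketch, in R with rstanarm, of what “modeling the nuisance parameters” can look like. This is not the genetics analysis Patterson describes; the data frame and variable names (blocks, y, block) are made up for illustration. The block-level intercepts play the role of nuisance parameters, and MCMC averages over them, which is the Bayesian analog of integrating them out.

```r
library(rstanarm)

# Hypothetical data: one row per genomic block, with a statistic `y` and a
# block label `block`.  The varying intercepts are the nuisance parameters.
fit <- stan_lmer(y ~ 1 + (1 | block), data = blocks)

# Inference for the quantity of interest (here the overall intercept)
# automatically propagates the uncertainty in the nuisance parameters:
summary(fit, pars = "(Intercept)", probs = c(0.025, 0.975))
```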

This is not to say that the existing methods have problems. If your existing methods work well, fine. The statement, though, is that the existing methods “mostly” work well. So maybe the starting point is to focus on the problems where the existing methods don’t work well, as these are the problems where an investment in Bayesian modeling could pay off.

And recall footnote 1 of this paper.

Just forget the Type 1 error thing.

John Christie writes:

I was reading this paper by Habibnezhad, Lawrence, & Klein (2018) and came across the following footnote:

In a research program seeking to apply null-hypothesis testing to achieve one-off decisions with regard to the presence/absence of an effect, a flexible stopping-rule would induce inflation of the Type I error rate. Although our decision to double the N from 20 to 40 to reduce the 95% CI is not such a flexible stopping rule, it might increase the Type I error rate. That noted, we are not proposing any such one-off decisions, but instead seek to contribute to the cumulative evidence of the scientific process. Those seeking such decisions may consider the current report exploratory rather than confirmatory. (fn 2)

Given the recent strong recommendations by many against adding participants after looking at the result, I wonder if you feel the footnote is sufficient or if you wanted to comment on it on your blog.

My quick reply is that I hate this type 1 error thing.

Let me explain in the context of a simple example. Consider two classical designs:

1. N=20 experiment

2. N=2,000,000 experiment.

Both these have “type 1 error rates” of 0.05, but experiment #2 will be much more likely to give statistical significance. Who cares about the type 1 error rate? I don’t. The null hypothesis of zero effect and zero systematic error is always false.
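Here’s a quick sketch of that asymmetry in R. The effect size, standard deviation, and alpha level are hypothetical numbers chosen purely for illustration.

```r
# Both designs reject 5% of the time under an exactly true null, but with
# even a tiny nonzero effect the huge study almost always reaches
# "statistical significance."
power_of <- function(n, effect = 0.01, sigma = 1, alpha = 0.05) {
  se <- sigma / sqrt(n)               # standard error of the sample mean
  z <- qnorm(1 - alpha / 2)
  pnorm(effect / se - z) + pnorm(-effect / se - z)
}
power_of(20)     # about 0.05: barely above the nominal error rate
power_of(2e6)    # essentially 1: "significance" is all but guaranteed
```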

To put it another way: it’s completely fine to add participants after looking at the result. The goal should not be to get “statistical significance” or to get 95% intervals that exclude zero or whatever. Once you forget that, you can move forward.

But now let’s step back and consider the motivation for type 1 error control in the first place. The concern is that if you don’t control type 1 error, you’ll routinely jump to conclusions. I’d prefer to frame this in terms of type M (magnitude) and type S (sign) errors. I think the way to avoid jumping to unwarranted conclusions is by making each statement stand up on its own. To put it another way, I have no problem presenting a thousand 95% intervals, under the expectation that 50 will not contain the true value.
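Here is a minimal R sketch of the type S and type M calculations, adapted from the approach in Gelman and Carlin (2014). The assumed true effect and standard error are hypothetical inputs that you would replace with values relevant to your own problem.

```r
# Type S (sign) and type M (magnitude) error rates for a study design.
# D is an assumed true effect, s the standard error of its estimate.
retrodesign <- function(D, s, alpha = 0.05, n_sims = 1e5) {
  z <- qnorm(1 - alpha / 2)
  p_hi <- 1 - pnorm(z - D / s)        # P(estimate significant and positive)
  p_lo <- pnorm(-z - D / s)           # P(estimate significant and negative)
  power <- p_hi + p_lo
  type_s <- p_lo / power              # P(wrong sign | significant)
  est <- rnorm(n_sims, D, s)          # simulate replicated estimates
  sig <- abs(est) > s * z
  type_m <- mean(abs(est[sig])) / D   # exaggeration ratio among significant results
  c(power = power, type_s = type_s, type_m = type_m)
}
retrodesign(D = 0.1, s = 0.3)   # a small true effect measured noisily
```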

Concerned about demand effects in psychology experiments? Incorporate them into the design.

Johannes Haushofer sends along this article with Jonathan de Quidt and Christopher Roth, “Measuring and Bounding Experimenter Demand,” which begins:

We propose a technique for assessing robustness to demand effects of findings from experiments and surveys. The core idea is that by deliberately inducing demand in a structured way we can bound its influence. We present a model in which participants respond to their beliefs about the researcher’s objectives. Bounds are obtained by manipulating those beliefs with “demand treatments.” We apply the method to 11 classic tasks, and estimate bounds averaging 0.13 standard deviations, suggesting that typical demand effects are probably modest. We also show how to compute demand-robust treatment effects and how to structurally estimate the model.

I like the idea of measuring and understanding these through experimentation. This reminds me of an idea I had a while ago to reduce certain cognitive biases by changing the way that questions are asked to better match the way people think. Instead of hoping that a certain bias doesn’t exist (or, worse, engaging in dismissive argumentation when the possibility is suggested), you try to include it in your experiment.

Swimming upstream? Monitoring escaped statistical inferences in wild populations.

Anders Lamberg writes:

In my mails to you [a few years ago], I told you about the Norwegian practice of monitoring the proportion of escaped farmed salmon in wild populations. This practice results in a yearly updated list of the situation in each Norwegian salmon river (we have a total of 450 salmon rivers; not all of them are surveyed, but the most important ones are, more than 200 each year). There are several methods used to “extract” a sample and an estimate of the proportion of farmed salmon from each surveyed river, and the big discussion has been: What do these methods give us? In practice it boils down to a statistical question: What is the precision of the estimates?

As I mentioned before, the number you get from a survey will have both ecological and economic implications. If the calculated proportion of escaped farmed salmon is above a defined limit, it can tell you how the total productivity of the wild salmon population in that river is affected. Wrong numbers may lead to the conclusion that there is no problem when there is a problem that should be addressed. Proportions of farmed salmon above the critical limit may also, in the long run, lead to measures taken in the fish farming industry. New net pens that are more “escape safe,” reduction in the total volume of farmed fish produced, and tagging of all farmed fish in the net pens in order to track where escapees come from are all good measures, but they cost a lot of money. That’s where it gets political. And to be slightly more specific: we are not speaking of millions, we are speaking of billions.

In this drama you have several actors:
1. The government, democratically elected by the people. The ones who decide.
2. The government’s bureaucratic staff. The ones who give advice to the ones who decide.
3. The scientists. The ones who produce the numbers for the bureaucrats. But science in this context is also a business: a lot of private or semi-private institutions participate in projects to acquire the relevant numbers. In fact there are few, if any, government-employed scientists working with salmon in Norway, only researchers in private companies.
4. The landowners who own the rivers and earn money from sport-fishing tourism (too many escaped farmed salmon may result in fewer wild salmon).
5. The fish farmers who produce and sell farmed salmon.
6. The lobbyists who speak on behalf of the different economic interests:
a. The ones working for the fish farmers
b. The ones working for the landowners and sport-fishing tourism
7. Environmental activists:
a. Some work for the salmon itself and for the future of their children (they have no direct economic incentive)
b. Some work under a false flag for the different groups of scientists or for the industry (these lobbyists gain economically from their efforts)
8. The journalists:
a. The ones who have friends in the fish farming industry (they may even own shares in the fish farming companies)
b. The ones who have friends among the landowners; these journalists are often keen sport fishers themselves

The big drama is often “powered” by the scientists/researchers and a discussion among them about numbers and methods. The outcome of these discussions will decide which scientists get money and which do not. This also leads to different publications, each supporting a different group of scientists. One such publication appeared in late 2016, when one of the most recognised statisticians in Norway was asked to evaluate what could be inferred from estimates with wide confidence intervals, originating from a sampling method with few individual salmon sampled from each river. The conclusion of the report was that the wide confidence intervals made it impossible to use the results as a tool for managing either the fish farming industry or the wild salmon.

Although this highly qualified statistician had reached his conclusion, the scientists whose project portfolios were reduced by it refused to accept it. What followed was the big drama, in which journalists were fed information by the aggrieved scientists. The journalists were not able to understand what they actually wrote, but thought they were on safe ground because the whole story fit what they themselves already believed. The government was also not able to understand the statistical implications, but probably picked the conclusion that was in line with its overall goal, wild salmon or farming industry, depending on which political party was in power. At the same time they wanted to be sure that their conclusions would not backfire on them if new data emerged. This in turn led to meetings, some secret, some official. The drama contributed to a situation where a high-ranking politician had to leave his position as a minister. The confidence interval discussion was not the sole reason, but it certainly helped along his resignation.

So what can we learn from this? How can we establish a scientific foundation that works as a robust tool for the good of both business and nature? I think the answer lies in the statistics, in the correct interpretation, and in realising that the average researcher does not completely understand the tools he or she uses. It does not help that Neyman published the answer in 1937, because an average biologist is not able to understand what Neyman writes. That is why the responsibility lies with you professional statisticians. Here is what I feel is not communicated in a sufficiently clear way.

In my textbooks there is a huge difference between a point estimate and an interval estimate. The wild salmon / farmed salmon example illustrates what I mean. When you sample 50 salmon from a river and 5 of them turn out to be farmed salmon, the researcher publishes the value 10% farmed salmon. He or she also publishes a 95% confidence interval, but that interval is put in brackets, behind the 10%. It is left in the dark shadows of the middle value of the sampling distribution. It is here, I think, that the big mistake emerges. If you calculate a 95% confidence interval, the calculation is with reference to the method (Neyman 1937), not to the real value. What happens is that the biologists refer to the 10% as a point estimate. This may be formally right, but it is misleading. A point estimate is a different thing; if the 10% is a point estimate, it is a really bad one, with low probability of matching the real value. What we have actually done here is produce an interval estimate. It is correct that “our best guess” is 10%, but I think this value should never be reported. Only the interval and the confidence level should be reported. Politicians, researchers, journalists, and others who do not fully understand what the sample gives us will otherwise misunderstand.

It gets even worse if you plot the probability curve for the outcome of the sample of 50 salmon that includes the 5 farmed fish. The visualisation, a curve with the 10% in the middle, leads people to think that it is highly probable that the real value, the value that should guide our decisions, lies close to 10%. That is not what has actually been found. If you do the whole sampling once more, drawing 50 new salmon from that population, you are likely to get a very different result, maybe 1%. The new curve then paints a totally different picture. Those who are not able to read Neyman 1937 and fully grasp the concept (and that is the vast majority, including professors and the prime minister) will believe they have found something quite accurate. Unless they have seen the first curve, with 10% in the middle, and take into account that it comes from the same population, they will think that the new result, the 1%, comes from a totally different population. But it does not.

The other way of looking at it is to calculate a 50% (instead of 95%) confidence interval from the sample that gave us, for example, 10% farmed salmon. Most will understand that a 50% confidence interval is not something you would rely on to make serious political decisions. This interval would also have the value 10% in the middle. For a non-statistician there is no difference between the plotted 95% and 50% interval curves. Unless you refer to the interval itself, and only that, the result will not be understood.

I think that an average biologist (or other researcher) mixes up the concepts. They are used to measuring, for example, the body length of fish in a sample. They use sample statistics to get an estimate of the population mean length, calculate the distribution, and are quite confident that they have a good representation of the population mean. When they calculate a confidence interval from a sample proportion, they think the same way: the middle value of the distribution is close to the real value. If the sample size is large, there is of course no big problem. They do not see that wide confidence intervals must be interpreted in a different way.

I think the solution to this problem is that it “should be forbidden” to report the middle value of a confidence interval. Only the limits of the interval and the confidence level should be reported. This would force researchers to think the right way. You would also communicate better with the public, the journalists, and the politicians if the middle value of the confidence interval were never mentioned again. The middle value tells you very little in small samples where you only sample once: it comes from a random process, and it would very likely jump back and forth if you were able to sample many times.

This was quite radical, but have I misunderstood this completely?

My reply: I don’t know that it’s much of a solution to not report the middle of the interval. I say this for a few reasons: First, the endpoints of an interval will in general be noisier than the midpoint. Second, I think it’s a big mistake to make decisions or inferential summaries based on whether an interval excludes zero. Third, if you’re using classical nonregularized estimates, then the middle has problems but so do the endpoints; consider for example some of the estimated effects of early childhood intervention.

That said, I agree with the general points expressed above. I see two big issues:

1. Lots of people want certainty when it’s not appropriate. We have to have a way of saying something in between “I know nothing” and “I’m sure.”

2. When data are sparse, you should be able to do better using prior information, but there’s lots of resistance to doing that. (A quick numerical sketch follows below.)
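To make the sparse-data point concrete, here is a minimal R sketch using the 5-out-of-50 example from Lamberg’s letter. The Beta prior at the end is purely illustrative, an assumption invented for this example rather than anything estimated from Norwegian rivers; it just shows how prior information can tighten inferences when samples are small.

```r
# The classical 95% interval for 5 farmed salmon out of 50 sampled is wide:
x <- 5
n <- 50
binom.test(x, n)$conf.int                      # roughly 3% to 22%
binom.test(x, n, conf.level = 0.5)$conf.int    # the much narrower 50% interval

# Posterior 95% interval under a Beta(2, 20) prior.  The prior is an
# assumption made up for illustration, not an analysis of real rivers:
qbeta(c(0.025, 0.975), 2 + x, 20 + n - x)
```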

The Economist does Mister P

Elliott Morris points us to this magazine article, “If everyone had voted, Hillary Clinton would probably be president,” which reports:

Close observers of America know that the rules of its democracy often favour Republicans. But the party’s biggest advantage may be one that is rarely discussed: turnout is just 60%, low for a rich country. Polls show that non-voters—both people uninterested in voting and those blocked by legal or economic hurdles—mainly belong to groups that tend to back Democrats.

What would change if America became the 22nd country to make voting mandatory? To estimate non-voters’ views, The Economist used the Co-operative Congressional Election Study (CCES), a 64,600-person poll led by Harvard University. . . .

Using a method called “multilevel regression and post-stratification”, the relationships between demography and vote choices can be used to project state-level election results—and to estimate what might have happened in the past under different rules.

And, from Morris’s writeup of the methodology:

We set out to use statistics to provide an empirical answer to the following question: what would happen in American presidential elections if voting was mandatory, as it is in some other countries?

This set off a months-long inquiry that was much more difficult than we anticipated. . . .

We soon discovered that none of the commonly used computational tools in our arsenal would do the trick. Simple summary statistics and regression analysis were obviously not enough; although we could use public polling to make predictions for individual citizens, the country doesn’t elect its president by popular vote. Even popular machine-learning algorithms were not sufficiently suited for the task. . . .

Ultimately, what we needed was a technique to make predictions for each state under varying degrees of voter turnout, to figure out the electoral-college winner under a system of mandatory voting. The method would have to account for many factors, such as increasing turnout among minorities, who vote less often but lean to the left, and higher turnout among whites without degrees, who lean to the right. We also had to answer the crucial question of whether a voter and a non-voter with the same demographic profile would vote in similar ways (for the most part, they do). More questions popped up along the way.

OK, at this point I think you can anticipate what’s coming, but I’ll tell you anyway:

A solution was lurking in the background, but The Economist had never attempted it before: a statistical method, popular among leading quantitative social scientists, called “multi-level regression and post-stratification” (MRP, or “Mr P” among its super-fans). It involves combining national polls with information about individual voters to make predictions at different geographic levels. . . .

With the CCES alone, we could assess the relationship between demographics, turnout and vote choice. But due to small sample sizes in select states . . . we could not make reliable state-level projections. . . .

He even includes rstanarm code!
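For readers who haven’t seen this before, here is a rough sketch of the general MRP workflow in rstanarm. This is not The Economist’s actual model; the data frames and variable names (cces, cells, N, vote_rep, and so on) are hypothetical placeholders.

```r
library(rstanarm)

# Multilevel regression: model individual vote choice as a function of
# demographics, with varying intercepts for state and demographic groups.
fit <- stan_glmer(
  vote_rep ~ female + (1 | state) + (1 | age) + (1 | educ) + (1 | race),
  family = binomial(link = "logit"),
  data = cces
)

# Poststratification: predict support in every demographic cell, then
# average the cells within each state, weighted by census counts N.
cells$pred <- colMeans(posterior_epred(fit, newdata = cells))
state_est <- sapply(split(cells, cells$state),
                    function(cell) weighted.mean(cell$pred, cell$N))
```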

My reply:

Regarding the substantive question, the last time I looked at what might happen if everyone voted was in 2007. As I recall, a key conclusion was that higher turnout would not just change the Democrat/Republican split, it would also change what issues would get discussed in politics.

Regarding your grid of maps: I like this sort of thing, and I like that you used a bi-directional color scheme of the sort that we have used. My only recommendation is for you to put the labels on the top and left of the grid rather than putting a label on each map. See the grid of maps from this 2010 paper for an example. Our grid from that 2010 paper is not perfect—in particular, our state borders are distractingly dark—but at least you can get the point regarding the labeling of rows and columns.

I also liked your comment, “this is not an endeavour for Ockhamites; there is danger in being too simplistic.” Here’s something I wrote about this a while ago. Using a simplified model can be a sensible practical choice, and it should be understood as such.

Morris adds:

That Highton and Wolfinger article from 1999 [see link and discussion in the 2007 post here] was key to our approach. If we remove the assumption that non-voters behave like voters once we control for demographic and political variables, the whole thing falls apart. With our time constraints, we also found it impossible to suss out the downstream effects of universal turnout on things like party messaging and the median voter, but we took it that a shift toward Democrats implies a shift to the left. And more to your point, this would come with increased salience of issues that matter to poor and non-white Americans, too, I think. We did game out how Republicans might go about achieving electoral success again and found that increasing their margins among non-college whites gives the most bang for their buck, given the size of the group (about half of all voters, by our math).

Also relevant: The Electoral College magnifies the power of white voters.

From deviance, DIC, AIC, etc., to leave-one-out cross-validation

Maren Vranckx writes:

I am writing in connection with a post on your blog from 22 June 2011 about “Deviance, DIC, AIC, cross-validation, etc.” In that post, you mentioned that you and a student had worked on DIC convergence. Can you say more about how you did that research? Did you discover a reason for the slow convergence of DIC? Was the slow convergence specific to certain disease data? In which book is the example of needing a zillion iterations for DIC to converge mentioned? Have you done any more recent research on this?

My reply:

Some of our more recent thinking on this topic is here and here.
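For anyone who wants to try this out, here is a minimal sketch in R. The model and data (y, x, d) are hypothetical; the point is just that the approach we have been moving toward replaces DIC with approximate leave-one-out cross-validation, computed by Pareto-smoothed importance sampling via the loo package.

```r
library(rstanarm)

# Hypothetical model: fit with rstanarm, then compute PSIS-LOO.
fit <- stan_glm(y ~ x, data = d, family = gaussian())
loo_fit <- loo(fit)   # PSIS-LOO estimate of expected log predictive density
print(loo_fit)        # also reports Pareto k diagnostics for reliability
```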

It’s good to know that we’re making progress in our understanding.

Plaig!

Tom Scocca discusses some plagiarism that was done by a former New York Times editor:

There was no ambiguity about it; Abramson clearly and obviously committed textbook plagiarism. Her text lifted whole sentences from other sources word for word, or with light revisions, presenting the same facts laid out in the same order as in the originals. . . .

How did this arise? According to Scocca, “partly because she got significant facts wrong, in the galley version of her book . . .”

This reminds me of something that Thomas Basbøll and I noticed a while ago, that plagiarism and factual error seem to go together:

We propose that plagiarism is a statistical crime. It involves the hiding of important information regarding the source and context of the copied work in its original form. Such information can dramatically alter the statistical inferences made about the work.

I think it’s no coincidence that Abramson had errors as well as plagiarism: When you as an author destroy the paper trail, it’s harder for you as well as others to keep track of what’s really happening. A similar thing happened with Brian Wansink in his notorious retracted papers: the connections between his actual data and what he reported were so tangled that it seemed that he himself had no idea what was real and what was not.

It’s hard enough to keep straight what is happening even if you carefully document everything. So no surprise that if you can’t even source the writing in your book, you’ll get the facts wrong.

Remember what happened to Ed Wegman when he copied from Wikipedia and garbled the results?

P.S. Scocca quotes a law professor defending the plagiarism, which is pretty funny given that so many prominent law professors have themselves been accused of plagiarism (for example, here, here, and here), without it seeming to do much to their careers.

I think Scocca nails it when he writes:

Whether or not there are separate classes of writing, with separate value, there are definitely separate classes of writers.

It’s not the status of the words that defines the offense, it’s the status of the person who originally wrote the words compared to the person who copied them. . . . Fareed Zakaria, Doris Kearns Goodwin, Stephen Ambrose, Juan Williams—above a certain level, a public figure is immune to any real career consequences for stealing work from the lower castes.

We saw something similar in the case of mathematician Christian Hesse, who took material written by others and didn’t credit them. This was not plagiarism in the strict sense, because it was not the exact words that Chrissy copied, just the content. But as with Abramson, and for that matter Wegman, the consequence was the same: by copying, Chrissy introduced errors into his writing. He didn’t take responsibility, and by not clearly labeling his sources he made it that much more difficult for readers to track down these stories.

“Developing Digital Privacy: Children’s Moral Judgments Concerning Mobile GPS Devices”

Recently in the sister blog:

New technology poses new moral problems for children to consider. We examined whether children deem object tracking with a mobile GPS device to be a property right. In three experiments, 329 children (4-10 years) and adults were asked whether it is acceptable to track the location of either one’s own or another person’s possessions using a mobile GPS device. Young children, like adults, viewed object tracking as relatively more acceptable for owners than nonowners. However, whereas adults expressed negative evaluations of someone tracking another person’s possessions, young children expressed positive evaluations of this behavior. These divergent moral judgments of digital tracking at different ages have profound implications for how concepts of digital privacy develop and for the digital security of children.

Video here.