Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good (Part 2)

Part 1 was here.

And here’s Part 2. Jordan Anaya reports:

Uli Schimmack posted this on facebook and twitter.

I [Anaya] was annoyed to see that it mentions “a handful” of unreliable findings, and points the finger at fraud as the cause. But then I was shocked to see the 85% number for the Many Labs project.

I’m not that familiar with the project, and I know there is debate on how to calculate a successful replication, but they got that number from none other than the “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” people, as Sanjay Srivastava discusses here.

Schimmack identifies the above screenshot as being from Myers and Twenge (2018); I assume it’s this book, which has the following blurb:

Connecting Social Psychology to the world around us. Social Psychology introduces students to the science of us: our thoughts, feelings, and behaviors in a changing world. Students learn to think critically about everyday behaviors and gain an appreciation for the world around us, regardless of background or major.

But according to Schimmack, there’s “no mention of a replication failure in the entire textbook.” That’s fine—it’s not necessarily the job of an intro textbook to talk about ideas that didn’t work out—but then why mention replications in the first place? And why try to minimize it by talking about “a handful of unreliable findings”? A handful, huh? Who talks like that. This is a “Politics and the English Language” situation, where sloppy language serves sloppy thinking and bad practice.

Also, to connect replication failures to “fraud” is just horrible, as it’s consistent with two wrong messages: (a) that to point out a failed replication is to accuse someone of fraud, and (b) that, conversely, honest researchers can’t have replication failures. As I’ve written a few zillion times, honesty and transparency are not enuf. As I wrote here, it’s a mistake to focus on “p-hacking” and bad behavior rather than the larger problem of researchers expecting routine discovery.

So, the blurb for the textbook says that students learn to think critically about everyday behaviors—but they won’t learn to think critically about published research in the field of psychology.

Just to be clear: I’m not saying the authors of this textbook are bad people. My guess is they just want to believe the best about their field of research, and enough confused people have squirted enough ink into the water to confuse them into thinking that the number of unreliable findings really might be just “a handful,” that 85% of experiments in that study replicated, that the replication rate in psychology is statistically indistinguishable from 100%, that elections are determined by shark attacks and college football games, that single women were 20 percentage points more likely to support Barack Obama during certain times of the month, that elderly-priming words make you walk slower, that Cornell students have ESP, etc etc etc. There are lots of confused people out there, not sure where to turn, so it makes sense that some textbook writers will go for the most comforting possible story. I get it. They’re not trying to mislead the next generation of students; they’re just doing their best.

There are no bad guys here.

Let’s just hope 2019 goes a little better.

A good start would be for the authors of this book to send a public note to Uli Schimmack thanking them for pointing out their error, and then replacing that paragraph with something more accurate in their next printing. They could also write a short article for Perspectives on Psychological Science on how they got confused on this point, as this could be instructive for other teachers of psychology. They don’t have to do this. They can do whatever they want. But this is my suggestion for how they could get 2019 off to a good start, in one small way.


Combining apparently contradictory evidence

I want to write a more formal article about this, but in the meantime here’s a placeholder.

The topic is the combination of apparently contradictory evidence.

Let’s start with a simple example: you have some ratings on a 1-10 scale. These could be, for example, research proposals being rated by a funding committee, or, umm, I dunno, gymnasts being rated by Olympic judges. Suppose there are 3 judges doing the ratings, and consider two gymnasts: one receives ratings of 8, 8, 8; the other is rated 6, 8, 10. Or, forget about ratings, just consider students taking multiple exams in a class. Consider two students: Amy, whose three test scores are 80, 80, 80; and Beth, who had scores 80, 100, 60. (I’ve purposely scrambled the order of those last three so that we don’t have to think about trends. Forget about time trends; that’s not my point here.)

How to compare those two students? A naive reader of test scores will say that Amy is consistent while Beth is flaky; or you might even say that you think Beth is better as she has a higher potential. But if you have some experience with psychometrics, you’ll be wary of overinterpreting results from three exam scores. Inference about an average from N=3 is tough; inference about variance from N=3 is close to impossible. Long story short: from a psychometrics perspective, there’s very little you can say about the relative consistency of Amy and Beth’s test-taking based on just three scores.
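To see just how little three scores can tell you, here is a minimal simulation sketch (mine, not from any formal psychometric analysis); the test-to-test standard deviation of 15 points is a made-up number for illustration.

```python
# Minimal simulation sketch (not from the post): how much do three test scores
# tell you about a student's consistency? Suppose both students' scores come
# from the same process, with a hypothetical test-to-test sd of 15 points.
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_sd = 80, 15

# Sample standard deviations computed from N = 3 scores, many times over
scores = rng.normal(true_mean, true_sd, size=(100_000, 3))
sds = scores.std(axis=1, ddof=1)

print("5th and 95th percentiles of the N=3 sd estimate:",
      np.percentile(sds, [5, 95]).round(1))
# The estimates span nearly an order of magnitude, so Amy's observed spread
# of 0 versus Beth's of 20 is very weak evidence about who is "really" more
# consistent.
```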

Academic researchers will recognize this problem when considering reviews of their own papers that they’ve submitted to journals. When you send in a paper, you’ll typically get a few reviews, and these reviews can differ dramatically in their messages.

Here’s a hilarious example supplied to me by Wolfgang Gaissmaier and Julian Marewski, from reviews of their 2011 article, “Forecasting elections with mere recognition from small, lousy samples: A comparison of collective recognition, wisdom of crowds, and representative polls.”

Here are some positive reviewer comments:

– This is a very interesting piece of work that raises a number of important questions related to public opinion. The major finding — that for elections with large numbers of parties, small non-probability samples looking only at party name recognition do as well as medium-sized probability samples looking at voter intent — is stunning.

– There is a lot to like about this short paper… I’m surprised by the strength of the results… If these results are correct (and I have no real reason to suspect otherwise), then the authors are more than justified in their praise of recognition-based forecasts. This could be an extremely useful forecasting technique not just for the multi-party European elections discussed by the authors, but also in relatively low-salience American local elections.

– This is a concise, high-quality paper that demonstrates that the predictive power of (collective) recognition extends to the important domain of political elections.

And now the fun stuff. The negative comments:

– This is probably the strangest manuscript that I have ever been asked to review… Even if the argument is correct, I’m not sure that it tells us anything useful. The fact that recognition can be used to predict the winners of tennis tournaments and soccer matches is unsurprising – people are more likely to recognize the better players/teams, and the better players/teams usually win. It’s like saying that a football team wins 90% (or whatever) of the games in which it leads going into the fourth quarter. So what?

– To be frank, this is an exercise in nonsense. Twofold nonsense. For one thing, to forecast election outcomes based on whether or not voters recognize the parties/candidates makes no sense… Two, why should we pay any attention to unrepresentative samples, which is what the authors use in this analysis? They call them, even in the title, “lousy.” Self-deprecating humor? Or are the authors laughing at a gullible audience?

So, their paper is either “a very interesting piece of work” whose main finding is “stunning”—or it is “an exercise in nonsense” aimed at “a gullible audience.”


“Check yourself before you wreck yourself: Assessing discrete choice models through predictive simulations”

Timothy Brathwaite sends along this wonderfully-titled article (also here, and here’s the replication code), which begins:

Typically, discrete choice modelers develop ever-more advanced models and estimation methods. Compared to the impressive progress in model development and estimation, model-checking techniques have lagged behind. Often, choice modelers use only crude methods to assess how well an estimated model represents reality. Such methods usually stop at checking parameter signs, model elasticities, and ratios of model coefficients. In this paper, I [Brathwaite] greatly expand the discrete choice modelers’ assessment toolkit by introducing model checking procedures based on graphical displays of predictive simulations. . . . a general and ‘semi-automatic’ algorithm for checking discrete choice models via predictive simulations. . . .

He frames model checking in terms of “underfitting,” a connection I’ve never seen before but which makes sense. To the extent that there are features in your data that are not captured in your model—more precisely, features that don’t show up, even in many different posterior predictive simulations from your fitted model—then, yes, the model is underfitting the data. Good point.
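Here’s a minimal sketch of the kind of predictive-simulation check Brathwaite describes, applied to a toy binary logit model rather than to his models or code; the data are synthetic, and the posterior is approximated by the asymptotic normal distribution of the maximum likelihood estimate rather than by MCMC.

```python
# Toy predictive-simulation check for a binary logit choice model.
# Synthetic data; posterior approximated by the asymptotic normal of the MLE.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Fake data: choose option 1 vs 0 as a function of a cost difference
n = 1000
cost_diff = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), cost_diff])
true_beta = np.array([0.5, -1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Fit the logit by maximum likelihood
def neg_loglik(beta):
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

beta_hat = minimize(neg_loglik, x0=np.zeros(2)).x

# Approximate posterior covariance: inverse Fisher information
p_hat = 1 / (1 + np.exp(-X @ beta_hat))
cov_hat = np.linalg.inv((X * (p_hat * (1 - p_hat))[:, None]).T @ X)

# Test statistic: share choosing option 1 among high-cost-difference cases
high = cost_diff > 1
T_obs = y[high].mean()

# Simulate replicated datasets and recompute the statistic each time
T_rep = np.empty(1000)
for s in range(1000):
    beta_draw = rng.multivariate_normal(beta_hat, cov_hat)
    y_rep = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_draw)))
    T_rep[s] = y_rep[high].mean()

# In a real check you'd plot a histogram of T_rep with T_obs marked;
# here we just print where the observed value falls.
print(f"observed share = {T_obs:.3f}; "
      f"replicated 2.5%-97.5% interval = {np.percentile(T_rep, [2.5, 97.5]).round(3)}")
```

Because the fitted model here is also the model that generated the data, the observed statistic should land comfortably inside the replicated distribution; underfitting would show up as an observed value out in the tails of the simulations.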


Using multilevel modeling to improve analysis of multiple comparisons

Justin Chumbley writes:

I have mused on drafting a simple paper inspired by your paper “Why we (usually) don’t have to worry about multiple comparisons”.

The initial idea is simply to revisit frequentist “weak FWER” or “omnibus tests” (which assume the null everywhere), connecting it to a Bayesian perspective. To do this, I focus on the distribution of the posterior maximum or extrema (not the maximum a posteriori point estimate) of the joint posterior, given a data-set simulated under the omnibus null hypothesis. This joint posterior may be, for example, defined on a set of a priori exchangeable random coefficients in a multilevel model: its maximum just encodes my posterior belief in the magnitude of the largest of those coefficients (which “should” be zero for this data) and can be estimated for example by MCMC. The idea is that hierarchical Bayesian extreme values helpfully contract to zero with the number of coefficients in this setting, while non-hierarchical frequentist extreme values increase. The latter is more typically quantified by other “error” parameters such as FWER “multiple comparisons problem” or MSE “overfitting”. Thus, this offers a clear way to show that hierarchical inference can automatically control the (weak) FWER, without Bonferroni-style adjustments to the test threshold. Mathematically, I imagine some asymptotic – in the number of coefficients – argument for this behavior of the maxima, that I would need time or collaboration to formalize (I am not a mathematician by any means). In any case, the intuition is that because posterior coefficients are all increasingly shrunk, so is their maximum. I have chosen to study the maximum because it is applicable across the very different hierarchical and frequentist models used in practice in the fields I work on (imaging, genomics): spatial, cross-sectional, temporal, neither or both. For example, the posterior maximum is defined for a discretely indexed, exchangeable random process, or a continuously-indexed, non-stationary process. As a point of interest, the frequentist distribution of spatial maxima is used for standard style multiple-comparisons adjusted p-values in mainstream neuroimaging, e.g. SPM.

I am very keen to learn more about the possible pros or cons of the idea above.
– Its “novelty”
– How it fares relative to alternative Bayesian omnibus “tests”, e.g. based on comparison of posterior model probabilities for an omnibus null model – a degenerate spike prior – versus some credible alternative model.
– How generally it might be formalized.
– How to integrate type II error and bias into the framework.
… and any more!

My reply:

This idea is not really my sort of thing—I’d prefer a more direct decision analysis on the full posterior distribution. But given that many researchers are interested in hypothesis testing but still want to do something better than classical null hypothesis significance testing, I thought there might be interest in these ideas. So I’m sharing them with the blog readership. Comment away!
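P.S. To give a flavor of the behavior Chumbley describes, here is a crude simulation sketch of my own (a quick empirical-Bayes normal-normal calculation, not his proposed framework): under the omnibus null, the largest raw estimate keeps growing as you add coefficients, while the largest partially pooled estimate contracts toward zero.

```python
# Crude illustration (not Chumbley's method): max of raw vs. partially
# pooled estimates when every true coefficient is zero.
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                               # known standard error of each estimate

for J in (10, 100, 1_000, 10_000):
    y = rng.normal(0.0, sigma, J)         # noisy estimates under the omnibus null
    # Normal-normal model with prior mean 0: estimate the between-coefficient
    # variance tau^2 by a simple method-of-moments step, truncated at zero.
    tau2 = max(np.mean(y**2) - sigma**2, 0.0)
    shrinkage = tau2 / (tau2 + sigma**2)  # posterior-mean multiplier
    theta_pooled = shrinkage * y
    print(f"J = {J:6d}   max raw estimate = {y.max():5.2f}   "
          f"max pooled estimate = {theta_pooled.max():5.2f}")
```

The raw maximum grows roughly like sqrt(2 log J), while the pooled maximum stays near zero, which is the contraction being pointed to.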


Back to the Wall

Jim Windle writes:

Funny you should blog about Jaynes. Just a couple of days ago I was looking for something in his book’s References/Bibliography (it, along with “Godel, Escher, Bach” and “Darwin’s Dangerous Idea”, has a bibliography which I find not just useful but entertaining), and ran across something I wanted to send you, but I was going to wait until I could track down a copy of the actual referenced paper. But since Jaynes is the current topic, here is the cited work, along with his comment relating to our previous exchange, which I thought might amuse you. From “References”:

Boring, E.G. (1955), ‘The present status of parapsychology’, Am. Sci., 43, 108-16
Concludes that the curious phenomenon to be studied is the behavior of parapsychologists. Points out that, having observed any fact, attempts to prove that no natural explanation of it exists are logically impossible; one cannot prove a universal negative (quantum theorists who deny the existence of causal explanations please take note).

And just for the record, I’m more comfortable with quantum uncertainty, to the extent I understand it, than Jaynes was. And I don’t fully agree about not being able to prove a negative. The ancient Greeks proved long ago that there’s no largest prime number. I guess you just have to be careful about how you define the negative.

Amusing, and of course it relates to some of our recent discussions about unreplicable work in the social and behavioral sciences, including various large literatures which seem to be based on little more than the shuffling of noise, the ability of certain theories to explain any possible patterns in data, and the willingness of journals to publish any sort of junk as long as it combines an attractive storyline with “p less than 0.05.”

It’s only been 63 years, I guess no reason to expect much progress!


What is probability?

This came up in a discussion a few years ago, where people were arguing about the meaning of probability: is it long-run frequency, is it subjective belief, is it betting odds, etc? I wrote:

Probability is a mathematical concept. I think Martha Smith’s analogy to points, lines, and arithmetic is a good one. Probabilities are probabilities to the extent that they follow the Kolmogorov axioms. (Let me set aside quantum probability for the moment.) The different definitions of probabilities (betting, long-run frequency, etc.) can be usefully thought of as models rather than definitions. They are different examples of paradigmatic real-world scenarios in which the Kolmogorov axioms (and thus probability) apply.

Probability is a mathematical concept. To define it based on any imperfect real-world counterpart (such as betting or long-run frequency) makes about as much sense as defining a line in Euclidean space as the edge of a perfectly straight piece of metal, or as the space occupied by a very thin thread that is pulled taut. Ultimately, a line is a line, and probabilities are mathematical objects that follow Kolmogorov’s laws. Real-world models are important for the application of probability, and it makes a lot of sense to me that such an important concept has many different real-world analogies, none of which are perfect.
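For reference, here are the axioms in question, for a probability space $(\Omega, \mathcal{F}, P)$:

$$
P(A) \ge 0 \ \text{for all } A \in \mathcal{F}, \qquad P(\Omega) = 1, \qquad
P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i) \ \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}.
$$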

We discuss some of these different models in chapter 1 of BDA.


“Thus, a loss aversion principle is rendered superfluous to an account of the phenomena it was introduced to explain.”

What better day than Christmas, that day of gift-giving, to discuss “loss aversion,” the purported asymmetry in utility, whereby losses are systematically more painful than gains are pleasant?

Loss aversion is a core principle of the heuristics and biases paradigm of psychology and behavioral economics.

But it’s been controversial for a long time.

For example, back in 2005 I wrote about the well-known incoherence that people express when offered small-scale bets. (“If a person is indifferent between [x+$10] and [55% chance of x+$20, 45% chance of x], for any x, then this attitude cannot reasonably be explained by expected utility maximization. The required utility function for money would curve so sharply as to be nonsensical (for example, U($2000)-U($1000) would have to be less than U($1000)-U($950)).”)
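Here is a minimal numerical sketch of that calibration argument (my arithmetic, not code from the 2005 post). Exact indifference at every x forces each successive $10 increment of utility to be 9/11 of the one before it, and iterating that recursion yields the comparison in the quote:

```python
# Numerical version of the calibration argument (a sketch, not the 2005 post's
# code). Indifference between x + $10 for sure and a 55% chance of x + $20
# (45% chance of x), at every x, implies
#   0.45 * [U(x+10) - U(x)] = 0.55 * [U(x+20) - U(x+10)],
# so each successive $10 utility increment is 9/11 of the previous one.
increments = [1.0]                     # utility gain from $0 to $10 (arbitrary units)
for _ in range(199):                   # increments covering $0 up to $2000
    increments.append(increments[-1] * 0.45 / 0.55)

def utility_gain(lo, hi):
    """U($hi) - U($lo), where lo and hi are multiples of $10."""
    return sum(increments[lo // 10 : hi // 10])

print("U($1000) - U($950)  =", utility_gain(950, 1000))
print("U($2000) - U($1000) =", utility_gain(1000, 2000))
# The second difference comes out smaller than the first, which is the
# "nonsensical" curvature described in the quote.
```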

When Matthew Rabin and I had (separately) published papers about this in 1998 and 2000, we’d attributed the incoherent risk-averse attitude at small scales to “loss aversion” and “uncertainty aversion.” But, as pointed out by psychologist Deb Frisch, it can’t be loss aversion, as the way the problem is set up above, no losses are involved. I followed up that “uncertainty aversion” could be logically possible but I didn’t find that labeling so convincing either; instead:

I’m inclined to attribute small-stakes risk aversion to some sort of rule-following. For example, it makes sense to be risk averse for large stakes, and a natural generalization is to continue that risk aversion for payoffs in the $10, $20, $30 range. Basically, a “heuristic” or a simple rule giving us the ability to answer this sort of preference question.

By the way, I’ve used the term “attitude” above, rather than “preference.” I think “preference” is too much of a loaded word. For example, suppose I ask someone, “Do you prefer $20 or [55% chance of $30, 45% chance of $10]?” If he or she says, “I prefer the $20,” I don’t actually consider this any sort of underlying preference. It’s a response to a question. Even if it’s set up as a real choice, where they really get to pick, it’s just a preference in a particular setting. But for most of these studies, we’re really talking about attitudes.

The topic came up again the next year, in the context of the (also) well-known phenomenon that, when it comes to political attitudes about the government, people seem to respond to the trend rather than the absolute level of the economy. Again, I felt that terms such as “risk aversion” and “loss aversion” were being employed as all-purpose explanations for phenomena that didn’t really fit these stories.

And then, in the midst of all that, David Gal published an article, “A psychological law of inertia and the illusion of loss aversion,” in the inaugural issue of the Journal of Judgment and Decision Making, saying:

The principle of loss aversion is thought to explain a wide range of anomalous phenomena involving tradeoffs between losses and gains. In this article, I [Gal] show that the anomalies loss aversion was introduced to explain — the risky bet premium, the endowment effect, and the status-quo bias — are characterized not only by a loss/gain tradeoff, but by a tradeoff between the status-quo and change; and, that a propensity towards the status-quo in the latter tradeoff is sufficient to explain these phenomena. Moreover, I show that two basic psychological principles — (1) that motives drive behavior; and (2) that preferences tend to be fuzzy and ill-defined — imply the existence of a robust and fundamental propensity of this sort. Thus, a loss aversion principle is rendered superfluous to an account of the phenomena it was introduced to explain.

I’d completely forgotten about this article until learning recently of a new review article by Gal and Derek Rucker, “The Loss of Loss Aversion: Will It Loom Larger Than Its Gain?”, making this point more thoroughly:

Loss aversion, the principle that losses loom larger than gains, is among the most widely accepted ideas in the social sciences. . . . The upshot of this review is that current evidence does not support that losses, on balance, tend to be any more impactful than gains.

But if loss aversion is unnecessary, why do psychologists and economists keep talking about it? Gal and Rucker write:

The third part of this article aims to address the question of why acceptance of loss aversion as a general principle remains pervasive and persistent among social scientists, including consumer psychologists, despite evidence to the contrary. This analysis aims to connect the persistence of a belief in loss aversion to more general ideas about belief acceptance and persistence in science.

In Table 1 of their paper, Gal and Rucker consider several phenomena, all of which are taken to provide evidence of loss aversion but can be easily explained in other ways. Here are the phenomena they talk about:

– Status quo bias

– Endowment effect

– Risky bet premium

– Hedonic impact ratings

– Sunk cost effect

– Price elasticity

– Equity risk premium

– Disposition effect

– Loss/gain framing.

The article also comes with discussions by Tory Higgins and Nira Liberman, and by Itamar Simonson and Ran Kivetz, along with a rejoinder by Gal and Rucker.


June is applied regression exam month!

So. I just graded the final exams for our applied regression class. Lots of students made mistakes which gave me the feeling that I didn’t teach the material so well. So I thought it could help lots of people out there if I were to share the questions, solutions, and common errors.

It was an in-class exam with 15 questions. I’ll post the questions and solutions, one at a time, for the first half of June, following the model of my final exam for Design and Analysis of Sample Surveys from a few years ago. Enjoy.


“When Both Men and Women Drop Out of the Labor Force, Why Do Economists Only Ask About Men?”

Dean Baker points to this column, where Gregory Mankiw writes:

With unemployment at 3.8 percent, its lowest level in many years, the labor market seems healthy.

But that number hides a perplexing anomaly: The percentage of men who are neither working nor looking for work has risen substantially over the past several decades. . . . This last group is ignored when calculating the unemployment rate. . . .

The data show some striking changes over time. Among women, the share out of the labor force has fallen from 66 percent in 1950 to 43 percent today. . . . Men, however, exhibit the opposite long-term trend. In 1950, 14 percent of men were out of the labor force. Today, that figure stands at 31 percent.

Mankiw goes on to look at some subsets of the male population:

Consider prime-age men, those from the ages of 25 to 54. These men are generally well past their schooling and well before their retirement. . . . In 1950, only 4 percent of prime-age men were not working or looking for work. Today, that figure is 11 percent.

Wow. That is pretty stunning. Mankiw continues:

Why has that number nearly tripled? . . . One likely hypothesis, discussed in a recent paper by the economists Katharine G. Abraham and Melissa S. Kearney, is that the rise in nonparticipation is related to declining opportunities for those with low levels of education. . . .

So what’s the problem? Baker writes:

We’ll get to the explanations in a moment, but the biggest problem with explaining the drop in labor force participation among men as a problem with men is that since 2000, there has been a drop in labor force participation among prime-age women also.

If we take the May data, the employment to population ratio (EPOP) for prime-age women stood at 72.4 percent.[1] . . . the drop against the 2000 peak of 74.5 percent is more than two full percentage points. That is less of a fall than the drop in EPOPs among prime men since 2000 of 3.2 percentage points, but it is a large enough decline that it deserves some explanation. In fact, the drop looks even worse when we look by education and in more narrow age categories.

In a paper last year that compared EPOPs in the first seven months of 2017 with 2000, Brian Dew found there were considerably sharper declines for less-educated women in the age groups from 35 to 44 and 45 to 54 than for men with the same levels of education. The EPOP for women between the ages of 35 and 44 with a high school degree or less fell by 9.7 percentage points. The corresponding drop for men in this age group was just 3.4 percentage points.

The EPOP for women with a high school degree or less between the ages of 45 and 54 fell by 6.7 percentage points. For men, the drop was 3.3 percentage points. Only with the youngest prime-age bracket, ages 25 to 34, did less educated men see a larger falloff in EPOPs than women, 8.2 percentage points for men compared to 6.9 percentage points for women.

Looking at these data, it is a bit hard to understand economists’ obsession with explaining the drop in EPOPs for men. . . .

Now I was curious. I’m not experienced working with such data so I did a quick google search and found this page which makes graphs for the employment-to-population ratio for men and women.

Here’s the time series that Mankiw talked about: the decline for men and rise for women since 1950:

Now let’s focus on that last period since 1999:

I wasn’t able to easily grab the breakdowns by age. Overall, though, I see Baker’s point: In the 2008-2009 recession, employment dropped a lot more for men than for women, but then in the long recovery, the rebound was much greater for men. So if you want to ask, “Why aren’t so many people working?”, I agree with Baker that it seems to be a mistake to focus on men and not look at what’s happening with women too. This does not mean that Mankiw’s piece is so bad—there’s not room to cover everything in a newspaper article.
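If you want to make these graphs yourself, something like the following sketch should work. One caveat: the FRED series IDs below for the prime-age (25-54) employment-population ratio by sex are from memory and should be double-checked on the FRED site before relying on them.

```python
# Rough sketch for pulling and plotting the series yourself. Caveat: the
# FRED series IDs below (prime-age, 25-54, employment-population ratio for
# men and women) are a best guess and should be verified on
# fred.stlouisfed.org before use.
import datetime
import matplotlib.pyplot as plt
import pandas_datareader.data as web

start = datetime.datetime(1999, 1, 1)
series = {
    "LNS12300061": "Men, 25-54 (assumed series ID)",
    "LNS12300062": "Women, 25-54 (assumed series ID)",
}

for code, label in series.items():
    epop = web.DataReader(code, "fred", start)   # monthly series from FRED
    plt.plot(epop.index, epop[code], label=label)

plt.ylabel("Employment-population ratio (%)")
plt.legend()
plt.show()
```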

The narrative of the crisis among men

One thing that did bother me, though, was the way in which Mankiw’s story fits all too comfortably into the media narrative about the decline of men. It reminded me of the discussion we had a couple years ago about trends in mortality of middle-aged white Americans. This had been widely reported as yet another misfortune happening to men, but it turned out when you break up the data that the mortality rate was decreasing for white men in the age group in question; it was the women (in particular, women in the south) whose death rates were going up.


Carol Nickerson explains what those mysterious diagrams were saying

[Figure 2 from Kok et al. (2013), Psychological Science, 24(7), 1123-1132]

A few years ago, James Coyne asked, “Can you make sense of this diagram?” and I responded, No, I can’t.

At the time, Carol Nickerson wrote up explanations for two of the figures in the article in question. So if anyone’s interested, here they are:

Carol Nickerson’s explanation of Figure 2 in Kok et al. (2013):

I am not by any stretch of the imagination an expert on structural equation modeling but here is what I believe that this diagram is attempting to convey. But first, a bit on the relevant SEM symbol conventions:

Variable names in square or rectangular boxes represent manipulated, measured, or observed variables. “Experimental Condition” is a manipulated variable, for example. “Baseline Vagal Tone” (start point HF_HRV) is a measured variable. “PE w1”, which stands for “Positive Emotions week 1” is a measured variable. (Actually, it is the mean of a set of measured variables.) And so on.

Variable names in circles or ovals represent “latent” (hidden) unmeasured or “inferred” variables. “Positive Emotions Intercept” and “Positive Emotions Slope” are unmeasured variables; they are inferred from the 9 weeks of positive emotions measurements. More on this in a bit.

Long straight lines (or “paths”) with an arrowhead at one end indicate relations between variables. The arrowhead points to the predicted (criterion, dependent) variable; the other end is attached to the predictor (independent) variable(s). Kok et al. stated that they were using black lines to represent hypothesized relations between variables, solid gray lines to represent relations between variables in the literature that they expected to be replicated, and dotted gray lines to represent relations about which they did not have hypotheses and which they expected to be non-significant. For convenience, Kok et al. labeled with lower case Roman letters the paths that they discussed in their article and supplemental material.

Most lines represent “main” effects. Paths that intersect with a big black dot indicate a “moderator” effect or an “interactive” effect. A criterion variable is predicted from all the variables that have lines with arrowheads running into them. So, for example, “Social Connections Slope” is being predicted only from “Positive Emotions Slope”; this path is labeled “h”. But “Positive Emotions Slope” is being predicted from “Experimental Condition”, “Baseline Vagal Tone” (start point HF-HRV), and the interaction of these two variables; these paths are labeled “d”,”f”, and “e”, respectively.

Kok et al. conceived of change, or growth, or increase (or decrease) in positive emotions as the slope of a line constructed from 9 weekly measurements of positive emotions. Suppose that we have a graph with the x axis labeled 0 (at the y axis) to 8 for the 9 weeks, and the y axis labeled 1 to 5 for positive emotions scores. Then we plot a participant’s 9 weekly positive emotions scores against these time points on this graph and fit a line to them. That line has a slope, which indicates the rate of change, or growth, or increase (or decrease) in positive emotions across time. The line has an intercept with the y axis, which is the start point; no growth in positive emotions has occurred yet. On Kok et al.’s Figure 2, the 9 lines from the oval labeled “Positive Emotions Slope” to the 9 rectangles labeled “PE w1”, “PE w2”, etc., indicate that this growth (slope) is being inferred from the 9 weekly measurements of positive emotions. The 9 rectangles are equally spaced, perhaps to indicate that the 9 measurements are equally spaced across time, although usually there are numbers near the lines that indicate the measurement spacing. The 9 lines from the oval labeled “Positive Emotions Intercept” to the 9 rectangles labeled “PE w1”, “PE w2”, etc., similarly indicate that the intercept or start point of growth is being inferred from the 9 weekly measurements of positive emotions. [Note: This is a conceptual explanation. I don’t have MPlus, so I don’t know what the computer program is actually doing.]

The Greek lower case letter delta (the “squiggle”) with a short gray arrowheaded line to each of the PE rectangles indicates that positive emotions are measured with error at each week. Ditto for the SC rectangles. Delta is used for error on predictors, usually, and lambda is used for error on criterion variables. All measured variables have error, of course; Kok et al. omitted error for most of the variables because this is assumed. They included error for the special situation described below.

Kok et al. were also interested in change, or growth, or increase (or decrease) in perceived social connections. Again, they conceived of change as the slope and the start point as the intercept. But they realized (or perhaps a reviewer pointed out) that participants were recording their positive emotions and their social connections at the same time, so that these measurements were not independent. Therefore, they (actually, the MPlus computer program) adjusted for “correlated residuals”. In SEM, correlations are indicated by circular lines with arrowheads at both ends. You can see these short circular lines in the diagram between the deltas for the PE rectangles and the deltas for the SC rectangles. These lines just indicate that the residuals (errors) for each week’s PE and SC are correlated, not independent (as is required for many statistical techniques).

So, let’s match the diagram to Kok et al.’s hypotheses and text, working from the top. Their three hypotheses were

(1) Experimental condition (control group, LKM group) and start point vagal tone (HF-HRV in the article, RSA in some of the sections of the supplemental material) interact to predict change in positive emotions over the 9 weeks of the study; LKM group participants who entered the study with higher vagal tone showed greater increases in positive emotions.

(2) Increases in positive emotions predict increases in perceived social connections.

(3) Increases in perceived positive social connections predict increases in vagal tone from start point to end point.

From the top of the diagram, the black lines labeled “d”, “e”, and “f” indicate that change in positive emotions (the oval labeled “Positive Emotions Slope”) is being predicted from the variable experimental condition (the rectangle labeled “Experimental Condition”), the variable start point HF-HRV (the rectangle labeled “Baseline Vagal Tone”), and the interaction of these two variables (the big black dot). This part of the diagram represents Hypothesis 1. [Note: It is not stated in the diagram or the article that HF-HRV was subjected to a square root transformation, an oversight that I brought to Kok’s attention at some point. Accordingly, she inserted this information into the top of the supplemental material, which seems an odd place for it.]

The black line labeled “h” leading from the oval labeled “Positive Emotions Slope” to the oval labeled “Social Connections Slope” indicates that growth (increase) in social connections is being predicted from growth (increase) in positive emotions. This part of the diagram represents Hypothesis 2.

The black line labeled “k” leading from the oval labeled “Social Connections Slope” to the oval labeled “Change in Vagal Tone” indicates that change (increase) in vagal tone is being predicted from growth (increase) in social connections. [They have taken a shortcut here, probably for simplicity. They didn’t indicate that change in vagal tone is based on the residuals from a regression of end point vagal tone on start point vagal tone; ordinarily this would be shown in the diagram and mentioned in the article. It is not mentioned until the supplemental material.] This part of the diagram represents Hypothesis 3.

In the supplemental material, Kok et al. included tests of hypotheses like the three above,
but substituted “Positive Emotions Intercept” for “Positive Emotions Slope.” For the intercept analogue of the slope hypothesis for Hypothesis 1, it appears that there is a mistake in the diagram. In the supplemental material, Kok et al. indicated that they predicted “Positive Emotions Intercept” from “Experimental Condition”, “Baseline Vagal Tone,” and the interaction of these two variables. Path “a” for the “Positive Emotions Intercept” is analogous to path “d” for the “Positive Emotions Slope”; path “c” for the Positive Emotions Intercept” is analogous to path “f”; but there is no path for the “Positive Emotions Intercept” that is analogous to path “e” for the interaction for the “Positive Emotions Slope”. Instead they have a path “b”. The drawing seems incorrect here.

For the intercept analogue of the slope hypothesis for Hypothesis 2, path “g” is like path “h”.

For the intercept analogue of the slope hypothesis for Hypothesis 3, path “j” is like path “k”.

So, making sense of this diagram is not really very difficult, once you know a little bit about how structural equation modeling diagrams are constructed. (I am in no way saying that this is good research. It isn’t. I am just explaining the diagram.)

The one aspect of the diagram that puzzles me is Kok et al.’s statement in the Figure 2 caption that the solid gray lines represent anticipated significant replications of the literature. Lines like these (not necessarily solid gray) usually appear in diagrams that infer a slope and an intercept from multiple measurements of some sort, without mention of anticipated replications of the literature. Perhaps the diagram was modified at some point without the necessary accompanying change to the caption?

Was the diagram a necessary addition to the text? Does it aid in understanding what the authors are doing? I don’t think so. It seems to me that a *good* verbal explanation would have been sufficient. On the other hand, it is standard practice for articles using structural equation modeling to include such a model, especially if it is complicated, as such models often are. Moreover, PSYCHOLOGICAL SCIENCE has a severe and enforced word limit, and authors sometimes work around that by including tables and figures that should not be necessary.
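To make the intercept-and-slope construction described above concrete, here is a toy sketch (my code, not Nickerson’s or Kok et al.’s): each participant gets an ordinary least-squares line fit through their 9 weekly scores, whereas an actual latent-growth model (e.g., in Mplus) would estimate all the intercepts and slopes jointly.

```python
# Toy illustration of inferring a per-participant intercept and slope from
# 9 weekly positive-emotions scores (fake data, separate OLS fits).
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(9)                                       # weeks 0..8, equally spaced

n_participants = 10
true_intercepts = rng.normal(2.5, 0.5, n_participants)     # start levels on a 1-5 scale
true_slopes = rng.normal(0.05, 0.05, n_participants)       # weekly trends
pe = (true_intercepts[:, None]
      + true_slopes[:, None] * weeks
      + rng.normal(0, 0.3, (n_participants, 9)))           # weekly measurement error

for i, y in enumerate(pe):
    slope, intercept = np.polyfit(weeks, y, deg=1)         # line through the 9 points
    print(f"participant {i}: intercept = {intercept:.2f}, slope = {slope:+.3f}")
```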

Carol Nickerson’s explanation of Figures 4 and 5 in Kok et al. (2013):

These figures take up about three-quarters of a page each; they could have been more succinctly and clearly presented as standard (frequency) “contingency tables” aka “cross-tabulations” of the sort that one sees in chi-square analyses.

Here is the information for Figure 4 in a contingency table. Kok et al. first took the distribution for change in positive emotions and divided it into quarters, ditto the distribution for change in perceived social connections. They then tabulated the number of study participants in each of the sixteen cells constructed by crossing the four quartiles for change in positive emotions and the four quartiles for change in social connections for each of the two study groups (control and LKM).

Control group (PE quartiles in rows, SC quartiles in columns):

PE quartile      1    2    3    4   total
1                8    3    1    0      12
2                5    4    5    1      15
3                1    1    2    2       6
4                0    0    0    1       1
total           14    8    8    4      34

LKM group (PE quartiles in rows, SC quartiles in columns):

PE quartile      1    2    3    4   total
1                2    2    0    1       5
2                0    1    0    0       1
3                1    3    4    2      10
4                0    2    4    9      15
total            3    8    8   12      31

N = 65

So what does this contingency table tell us? The first thing it tells us is that the sample size is too small to say anything with much certainty. But that is the case for the whole damn study. Let’s ignore that.

So what can we say?

(a) Looking at the row marginals, the control participants seem more likely to have changes in positive emotions at the low end of the distribution and the LKM participants seem more likely to have changes in positive emotions at the high end of the distribution.

(b) Looking at the column marginals, the control participants seem more likely to have changes in social connections at the low end of the distribution and the LKM participants seem more likely to have changes in social connections at the high end of the distribution.

(c) Do participants with low changes in positive emotions have low changes in social connections? Do participants with high changes in positive emotions have high changes in social connections? It appears so for both the control group and the LKM group. The control participants are clumped in the upper left-hand corner (low – low) of their table; the LKM participants are clumped in the lower right-hand corner (high – high) of their table.

(d) Is the size or form of the relation different for the two groups? Probably not.

Now what did Kok et al. write about this figure? From p. 1128: “As predicted by Hypothesis 2, slope of change in positive emotions significantly predicted slope of change in social connections (path h in Fig. 2; b = 1.04, z = 4.12, p < .001; see Fig. 4)." That's all. They didn't distinguish between control and LKM. So it seems that Figure 4 refers to (c) above. If they weren't distinguishing between control and LKM, the two groups could be combined:

Both groups combined (PE quartiles in rows, SC quartiles in columns):

PE quartile      1    2    3    4   total
1               10    5    1    1      17
2                5    5    5    1      16
3                2    4    6    4      16
4                0    2    4   10      16
total           17   16   16   16      65

which is statistically significant (except that the cell sizes are still too small for the test to be valid).

But the figure is not an appropriate way to illustrate the effect. All of the effects in the article deal with scores — means, slopes, regression coefficients, etc. The figure deals with numbers of participants, which is not the same, and which was not what was being analyzed in the model.

I don’t see any point to this figure at all. The statistics reported indicated that change in social connections was positively predicted by change in positive emotions; as positive emotions increased, so did social connections. Why does the reader need a 3/4 page figure to understand this?

A similar analysis of Figure 5 is left to the interested reader :-).

The analysis for Figure 4 had N = 65; for Figure 5, N = 52. This is much too small a sample size for such a complicated model. Kok et al. stated that they used a variant of a “mediational, parallel-process, latent-curve model (Cheong, MacKinnon, & Khoo, 2003).” The analysis example in the Cheong et al. article had an N in the neighborhood of 1300-1400 (unnecessarily large, perhaps). Never mind all the other problems in the Kok et al. article, PSYCHOLOGICAL SCIENCE should have rejected it just on the basis of the small N. I’d be wary of the results of a simple one-predictor model with an N of only 52!
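As a side note, Nickerson’s chi-square remark about the combined table is easy to check; here is a quick sketch (my code, not hers) that runs the test and looks at the expected cell counts:

```python
# Quick check of the chi-square remark above (my code, not Nickerson's).
import numpy as np
from scipy.stats import chi2_contingency

# Combined control + LKM table: PE quartiles in rows, SC quartiles in columns
combined = np.array([[10, 5, 1,  1],
                     [ 5, 5, 5,  1],
                     [ 2, 4, 6,  4],
                     [ 0, 2, 4, 10]])

chi2, p, dof, expected = chi2_contingency(combined)
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.2g}")
print(f"largest expected cell count = {expected.max():.2f}")
# Every expected count is below 5, which is the point about the cell sizes
# being too small for the test to be trusted.
```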

So, it seems that Nickerson put much more work into this than was warranted by the original paper, which received prominence in large part because it was published in Psychological Science, which during that period was rapidly liquidating its credibility as a research journal, publishing papers on ovulation and voting, etc.

I’m sharing Nickerson’s analysis here because it could be useful to future students and researchers, not so much regarding this particular paper, but more generally as a demonstration that statistical mumbo-jumbo can at times, with some effort, be de-mystified. One can’t expect journal reviewers to go to the trouble of doing this—but I would hope that authors of journal articles would try a bit harder themselves to understand what they’re putting in their articles.
