Jonah Lehrer Interviews Christopher Chabris
A conversation between May 14 and June 20 on writing about social science (and doing it).
Editor's Note: We fully recognize the severity of the actions that recently led Jonah Lehrer to resign his post at The New Yorker. This interview, conducted between May 14 and June 20, 2012, was submitted to us by Christopher Chabris three days before that announcement. After verifying its accuracy, and confirming permission from both authors, we felt it was important to share it because it explores such pressing questions about research and writing in social science.
Science writer Jonah Lehrer published his third book, Imagine: How Creativity Works, in March of 2012. It was reviewed in the New York Times Book Review by Christopher Chabris, a psychology professor at Union College in New York and the co-author (with Daniel Simons) of The Invisible Gorilla, and Other Ways Our Intuitions Deceive Us. The review contained several criticisms of the book, to which Lehrer replied on his blog. Chabris then posted a rebuttal with additional critical comments on his website. Lehrer emailed Chabris the next day and proposed to interview him about some of the issues that were in dispute. Chabris agreed, and over the ensuing weeks the interview took place by email. Below is an edited transcript of the conversation:
In your review, you criticized my failure to discuss the replication of a 2009 Science paper that looked at the effects of color priming. I think your criticism raises a really important issue, which is how journalists and academics should write about provisional results. It’s no secret that psychology is replete with very interesting studies that fail the long-term test of replication. Sometimes, these studies are clearly falsified. But mostly what happens is this peculiar purgatory in which the results are inconclusive: the effect is often replicated, but only under particular circumstances, which we typically fail to understand. One of my favorite examples of this is stereotype threat, which refers to the anxiety that occurs when people are reminded of a negative stereotype about their social group. Pioneered by Claude Steele and Joshua Aronson in the early 1990s—they showed that African-American students underperformed on academic tests when informed that the test measured intelligence—the effect has since been confirmed in all sorts of clever situations, from the jumping abilities of white men to the chess abilities of women.
And yet, as shown in Table 3 of a meta-analysis by Nguyen and Ryan, the effect sizes for stereotype threat are all over the place, with many unpublished studies failing to find the effect at all. This doesn’t mean stereotype threat isn’t real. In fact, this 2008 meta-analysis concludes the effect exists. It just means something is happening here, and we don't quite know what it is.
As I noted in this article, similar trends appear to afflict many influential ideas in psychology. (The most notable replication dust-up of 2012 involves a failed replication of a classic John Bargh priming paper.) So here’s my question: how do you recommend dealing with this issue, both as a professor and as a science writer? Do you think one should always reference the publication history when citing a study? You have a wonderful discussion of the failed replication of the so-called Mozart effect in The Invisible Gorilla, but I couldn’t help but notice that you don’t delve into the replication history of the many other experiments you cite in the book. Was that because you deemed them solid? If so, what led you to conclude that they were solid? (On a side note, isn’t it ridiculous that no one has invented an academic search tool that automatically collects the various attempted replications of a given study?) I’m also curious as to how you deal with this as a professor. Stereotype threat, for instance, is cited in three of my psychology textbooks. And yet, not a single one points out the complicated experimental history of the effect.
Let me first say that I am not an expert on stereotype threat, and I have not studied the literature on whether and when the basic effects replicate. But I am not surprised that the story appears more complicated than its portrayal in psychology textbooks would lead one to believe. As you note, this happens often. Indeed, one should expect variability in effect sizes in a meta-analysis, especially given the small sample sizes apparent in the studies included in this one. Getting an accurate effect-size estimate from a meta-analysis, however, depends on an absence of publication bias in the literature, and the well-known reluctance of journals to publish non-replications (see, e.g., the experience following Daryl Bem’s psi paper) almost guarantees that meta-analytic estimates of effect sizes will be inflated.
I should say that I don't believe in a “decline effect” (notwithstanding your very interesting New Yorker article on the subject). There may be reasons for specific effects declining; perhaps for antidepressants, the population of patients in recent studies differs from that in earlier studies. But in general, effects don’t automatically decline or go away—they were simply not really that big to start with. I think that the winner’s curse, plus publication bias, plus small sample sizes, plus unrecognized “investigator degrees of freedom” all combine to create a pattern in which the first published estimates of an effect are likely to be much too large.
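The mechanism described here (small true effects, small samples, and a publish-only-if-significant filter) can be sketched in a quick simulation; all effect sizes, sample sizes, and the crude z-style significance test below are hypothetical illustrations, not a reconstruction of any particular literature:

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2   # hypothetical true standardized mean difference
N_PER_GROUP = 20    # small samples, as in many early studies
N_STUDIES = 2000    # studies attempted across the (simulated) field

published = []  # effect estimates that clear the "significance" bar
for _ in range(N_STUDIES):
    treat = [random.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_GROUP)]
    control = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    d = statistics.mean(treat) - statistics.mean(control)
    se = (2 / N_PER_GROUP) ** 0.5       # standard error of the difference
    if abs(d) > 1.96 * se:              # crude "p < .05" filter
        published.append(d)             # only significant results get published

print("true effect:", TRUE_EFFECT)
print("mean published estimate:", round(statistics.mean(published), 2))
```

Because only overestimates can clear the significance threshold when the true effect is small and the samples are underpowered, the average published estimate comes out several times larger than the truth, exactly the inflation pattern described above.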
This means that if writers of textbooks or trade books or other popular renderings of research results focus on splashy findings that have been published recently, they in turn are guaranteed to have a lot of “false positives” in their own works. What to do about this? I don’t think that ignoring the problem and writing as though every study is equally credible, or that the first, most publicized study is the only one that exists, or that the only published study is always likely to hold up to future replication and scrutiny, is a reasonable solution. Nor is breaking up a narrative to insert several paragraphs of literature review every time a study is mentioned.
One solution is for writers to focus more tightly on those scientific results that are most likely to be true. They could identify those studies by looking for large effects, large sample sizes, multiple independent replications, and meta-analytic reviews. They could be wary of studies with small effects, p-values clustering near .05, small sample sizes, no replications (or all replications coming from a single lab or investigator team), and no meta-analytic reviews. Of course, a lot of judgment has to be involved in this selection. That judgment should be informed by a broad knowledge of what is going on in the relevant research areas, as well as a good intuitive understanding of statistical methods and research design. I would argue that today, science writers and journalists have to do more than rewrite press releases and take researchers’ interpretations and claims about their own results at face value. They should not only look for implications, patterns, and trends in the surface features of the scientific landscape, but also cast a critical eye on what lies beneath. In short, they should try to do the things journalists are supposed to do. The best ones are of course capable of this, and the rest can learn.
It’s true that in The Invisible Gorilla, Dan Simons and I didn’t go into the “replication history” of most of the findings we discussed, except in the case of the Mozart effect (and the effects of videogames and other cognitive training regimens that we discussed in the book). I think we were justified in this because most of the effects we discussed were large, multiply replicated ones whose basic existence was not in doubt. However, I can think of two experiments we presented that deserved more careful analysis. One was by Gideon Keren and colleagues, involving people’s preferences for weather forecasters who made different kinds of forecasts. The issue is not whether that effect replicates (in fact we replicated it ourselves), but whether it means what we said it did. The other study was by Richard Thaler and colleagues, and the issue there was whether the simplified experimental asset-allocation game they created generalized well to the more complex environment in which actual investors operate. If we were to write about those studies today, we would have been more nuanced. Social science is complicated, and it is hard to get everything right. But we still must try. I think it’s possible to do so without losing readers, and that painting an oversimplified picture of science might gin up excitement and interest in the short run, but does no one a service in the long run. Now as to that tool you suggest, for finding all the attempts at replicating a particular study—sign me up as a beta-tester! But before we get too excited, we have to find ways to make sure that both successful and failed replications (and everything in between) can be made public in a credible forum so that whatever tools we use can come to more accurate conclusions about whether the effects are real.
Thanks again for starting this conversation. I look forward to the rest of it!
So I’m curious: given the replication issues in the social sciences, what reforms to the scientific process would you advocate? One issue I wrestle with as a science writer is the urge to write about counter-intuitive findings. There is, of course, something irresistible about surprise, about learning that the conventional wisdom and/or scientific consensus has been neatly overturned in the latest issue of some academic journal. The bad news, of course, is that those lovely surprises tend to be statistical outliers; there’s often a very good reason why the consensus exists.
Of course, these same incentives can influence scientists and journals. Any inkling how we might fix this problem? The “man bites dog” bias strikes me as both very important and very, very difficult to counteract.
It makes sense for all of us in the process (researchers, writers, readers) to be alert for the surprising and counterintuitive, for the obvious reason that such facts and findings are most likely to demand updating of our prior theories and beliefs. I think the problem is that we are too insensitive to the uncertainty that surrounds these surprises. When you read about a finding in a journal article, see it described in the Wall Street Journal, or hear about it on TV—especially if all the potentially limiting caveats and details like sample size, p-values, etc. are removed (or if you don’t have the training to understand them in the first place)—you are likely to accept the finding as true. In 1991 Dan Gilbert published a wonderful article in American Psychologist about our tendency to automatically tag new statements as true as soon as we hear them; he argued that it takes an extra, effortful step to re-tag them as false or uncertain. So we are susceptible to believing every new result, at least at first.
Personally, I think the human appetite for what’s new (as reflected in, among other things, the frequency with which we ask each other “what’s new?”) has to be accepted, as does the desire of researchers and journalists to come up with good answers to that question. No one in science wants to make a career of replicating or reporting on only the most settled truths, except perhaps textbook writers, and even they often include the newest findings before they might be fully ripe for canonization.
However, I can think of two separate areas where improvements might be made. First, academic scientists can increase the value placed on attempts to replicate findings. Checking the important findings of others should be an honorable task. This would mean giving students and other researchers incentives to incorporate more direct replication attempts into their own work. Independent replication is the best way to decide whether a finding is “real,” and both successful and unsuccessful results of replication attempts should be valued equally highly. Perhaps the graduate-level thesis project in some fields could be divided into one part replication, one part extension/innovation. Perhaps new journals could be developed that would publish replication attempts regardless of result, on condition that the methods and analysis plan were cleared in advance by the editor, reviewers, and maybe even the author(s) of the study under replication. Alex Holcombe and Dan Simons are working to launch exactly such a journal for the field of experimental psychology, and I hope they succeed. Second, journalists can take a greater interest in whether what they are writing about turns out to be true or not. Reporting that merely describes the “study of the week” as though its results must be true, and its implications must be urgent, without providing context, evaluation, and caveats, is of little value and is most likely to be in error. Some publishers and editors won’t care about having a reputation for sensationalism in science writing; they probably already have such a reputation in other areas, and sophisticated readers will recognize this. But publishers and editors who do care about getting it right might learn to separate good from bad practices in their science coverage, just as they try to do in other areas. 
All of this applies especially to books, since they will be read for years after they are published, and readers of science books expect the information they contain to be more timeless than what they can find in blogs.
You also criticized me for citing field studies that attempted to infer causation from mere correlation. Obviously, that’s a long-standing empirical problem. One way to clarify the issue has been the use of controlled experiments in the lab, which can put our speculative inferences about causal forces to the test. And yet, the more I write about the sciences of human nature, the more convincing I find many field studies, even if they tend to have messier data, smaller effect sizes and more ambiguous causal stories. Not only are most lab experiments conducted on a small segment of the population—according to one recent analysis, undergrads from a short list of Western nations make up 80 percent of study subjects in top journals—but they’re often done in a way that bears little resemblance to real life. (Sometimes, this is a necessity of the experimental tool—lying down in an fMRI machine is a very bizarre experience.) Although I certainly fret about causal mistakes, I also sympathize with researchers who argue that the criticism is often used to discourage field work, which almost inevitably comes with lots of confounding variables.
Do you see this as a tension? Do you worry about the skewed samples of traditional psych studies? And how might we make our lab experiments more realistic?
In my review of Imagine, I didn’t mean to criticize you for using field studies per se. I don’t object at all to anyone citing field studies. I have cited and conducted them myself. In fact, I think that in many social sciences, field studies are undervalued, and that more should be conducted and published. Having a more representative, and often larger sample, than we can test in the lab, is just one of many reasons.
Sometimes lab effects generalize nicely to the field, sometimes they don’t, and sometimes we have no idea. In those latter cases we should be cautious about prescribing policy and advising behavior change. Economists have been especially sensitive to the lab-field gap and the size of effects in the real world precisely because they think, either explicitly or implicitly, in policy terms. I think psychologists should take the lab-field gap more seriously. Field studies are crucial links in the chain that enables us to generalize from theory and lab experiment to real-world application—and this is exactly the chain that many science writers want to follow in their articles and books.
But there is a crucial distinction between field studies in general and field experiments in particular. Measuring a correlational effect in the field is incredibly valuable, but doing so does not enable one to infer cause-and-effect any more than measuring the same effect in the lab does. As long as the data are observational, with uncontrolled conditions and no random assignment (or no natural experiment that sufficiently approximates random assignment), there is always the possibility that unobserved variables explain some or all of the observed association. It makes no difference whether the observations are made in the field or in the lab; without random assignment (at a minimum), causality is a hypothesis rather than a conclusion.
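The point about unobserved variables can be made concrete with a toy simulation (all variables hypothetical): a hidden confounder drives both an "exposure" and an "outcome", producing a large observational difference even though the exposure has no causal effect at all, while randomly assigning the same inert exposure yields a difference near zero.

```python
import random
import statistics

random.seed(4)

N = 10_000
# A hidden confounder influences both who gets "exposed" and the outcome.
confounder = [random.gauss(0, 1) for _ in range(N)]
exposure = [1 if c + random.gauss(0, 1) > 0 else 0 for c in confounder]
outcome = [c + random.gauss(0, 1) for c in confounder]  # exposure plays NO causal role

def group_diff(flags, ys):
    """Mean outcome of the exposed group minus the unexposed group."""
    exposed = [y for f, y in zip(flags, ys) if f]
    unexposed = [y for f, y in zip(flags, ys) if not f]
    return statistics.mean(exposed) - statistics.mean(unexposed)

# Observational comparison: large spurious difference via the confounder.
print("observational difference:", round(group_diff(exposure, outcome), 2))

# Random assignment of the same (causally inert) exposure: difference near zero.
randomized = [random.random() < 0.5 for _ in range(N)]
print("randomized difference:", round(group_diff(randomized, outcome), 2))
```

This is why random assignment (or a natural experiment approximating it) is the minimum requirement for treating causality as a conclusion rather than a hypothesis: randomization severs the link between the confounder and who ends up in each group.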
For example, in The Invisible Gorilla, Dan Simons and I argued that Boston police officer Kenny Conley might have experienced inattentional blindness and actually failed to notice a brutal assault near where he was running in pursuit of a suspect late at night. This possibility was not considered by the legal system when Conley was convicted of lying under oath by claiming he saw nothing. We knew from a large number of lab experiments that inattentional blindness can occur for stimuli that vary in content, duration, and salience, and it seemed logical that Conley might have had the same experience as many participants in these experiments. But we admitted in the book that there were limits to this explanation, because of the difference in circumstances between those where inattentional blindness had been rigorously demonstrated through experimental methods (computer screens and videos in well-lit psychology labs during the day), and those in which Conley found himself (outdoors, at night, running). When I talked with students in a seminar at Union College about that chapter of the book, I realized that we could try to recreate—as closely as was reasonably possible—Conley’s situation, and see whether inattentional blindness would occur. So we had subjects run along a path through campus while a group of three confederates staged a fight nearby, and we asked the subjects at the end of the run whether they had noticed the fight. We found that, depending on whether it was day or night, and how much the subjects had to pay attention to while running, 28–65% of the subjects did not see the fight. This kind of field experiment helps to fill in the inferential gap that we had to acknowledge when we wrote about Conley’s case in our book.
David Laibson, Dan Ariely, Esther Duflo, and many other social scientists have done very clever field experiments in which they persuade businesses or other organizations to randomize something they do in order to measure the effect in a natural setting with a much more representative sample of the population. These are difficult to do, but they can be done, and they have become the gold standard for validating policy and business ideas, just as randomized, controlled trials (RCTs) are the gold standard for demonstrating drug efficacy. To be clear, I am by no means denigrating observational field studies. These are also important (e.g., they constitute virtually the entire field of epidemiology, which is vital to public health), but we should understand that they are valuable because they show the existence and size of correlations in real world settings, not because they somehow demonstrate causation. Nor am I claiming that every RCT is perfect. No single method answers all questions, but randomized experiments answer some questions better than any other method we know of.
Last question: How has the increasing awareness of the limitations of psychology and neuroscience, at least given current experimental practices and tools, changed your own research, teaching and writing? Are these issues you find yourself emphasizing to a greater degree with undergrads and grad students? Has it informed your own data collection?
That’s a great question. One of the challenges in research and teaching is to keep up with current knowledge and best practices. It seems to me that in psychology, many people stick with what they learned in their graduate school or postdoctoral labs—the same experimental methodologies, the same data analysis techniques, and the same rules of thumb about sample sizes, expected effect sizes, and statistical power. In particular, statistical power is one of those concepts that social scientists have to think about all the time, but about which we possess very poor intuitions. Tversky and Kahneman famously showed this in their seminal 1971 paper, “Belief in the Law of Small Numbers.” The fact that four decades later, many researchers do not understand some basic facts about p-values and sample sizes testifies to how hard it is to assimilate this kind of abstract knowledge into everyday scientific thinking.
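The poor power intuitions mentioned here can be illustrated with a small simulation; the effect size (0.3 SD) and per-group sample sizes are purely illustrative, and the "test" is a crude z-style threshold rather than a proper t-test:

```python
import random
import statistics

random.seed(2)

def power(effect, n_per_group, sims=4000, z_crit=1.96):
    """Estimate power by simulation: the fraction of two-group studies
    whose observed mean difference clears a z-style significance bar."""
    hits = 0
    for _ in range(sims):
        treat = [random.gauss(effect, 1) for _ in range(n_per_group)]
        control = [random.gauss(0, 1) for _ in range(n_per_group)]
        d = statistics.mean(treat) - statistics.mean(control)
        se = (2 / n_per_group) ** 0.5
        if abs(d) > z_crit * se:
            hits += 1
    return hits / sims

# How often does a 0.3 SD effect reach "significance" at various sample sizes?
for n in (20, 80, 320):
    print(n, "per group:", round(power(0.3, n), 2))
```

With 20 participants per group, a 0.3 SD effect is detected only a small fraction of the time; it takes hundreds per group before power approaches conventional targets. Most people's intuitions about what a "reasonable" sample can detect are far more optimistic than this.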
I have personally become much more aware of how psychology research is dramatically underpowered relative to the confidence we have in its conclusions. I think it wouldn’t be far off to say that more than half of the inferences drawn in psychology journal articles—and a much higher percentage of those reported in popular coverage of the journal articles—are either wrong or insufficiently justified by the underlying data. Just the other day I read an article in a science magazine about a study that claimed to have found a gene associated with participants’ trustworthiness, which was measured by having observers watch videos of the participants’ behavior. The length of the video clips is mentioned (20 seconds), but not the number of observers (116), the size of the effect (a bit more than two thirds of a standard deviation), or the most crucial number of all: how many participants provided DNA and video clips. It turns out that number, the sample size, is twenty-three. Twenty-three! There is little chance that this result will replicate in larger samples, since it normally requires tens of thousands of participants—that is, about three orders of magnitude more than in this study—to have sufficient power to detect a genetic association with a behavioral phenotype.
I try to increase the sample sizes in my own studies whenever possible. My own rules of thumb about sample sizes probably call for about twice as many participants in a typical study as I would have used for the same thing ten years ago. Some of my papers have involved very large samples; for example, we have an article in press at Psychological Science that used three samples totaling almost 10,000 people to show that most of the published associations of specific genetic variants with general intelligence simply do not replicate. The effects are either false positives, or orders of magnitude smaller than originally thought. Of course, it’s hard to bring 10,000 people through a traditional psychology lab; this research is based on analyzing pre-existing datasets. Collaborating with economists has helped open my eyes to the value—and challenges—of working with this kind of data rather than sticking with smaller-scale lab methods.
Insensitivity to range restriction—the error of estimating a correlation based on a set of data that doesn’t include the full range of each variable—has also become more prominent in my thinking. This is the mistake that people like David Brooks and Malcolm Gladwell make when they say that intelligence can’t have much to do with success because most Nobel prizewinners didn’t go to Harvard and MIT, or because most CEOs aren’t geniuses, and so on. The problem is that a very large correlation across the whole range of measurement (e.g., the relationship between intelligence and achievement) will seem attenuated when one looks at only part of the range. This is counterintuitive, perhaps because we think of correlations like lines rather than clouds, but in any case requires careful thought to understand properly. I try to avoid doing individual differences research on college student samples, or at least to replicate it using less homogeneous groups (e.g., people on the street, online samples, etc.), in part because of range restriction. For example, to get the most generalizable results we could, Dan Simons and I conducted a telephone survey of misconceptions about memory and thinking using a sample weighted to be representative of the U.S. census demographics.
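The attenuation described here is easy to demonstrate numerically; the true correlation (about .7) and the "elite" cutoff (+1.5 SD) below are hypothetical choices for illustration:

```python
import random

random.seed(3)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Simulate a strong true link between "ability" x and "achievement" y.
x = [random.gauss(0, 1) for _ in range(20_000)]
y = [0.7 * xi + random.gauss(0, (1 - 0.7**2) ** 0.5) for xi in x]  # r ≈ .7

full_r = pearson(x, y)

# Now look only at an "elite" slice of the range (x above +1.5 SD),
# the analogue of studying only Nobel laureates or CEOs.
elite = [(xi, yi) for xi, yi in zip(x, y) if xi > 1.5]
elite_r = pearson([p[0] for p in elite], [p[1] for p in elite])

print("full-range r:", round(full_r, 2))
print("restricted-range r:", round(elite_r, 2))
```

Within the restricted top slice the correlation drops to roughly half its full-range value, even though the underlying relationship is identical, which is exactly why "most CEOs aren't geniuses" tells us little about the ability–achievement link in the population at large.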
Research on individual differences is an inherently observational enterprise most of the time (it is hard to randomly assign people to have different personality traits, abilities, genes, etc.), so I try to remind myself, and my students, of what we can and cannot conclude from such research. I am also trying to be more self-critical in this respect. My friend Russ Poldrack recently agreed with a commentary that pointed out weaknesses in a study he had co-authored several years ago. If someone should point out that a study of mine doesn’t replicate (when repeated with adequate statistical power!), was poorly designed or analyzed, or contains overstated conclusions, I hope that I will be as honorable as Russ was.
I should add that although I have emphasized the correlation/causation distinction in my review of your book and in this interview, I don’t think that we should throw up our hands and conclude nothing any time we don’t have the gold standard evidence (e.g., a properly designed, large-sample RCT) of causation. Natural experiments and instrumental variable methods can substitute, and have the added advantage of being field studies. We can have more confidence in a causal story the more confounding factors we rule out or properly control. We can have more confidence in our study conclusions the larger the sample size and the more independent replications have been conducted. My goal in this interaction has been to draw attention to issues about what we can and should conclude from work in neuroscience and social science. For researchers and writers alike (myself included), it’s not just about telling stories that are exciting, counterintuitive, and world-changing in their implications. It’s also about appreciating and communicating the extent to which those stories are true.