The big news this week is a fresh study in Nature that reports the results of a team that sought to replicate 21 high-profile experiments in social psychology, all originally published in the journals Nature or Science between 2010 and 2015. The study has garnered a lot of headlines. You can read takes by Science Magazine, The Washington Post, Ars Technica, The Atlantic, Science Trends, and many others with a bit of Google searching. Popular interest is driven by the study’s result: the research team was able to replicate only 13 of the 21 experiments.
I am going to assume that readers are familiar with the general outlines of the “reproducibility crisis” (if you are not, Susan Dominus’ New York Times Magazine long-read on the crisis, “When the Revolution Came for Amy Cuddy,” is a good place to start). What is most interesting about this study is not that it found more experiments that failed to replicate. These days that is old hat. What is new about this study is that the experimenters asked a pool of 200 psychologists to predict which studies would fail to replicate and which ones would not. They did this both by survey and by prediction market. What did they discover? This graphic (pulled from the paper in Nature) tells the story:
You will notice that researchers did a very good job predicting which studies would fail to replicate. The studies the majority predicted would fail to replicate were the same studies that actually failed to replicate. What does this mean? Psychologists can tell the difference between good studies and bad ones. But that raises another question: if psychologists can sift the wheat from the chaff, why is so much chaff being published?
I have read some uncharitable answers to this question on Twitter. I think these answers are harsher than they need to be. But before I explain why, let me offer you a challenge: go visit the website 80,000 Hours and see if you can predict which experiments will replicate and which will not. The folks at 80,000 Hours have created a neat quiz that presents the results, methodology, and sample size of each experiment to you, and lets you guess whether the results were replicated before you see the real outcome.
OK, are we back? I took the quiz before I read the original paper or any of the news coverage about it. Despite this, I got an almost perfect score: I guessed only two wrong, and both of those I had labeled as “not sure.” How did I score so well? My predictions followed a rough rule of thumb: if the study 1) involved “priming,” or 2) seemed to fly in the face of my own experience dealing with humans in day-to-day life, I predicted it would not replicate.
You can find a suitable definition of “priming” at NeuroSkeptic. Basically, it refers to attempts to unconsciously influence perception and decision-making by exposing subjects to subtle stimuli. For example, there is a famous set of studies that found that placing a picture of eyes on a wall will increase the honesty and generosity of those exposed to it. You can criticize studies like this from two angles. The first is that this simply does not seem to describe how the actual humans you know go about living their lives. That is one of the reasons studies like this garner so much attention. They are counter-intuitive. In the age of the TED talk, cleverly subverting people’s intuitions is a high-prestige endeavor. But I tend to be extremely skeptical of any psychological study that makes unusually counter-intuitive claims. Why? Because for the greater part of humanity’s evolutionary history, the single most important selection pressure put on human beings was the ability to intuit the behavior and intentions of other human beings. Being able to understand and predict other humans’ behavior is critical to our survival. It is something we are naturally good at (though only a few very perceptive and articulate individuals are skilled at communicating these intuitions to others). I am thus usually very suspicious of any study which claims that our intuitions have led us to make faulty assessments of others’ behavior.
My demand for especially strong evidence when priming studies are conducted is also informed by advances in other fields of the behavioral sciences. Over the last two decades, there has been a substantial amount of research done on the relationship between genetics and behavior, hormones and behavior, and life history and behavior. All three streams of research suggest that a lot of our behavior (say, our propensity to be honest) is determined days and years before the actual moment of decision. It is difficult, though not entirely impossible, to square this research with social priming studies that suggest that humans live in constant churn, buffeted about by a never-ending stream of imperceptible stimuli.
Now I want to be clear here: none of the above means I reject all counter-intuitive findings, or even all social priming findings, out of hand. It does mean I ask for an unusually high standard of evidence before accepting them. I do not think, however, that I would have demanded this same standard of evidence back in 2010. This is why I am less harsh on the editors of Science and Nature than many seem to be. By 2016 it was clear that those “surveillance cue” studies that had psychologists pinning eyes up on walls were failing to replicate. In 2010 the replication wave had not yet hit, and scientists were not trained to ask themselves whether their studies would replicate. Things have changed. Psychologists now constantly ask themselves if their studies will replicate; more importantly, they have a large body of failed studies to learn from. If you have paid any attention to these developments you will have learned which kinds of studies do not replicate: those with tiny sample sizes, those that rely on superficial social priming, and those whose results are counter-intuitively flashy. But this body of failed studies did not exist in 2010. It is not fair to judge the scientists of that time against data that has only become available in ours.