Tuesday, May 16, 2017

Sometimes You Can Step into the Same River Twice

              A recurring theme in the replication debate is the argument that certain findings don’t replicate or cannot be expected to replicate because the context in which the replication is carried out differs from the one in which the original study was performed. This argument is usually made after a failed replication.

In most such cases, the original study did not provide a set of conditions under which the effect was predicted to hold, although the original paper often did make grandiose claims about the effect’s relevance to a variety of contexts including industry, politics, education, and beyond. If you fail to replicate this effect, it's a bit like you've just bought a car that was touted by the salesman as an "all-terrain vehicle," only to have the wheels come off as soon as you drive it off the lot.*

            As this automotive analogy suggests, the field has two problems: many effects (1) do not replicate and (2) are grandiosely oversold. Dan Simons, Yuichi Shoda, and Steve Lindsay have recently made a proposal that provides a practical solution to the overselling problem: researchers need to include in their paper a statement that explicitly identifies and justifies the target populations for the reported findings, a constraints on generality (COG) statement. Researchers also need to state whether they think the results are specific to the stimuli that were used and to the time and location of the experiment. Requiring authors to be specific about the constraints on generality is a good idea. You probably wouldn't have bought the car if the salesman had told you its performance did not extend beyond the lot. 

          A converging idea is to systematically examine which contextual changes might impact which (types of) findings. Here is one example. We always assume that subjects are completely naïve with regard to an experiment, but how can we be sure? On the surface, this is primarily a problem that vexes online research using platforms such as Mechanical Turk, which has forums on which subjects discuss experiments. But even with the good old lab experiment we cannot always be sure that our subjects are naïve to the experiment, especially when we try to replicate a famous experiment. If subjects are not blank slates with regard to an experiment, a variation of population has occurred relative to the original experiment. We've gone from sampling from a population of completely naïve subjects to sampling from one with an unknown percentage of repeat subjects.

             Jesse Chandler and colleagues recently examined whether prior participation in experiments affects effect sizes. They tested subjects in a number of behavioral economics tasks (such as sunk cost and anchoring and adjustment) and then retested these same individuals a few days later. Chandler et al. found an estimated 25% reduction in effect size, suggesting that the subjects’ prior experience with the experiment did indeed affect their performance in the second wave. A typical characteristic of these experiments is that they require reasoning, which is a controlled process. How about tasks that tap more into automatic processing?

             To examine this question, my colleagues and I examined nine well-known effects in cognitive psychology, three from the domain of perception/action, three from memory, and three from language. We tested our subjects in two waves, the second wave three days later than the first one. In addition, we used either the exact same stimulus set or a different set (with the same characteristics, of course).

            As we expected, all effects replicated easily in an online environment. More importantly, in contrast to Chandler and colleagues' findings, repeated participation did not lead to a reduction in effect size in our experiments. Also, it did not make a difference if the exact same stimuli were used or a different set.

            Maybe you think that this is not a surprising set of findings. All I can say is that before running the experiments, our preregistered prediction was that we would obtain a small reduction of effect sizes (smaller than the 25% of Chandler et al.). So we at least were a little surprised to find no reduction.

            A couple of questions are worth considering. First, do the results indicate that the initial participation left no impression whatsoever on the subjects? No, we cannot say this. In some of the response-time experiments, for example, we obtained faster responses in wave 2 than in wave 1. However, because the responses also became less variable, the effect size did not change appreciably. A simple way to put it would be to say that the subjects became better at performing the task (as they perceived it) but remained equally sensitive to the manipulation. In other cases, such as the simple perception/action tasks, responses did not speed up, presumably because subjects were already performing at asymptote level.
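
To make the point about speed and variability concrete, here is a minimal sketch in Python. The response times are invented for illustration and are not data from our experiments, and the helper `cohens_d` is a hypothetical name; it uses the textbook pooled-standard-deviation formula, which is an assumption and not necessarily the effect-size measure used in the paper. The wave 2 values are simply the wave 1 values multiplied by 0.75, so responses are faster and less variable, yet the standardized effect stays the same.

```python
from statistics import mean, stdev


def cohens_d(condition_a, condition_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(condition_a), len(condition_b)
    pooled_variance = (
        (n_a - 1) * stdev(condition_a) ** 2 + (n_b - 1) * stdev(condition_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(condition_a) - mean(condition_b)) / pooled_variance ** 0.5


# Invented response times in milliseconds (illustration only).
wave1_incongruent = [620, 660, 700, 740, 780]
wave1_congruent = [580, 620, 660, 700, 740]

# Wave 2: every response is 25% faster, so the spread shrinks as well.
wave2_incongruent = [rt * 0.75 for rt in wave1_incongruent]
wave2_congruent = [rt * 0.75 for rt in wave1_congruent]

raw_diff_wave1 = mean(wave1_incongruent) - mean(wave1_congruent)
raw_diff_wave2 = mean(wave2_incongruent) - mean(wave2_congruent)
d_wave1 = cohens_d(wave1_incongruent, wave1_congruent)
d_wave2 = cohens_d(wave2_incongruent, wave2_congruent)

print(f"Raw difference: wave 1 = {raw_diff_wave1} ms, wave 2 = {raw_diff_wave2} ms")
print(f"Cohen's d:      wave 1 = {d_wave1:.3f}, wave 2 = {d_wave2:.3f}")
```

In this toy case the raw difference between conditions shrinks while the standardized effect is unchanged, because the mean difference and the standard deviation shrink by the same factor.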

            Second, how non-naïve were our subjects in wave 1? We have no guarantee that the subjects in wave 1 were completely naïve with regard to our experiments. What our data do show, though, is that the 9 effects replicate in an online environment (wave 1) and that repeating the experiment a mere few days later (wave 2) by the same research group does not reduce the effect size.

           So, in this sense, you can step into the same river twice. 

* Automotive metaphors are popular in the replication debate, see also this opinion piece in Collabra: Psychology by Simine Vazire.


Monday, May 8, 2017

Concurrent Replication

I’m working on a paper with Alex Etz, Rich Lucas, and Brent Donnellan. We had to cut 2,000 words and the text below is one of the darlings we killed. I’m reviving it as a blog post here because even though it made sense to cut the segment from the manuscript (I cut it myself, the others didn’t make me), the notion of concurrent replication is an important one.

The current replication debate has, for various reasons, construed replication as a retrospective process. A research group decides to replicate a finding that is already in the published literature. Some of the most high-profile replication studies have focused on findings published decades earlier, for example the registered replication projects on verbal overshadowing (Alogna et al., 2014) and facial feedback (Wagenmakers et al., 2016). This retrospective approach, however timely and important, might be partially responsible for the controversial reputation that replication currently enjoys.
A form of replication that has not yet received much attention is what I will call concurrent replication. The basic idea is this. A research group formulates a hypothesis that they want to test. At the same time, they desire to have some reassurance about the reliability of the finding they expect to obtain. They decide to team up with another research group. They provide this group with a protocol for the experiment, the program and stimuli to run the experiment, and the code for the statistical analysis of the data. The experiment is preregistered. Both groups then each run the experiment and analyze the data independently. The results of both studies are included in the article, along with a meta-analysis of the results. This is the simplest variant. A concurrent replication effort can involve more groups of researchers.
A direct exchange of experiments (a straight “study swap”) is the simplest model of concurrent replication. It is possible to accomplish such study swaps on a larger scale where participating labs offer and request subject hours. This will likely result in a network of labs each potentially simultaneously engaged in forming and testing novel hypotheses as well as concurrently replicating hypotheses formed by other labs. The Open Science Framework features a site that has recently been developed to facilitate concurrent replication, Study Swap, see also this article.  At the time of this writing, there are four projects listed on Study Swap. We hope this number will increase soon.
Aside from this, there already are several large-scale concurrent replication efforts. An example is the Pipeline Project, a systematic effort to conduct prepublication replications, independently performed by separate labs. The first instalment was recently published (Schweinsberg et al., 2016) and a second project is underway.
Concurrent replication has several advantages. First, researchers have a better sense of the reliability of their findings prior to publication. After all, the results have been independently replicated before submission of the article. Likewise, journal editors and reviewers will have more confidence in the findings reported in the manuscript they are asked to evaluate. Journals have the luxury of publishing findings that have already been independently replicated. As a result, the reproducibility of the findings in the literature will start to increase. The Schweinsberg et al. (2016) study demonstrates that concurrent replication is not only possible but also useful.
Concurrent replication forces researchers to be explicit about the procedure by which they expect to obtain the effect. If they do indeed obtain the finding both in the original study and in an independent replication, they have what amounts to a scientific finding according to the criteria established by Popper: They can describe a procedure by which the finding can reliably be produced. It will be easy and natural to include the protocol in the method section of the article. A positive side-effect of this will be a marked improvement in the quality of method sections in the literature. As a result, researchers who want to build on these findings have two advantages that researchers currently do not enjoy. First, they can build on a firmer foundation. After all, the reported finding has already been independently replicated. Second, a replication recipe doesn’t have to be laboriously reconstructed. It is readily available in the article.
Of course, concurrent replication is not without challenges. For instance, how should authorship be determined given such an arrangement? A flexible approach is best here. At one extreme the original group’s hypothesis might be very close to the replicating group’s own interest. In this case it would therefore be logical to make members of both groups co-authors; each group may have something to add to the paper both in terms of data and analysis and in terms of theory. At the other extreme, the second group has no direct interest in the hypothesis but may be willing to run a replication, perhaps in exchange for a replication of one of their own experiments. In this case it might be sufficient to acknowledge the other group’s involvement without offering co-authorship.
Thus far, the discussion here has only involved a scenario in which the hypothesis is supported in both the initiating and the replicating lab. However, other scenarios are also possible. The second scenario is one in which the hypothesis is supported in one of the labs but not in the other. If the meta-analysis shows heterogeneity among the findings, researchers might hypothesize about a potential difference between the experiments, preregister that hypothesis and test it, again with a direct replication. If the meta-analysis does not show heterogeneity, it might be decided that it is sufficient to report the meta-analytic effect. If neither lab shows the effect, the research groups might report the results without engaging in follow-up studies. Alternatively, they might decide the experimental procedure was suboptimal, revise it, preregister the new experiment and run it, along with one or more concurrent replications.
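For readers who want to see what the meta-analytic step in the second scenario could look like, here is a minimal sketch in Python. It assumes a fixed-effect (inverse-variance) meta-analysis of two labs with Cochran's Q as the heterogeneity check; that is one common choice, not a procedure prescribed here. The effect sizes and standard errors are invented, and the function name `two_lab_meta_analysis` is hypothetical.

```python
import math


def two_lab_meta_analysis(effects, std_errors):
    """Fixed-effect meta-analysis of exactly two labs, plus Cochran's Q.

    Returns the pooled effect, its standard error, Q, and the p-value
    for Q (chi-square with 1 degree of freedom, since there are two labs).
    """
    if len(effects) != 2 or len(std_errors) != 2:
        raise ValueError("This sketch handles exactly two labs.")
    # Inverse-variance weights: more precise estimates count more.
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    # Survival function of a chi-square distribution with 1 df.
    p_heterogeneity = math.erfc(math.sqrt(q / 2))
    return pooled, pooled_se, q, p_heterogeneity


# Invented standardized effects: lab A supports the hypothesis, lab B does not.
pooled, pooled_se, q, p_het = two_lab_meta_analysis(
    effects=[0.50, 0.10], std_errors=[0.15, 0.15]
)
print(f"Pooled effect = {pooled:.2f} (SE = {pooled_se:.3f})")
print(f"Cochran's Q = {q:.2f}, p = {p_het:.3f}")
```

With only two labs, a heterogeneity test has little power, so a non-significant Q should be read cautiously. Larger concurrent efforts with more labs would give a more informative picture.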
To summarize, concurrent replication forms an underrepresented but potentially extremely valuable form of replication. Several concurrent large-scale replication efforts are currently underway and a platform that also facilitates conducting smaller-scale projects is available for use. The fact that concurrent replications are often viewed positively by the field is further evidence of the importance of replication for scientific endeavors.


Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., ... Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Schweinsberg, M. et al. (2016). The pipeline project: Pre-publication independent replications of a single laboratory's research pipeline. Journal of Experimental Social Psychology, 66, 55–67.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., . . . Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.

Saturday, March 4, 2017

The value of experience in criticizing research

It's becoming a trend: another guest blog post. This time, J.P. de Ruiter shares his view, which I happen to share, on the value of experience in criticizing research.

J.P. de Ruiter
Tufts University

One of the reasons that the scientific method was such a brilliant idea is that it has criticism built into the process. We don’t believe something on the basis of authority, but we need to be convinced by relevant data and sound arguments, and if we think that either the data or the argument is flawed, we say this. Before a study is conducted, this criticism is usually provided by colleagues, or in case of preregistration, reviewers. After a study is submitted, critical evaluations are performed by reviewers and editors. But even after publication, the criticism continues, in the form of discussions in follow-up articles, at conferences, and/or on social media. This self-corrective aspect of science is essential; hence criticism, even though at times it can be difficult to swallow (we are all human), is a very good thing.

We often think of criticism as pointing out flaws in the data collection, statistical analyses, and argumentation of a study. In methods education, we train our students to become aware of the pitfalls of research. We teach them about assumptions, significance, power, interpretation of data, experimenter expectancy effects, Bonferroni corrections, optional stopping, etc. etc. This type of training leads young researchers to become very adept at finding flaws in studies, and that is a valuable skill to have.  

While I appreciate that noticing and formulating the flaws and weaknesses in other people’s studies is a necessary skill for becoming a good critic (or reviewer), it is in my view not sufficient. It is very easy to find flaws in any study, no matter how well it is done. We can always point out alternative explanations for the findings, note that the data sample was not representative, or state that the study needs more power. Always. So pointing out why a study is not perfect is not enough: good criticism takes into account that research always involves a trade-off between validity and practicality. 

As a hypothetical example: if we review a study about a relatively rare type of aphasia, and notice that the authors have studied 7 patients, we could point out that a) in order to generalize their findings, they need inferential statistics, and b) in order to do that, given the estimated effect size at hand, they’d need at least 80 patients. We could, but we probably wouldn’t, because we would realize that it was probably hard enough to find 7 patients with this affliction to begin with, so finding 80 is probably impossible. So then we’d probably focus on other aspects of the study. We of course do keep in mind that we can’t generalize from the results in the study with the same level of confidence as in a lexical decision experiment with a within-subject design and 120 participants. But we are not going to say, “This study sucks because it had low power”. At least, I want to defend the opinion here that we shouldn’t say that.

While this is a rather extreme example, I believe that this principle should be applied at all levels and aspects of criticism. I remember that as a grad student, a local statistics hero informed me that my statistical design was flawed, and proceeded to require an ANOVA that was way beyond the computational capabilities of even the most powerful supercomputers available at the time. We know that full linear mixed models (LMMs) with random slopes and intercepts often do not converge. We know that many Bayesian analyses are intractable. In experimental designs, one runs into practical constraints as well. Many independent variables simply can’t be studied in a within-subject design. Phenomena that only occur spontaneously (e.g. iconic gestures) cannot be fully controlled. In EEG studies, it is not feasible to control for artifacts due to muscle activity, hence studying speech production is not really possible with this paradigm.

My point is: good research is always a compromise between experimental rigor, practical feasibility, and ethical considerations. To be able to appreciate this as a critic, it really helps to have been actively involved in research projects. Not only because that gives us more appreciation of the trade-offs involved, but also, perhaps more importantly, of the experience of really wanting to discover, prove, or demonstrate something. It makes us experience first-hand how tempting it can be, in Feynman’s famous formulation, to fool ourselves. I do not mean to say that we should become less critical, but rather that we become better constructive critics if we are able to empathize with the researcher’s goals and constraints. Nor do I want to say that criticism by those who have not yet had positive research experience is to be taken less seriously. All I want to say here is that (and why) having been actively involved in the process of contributing new knowledge to science makes us better critics.