The Hidden Moderator Part I – Do we pay sufficient attention to piloting in replication studies?

Blog post written by Hans IJzerman & Ivan Ropovik

What constitutes a successful replication (1)? The question seems easy, but the correct answer has been rather controversial. It has spurred intense debates about hidden moderators (Most Psychology Twitter Users, 2016-2017) and context sensitivity, often to the point of escalation. One prominent example is the Reproducibility Project: Psychology (RP:P), which successfully replicated 36% of 100 psychology experiments (2). The main criticisms concerned whether the original studies had been successfully redesigned to test the original ideas. In this blog post we discuss the importance of piloting by reporting on the first pilot study of our renewed replication attempt of Förster et al. (2008), which is part of a follow-up project to RP:P: Many Labs 5.

Does evaluating psychological research require more than attention to the N?

One frequent criticism of replication research is that “replicators” seek to repeat the design of the original study, rather than paying attention to the psychological mechanism being tested. Both Schwarz and Clore (2016) and Gilbert et al. (2016) argue that insufficient attention is devoted to the psychological mechanism under study. Schwarz and Clore (2016), for example, state that the “implementation of independent variables is frequently a mix of earlier results and personal intuition, further highlighting the need for sensible manipulation checks and converging evidence across different manipulations”.

The oft-heard response to this criticism is that any replication attempt can be declared inadequate post hoc, and that invoking so-called “hidden moderators”, regardless of how likely other explanations of the replication result may be, makes falsification of the original study nearly impossible. We are sympathetic to both of these views. However, for a scientifically valid theory, “personal intuition” (or undisclosed auxiliary assumptions; Earp & Trafimow, 2015) is insufficient (3). This means that in past (original) research, designs and measures have usually fallen short of specifying the preconditions required to successfully test psychological theories.

Brandt et al.’s (2014) Replication Recipe provides one solution, as they discuss the worth of original materials and how we can determine what constitutes a close versus conceptual replication. In stating that materials for a replication study can be adapted, they stipulate that manipulations and measures should be piloted prior to the actual replication attempt (and that the assumptions regarding the meaning of the stimuli should be clearly specified a priori). These differences can be captured in a pre-registered Replication Recipe, integrated into the Open Science Framework (4).

Many Labs 5: A Follow Up to the RP:P

But why change the original materials to fit a new context? Schwarz and Clore (2016) correctly pointed out that in RP:P “11 replications used procedures that the original authors considered inappropriate prior to data collection; 10 of them failed (5).” This led a group of researchers, led by Charlie Ebersole, to design a follow-up in which the main question is whether protocols approved by original authors (or similarly qualified experts) would lead to greater replication success than non-approved protocols.

One of the failed replications where the original authors disagreed with the protocol was the study by Förster et al. (2008), for which we have designed a second follow-up replication. In the original paper, the authors found that after priming a concept (such as “aggressive”), people either assimilated that concept into their social judgments (e.g., rated a person as more aggressive than in a control condition) or contrasted their judgment away from the prime (e.g., rated the person as less aggressive than in a control condition). In this original study, people solved a word puzzle priming aggressiveness (vs. control), followed by a global (vs. local) prime in which they attended to a larger map (vs. details of the map), after which they rated a character in a scenario on aggressiveness. A close replication by Reinhard (2015) could not detect the same effects.

The original authors provided post-hoc criticism, directed at the efficacy of the prime (applicability) and the ambiguity of the scenario (target ambiguity). And they seemed to be right: the prime did not work, and the scenarios were not regarded as ambiguous by the participants in the replication study. These auxiliary assumptions have thus now been turned into falsifiable predictions.

Putting the idea to the test: Target Ambiguity

To put the authors’ theory to the test, we designed a set of follow-up studies to replicate the original studies. Specifically, we set out to conduct three studies, in which we 1) pilot the scenarios for target ambiguity, 2) test the efficacy of the aggression prime, and 3) test the combination of the aggression (vs. control) prime and the global (vs. local) prime. We developed a new protocol that was approved by the original authors. We will test this new protocol against the replication protocol developed by Reinhard (2015) (6).

We developed a protocol to specify a number of things, the first of which was “target ambiguity”. The original authors criticized the previous replication attempt because the target was not sufficiently ambiguous. Our follow-up study spans six countries (Slovakia, Germany, United States, United Kingdom, Poland, and Brazil) at nine universities (7). To generate scenarios that were likely to meet the goal of “target ambiguity”, we let each site generate 3-4 scenarios. Out of 26 scenarios in total, we selected 21 to pilot in this first round (8). We agreed with the original authors that we would select scenarios whose mean ratings are not significantly different from the neutral point of the scale (9,10).
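
The selection rule can be illustrated with a minimal sketch, assuming a one-sample t-test of each scenario's aggressiveness ratings against the scale's neutral point. The function names, the 1-7 scale with neutral point 4, the ratings, and the critical value are all hypothetical and chosen only for illustration; they are not taken from the pilot protocol or its data.

```python
import math
import statistics


def t_against_neutral(ratings, neutral_point):
    """One-sample t statistic of the mean rating against the neutral point."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    sd = statistics.stdev(ratings)  # sample standard deviation (n - 1)
    if sd == 0:
        # All ratings identical: no deviation if they sit on the neutral
        # point, otherwise an infinitely large deviation.
        if mean == neutral_point:
            return 0.0
        return math.copysign(math.inf, mean - neutral_point)
    return (mean - neutral_point) / (sd / math.sqrt(n))


def select_ambiguous(scenarios, neutral_point, t_crit):
    """Keep scenarios whose mean does not differ significantly from neutral.

    t_crit is the two-sided critical t value for the chosen alpha and
    df = n - 1; it is supplied by the caller (an assumption of this sketch).
    """
    return [
        name
        for name, ratings in scenarios.items()
        if abs(t_against_neutral(ratings, neutral_point)) < t_crit
    ]


# Hypothetical ratings on a 1-7 scale (neutral point 4), 8 raters each.
pilot = {
    "scenario_a": [3, 4, 5, 4, 4, 3, 5, 4],
    "scenario_b": [6, 7, 6, 5, 7, 6, 6, 7],
}
# 2.365 is the two-sided critical t for alpha = .05 and df = 7.
selected = select_ambiguous(pilot, neutral_point=4, t_crit=2.365)
```

A nonsignificant difference is weak evidence of ambiguity on its own, which is one reason to pair such a rule with a stricter criterion when too many scenarios pass.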

So far, we have finished the first round of pilots for 7 out of 9 sites. The results from the first pilot indeed show that meeting the target ambiguity criterion means selecting different scenarios in different countries (full reports of results available here). Additionally, because we did not obtain a sufficient number of scenarios for one site, we will redo one pilot. Our Brazilian participants at the University of Fortaleza perceived all scenarios as more aggressive than the neutral point (and we suspect that Brazilians are more sensitive to aggression than participants at the other sites because of honor concerns, see e.g., Cohen et al., 1999; Vandello & Cohen, 2003). We will thus rerun the Brazilian pilot with slightly less aggressive scenarios.

In Conclusion

What have we learnt so far? We suspect that the authors were right in their criticism that target ambiguity should be more carefully assessed prior to starting the replication study. And yes, Brandt et al. (2014) were right that, in going from one (original) study to the next (replication), careful piloting is required to be able to test the original theory. At the same time, we think that most prior work has been suboptimal in specifying – a priori – what the necessary preconditions are to successfully test a theory. Much of such piloting has not been done explicitly, or its importance has not been communicated or defined. We hope that our current set of pilot and replication studies contributes to incorporating piloting as an integral part of replication projects. Ultimately, this should reduce criticisms about whether (original and replication) studies are done correctly, and it should also help reduce the blaming of replication failures on hidden moderators as post-hoc explanations. That being said, our current set of pilots does not (yet) say anything about whether Förster et al.’s (2008) original theory was correct, but our pilots will help us increase the diagnostic value of our renewed replication attempt.



  1. The question of what constitutes a successful replication is likely as difficult to answer as the question of what constitutes a successful original experiment. We thus want to be clear that the steps we outline in this blog post are similarly necessary for conducting original experiments, and previous work has fallen short (see also for helpful tips on how to make original research more reproducible).
  2. There has been considerable discussion about this percentage, which also points to the difficulties of conducting replications (and doing science, more generally). First, it is often reported that only 36% of the experiments could be reproduced. This is incorrect, as the project provides insufficient information on whether it is possible to reproduce the studies (a variety of factors, including failure on the side of the replicators, can cause a failed replication). The present project aims to address this potential problem (i.e., reducing the chance that a replication fails because of failure to capture the original mechanism). Second, the percentage that was replicated is also erroneously interpreted, both by replicators and by commentators on RP:P. Based on the overlap of confidence and prediction intervals, Gilbert et al. (2016) and Patil, Peng, and Leek (2016) suggest that the percentage of successfully replicated studies is much higher, because as many as 77% of the replication effect sizes fell within a 95% prediction interval based on the original effect sizes. This can easily lead to overly optimistic and misleading interpretations: Due to low statistical power in the original studies and the wide expected variability associated with reported estimates, original studies rarely make falsifiable effect size predictions (Morey & Lakens, 2016; see also Etz & Vandekerckhove, 2016). While the approach of RP:P commentators renders a replication successful if the result is within the confidence interval of the original study, the replicators considered a replication to have failed if the original effect size was outside of the replication’s confidence interval.
     So far, the account that we think is most accurate is Simonsohn’s (2016), who stated that “the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 were inconclusive”, avoiding inappropriate dichotomous decisions about the success of the replication.
  3. In a recent replication attempt, we also ran into a similar issue where a neglected auxiliary assumption is likely important for a replication to succeed. Specifically, Szymkow et al. (2013) found that priming participants with communal (vs. agentic) traits led to higher temperature perceptions. This seemed not to replicate in Ebersole et al. (2016). After re-analyzing the data, IJzerman et al. (2016) concluded that Ebersole et al.’s (2016) lab temperatures were too high to detect the original effect, and that the effect did replicate under lower lab temperatures with temperature estimates very comparable to the original study. This means that the conditions were insufficiently specified in the original paper to achieve a successful replication. In the empirical cycle, this latter analysis can now be considered a post-hoc interpretation and should then again be tested in follow-up research.
  4. Note that we also understand that exactly because we have always valued innovation over solidity, we – as the field – have not paid attention to factors determining replication, and this lack of attention – together with a pretty solid lack/lag of integrating new technologies (Spellman, 2016) – is the main reason that we have a pretty bad specification of which factors help determine successful replications. This specification probably requires a long cycle of replications in which some theories will become more accurately specified, and others will fall (cf. IJzerman et al., 2013).
  5. This might not fully be correct. Original authors did not provide explicit endorsement of protocols in RP:P. Rather, the replication teams, prior to running their study, provided their assessment as to whether or not the original authors endorsed the protocol. Nonetheless, these studies can be characterized as those where concerns raised by original authors about the study could not be resolved prior to data collection.
  6. There is one additional point to be made about replicability, and that is the issue of choosing the right manipulations and the right measurements. Most social science, and perhaps psychology in particular, suffers from poor measurement and poor sampling of experimental stimuli. What we have not discussed so far is that we also slightly changed the design of this study (with approval of the original authors). In recent work, Westfall et al. (2015) justifiably point out that – just like participants – stimuli have an inherent variation as they are sampled from a larger “universe” of stimuli (see also Fiedler, 2000). In our second pilot, we randomly present 1 out of 4 selected scenarios to our participants and we will do the same for the main study; we will discuss the merits of both stimulus sampling and measurement when we report on our second pilot.
  7. The team that has contributed to this project consists of Tiago Lima, Luana Souza (both University of Fortaleza), Michal Bialek, Przemysław Sawicki, Łukasz Markiewicz, Katarzyna Gawryluk (all Kozminski University), Sarah Novak (Hofstra University), Sue Kraus, Natasha Lewis (both Fort Lewis College), Ivan Ropovik, Peter Banincak, Gabriel Banik (all University of Presov), Daniel Wolf, Astrid Schuetz (both Otto Friedrich Universität Bamberg), Leanne Boucher Gill, Timothy Razza, Madhavi Menon, Weylin Sternglanz, Matt Collins (all Nova Southeastern University), Sophia Weissgerber (Kassel University), and Gavin Sullivan (Coventry University).
  8. The original study’s effect size suggested that we would need only 31 participants to detect the interaction effect. This should mean that even fewer participants would be necessary in a pilot. However, to be sure, we specified that (at least) 50 participants would take part in our first pilot.
  9. In case this criterion yields more than the specified number of scenarios, we added a more stringent criterion: we would choose the scenarios that provide the highest relative evidence in favor of the null hypothesis (Bayes factor).
  10. It should be noted that the original authors have been extremely generous in donating their time to the replication and have been very cooperative in helping us to set up this replication attempt.
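
Footnote 2 mentions 95% prediction intervals based on original effect sizes. The following is a minimal sketch of one common construction for correlation effect sizes, using the Fisher z transformation and the sampling error of both the original and the replication study. It is an illustration under those assumptions, not necessarily the exact procedure used in the cited papers; the function names and the example numbers are hypothetical.

```python
import math


def prediction_interval_r(r_orig, n_orig, n_rep, z_crit=1.96):
    """Approximate 95% prediction interval for a replication correlation.

    Works on the Fisher z scale, where the standard error of a correlation
    from n observations is 1 / sqrt(n - 3). The interval combines the
    sampling error of the original and of the replication study.
    """
    z_orig = math.atanh(r_orig)
    se = math.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))
    lower = math.tanh(z_orig - z_crit * se)
    upper = math.tanh(z_orig + z_crit * se)
    return lower, upper


def falls_within(r_rep, r_orig, n_orig, n_rep):
    """True if the replication correlation lies inside the interval."""
    lower, upper = prediction_interval_r(r_orig, n_orig, n_rep)
    return lower <= r_rep <= upper


# Hypothetical example: a small original study (n = 28, r = .50)
# and a larger replication (n = 103).
low, high = prediction_interval_r(0.50, n_orig=28, n_rep=103)
```

With a small original sample the interval is wide, so a replication effect of less than half the original size still counts as consistent with it. This is the sense in which low-powered original studies rarely make falsifiable effect size predictions.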