Report

Evaluating replicability of laboratory experiments in economics

Science  25 Mar 2016:
Vol. 351, Issue 6280, pp. 1433-1436
DOI: 10.1126/science.aaf0918

Another social science looks at itself

Experimental economists have joined the reproducibility discussion by replicating selected published experiments from two top-tier journals in economics. Camerer et al. found that two-thirds of the 18 studies examined yielded replicable estimates of effect size and direction. This proportion is somewhat lower than unaffiliated experts were willing to bet in an associated prediction market, but roughly in line with expectations from sample sizes and P values.

Science, this issue p. 1433

Abstract

The replicability of some scientific findings has recently been called into question. To contribute data about replicability in economics, we replicated 18 studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. All of these replications followed predefined analysis plans that were made publicly available beforehand, and they all have a statistical power of at least 90% to detect the original effect size at the 5% significance level. We found a significant effect in the same direction as in the original study for 11 replications (61%); on average, the replicated effect size is 66% of the original. The replicability rate varies between 67% and 78% for four additional replicability indicators, including a prediction market measure of peer beliefs.

The deepest trust in scientific knowledge comes from the ability to replicate empirical findings directly and independently. Although direct replication is widely applauded (1), it is rarely carried out in empirical social science. Replication is now more important than ever, because the quality of results has been questioned in many fields, such as medicine (2–5), neuroscience (6), and genetics (7, 8). In economics, concerns about inflated findings in empirical (9) and experimental analyses (10, 11) have also been raised. In the social sciences, psychology has been the most active in both self-diagnosing the forces that create “false positives” and conducting direct replications (12–15). Several high-profile replication failures (16, 17) quickly led to changes in journal publication practices (18). The recent Reproducibility Project: Psychology (RPP) replicated 100 original studies published in three top journals in psychology. The vast majority (97) of the original studies reported “positive findings,” but in the replications, the RPP found a significant effect in the same direction for only 36% of these studies (19).

In this report, we provide insights into the replicability of laboratory experiments in economics. Our sample consists of all 18 between-subject laboratory experimental papers published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. The most important statistically significant finding, as emphasized by the authors of each paper, was chosen for replication (see section 1 of the supplementary materials and tables S1 and S2). We used replication sample sizes with at least 90% power (mean = 92%; median = 91%) to detect the original effect size at the 5% significance level. All of the replication and analysis plans were made public on the project website (supplementary materials, section 1) and were also sent to the original authors for verification.
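
As a rough illustration of the kind of power calculation involved (the actual tests and analysis plans differ across the 18 studies and are documented on the project website), the sketch below computes the sample size needed for 90% power when the original effect is summarized as a correlation coefficient; the example value r = 0.30 is hypothetical, and this is not the authors' code.

```python
# Minimal sketch (not the authors' code): sample size giving 90% power to
# detect a correlation r at the 5% significance level, via the Fisher z
# approximation. The actual tests differ across the 18 replicated studies.
import numpy as np
from scipy.stats import norm

def n_for_power(r, power=0.90, alpha=0.05):
    z_r = np.arctanh(r)                                 # Fisher z of the original effect
    z_crit = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # two-sided test assumed
    return int(np.ceil((z_crit / z_r) ** 2 + 3))        # SE of Fisher z is 1/sqrt(n - 3)

print(n_for_power(0.30))  # hypothetical original effect -> roughly 113 participants
```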

There are different ways of assessing replication, with no universally agreed-upon standard of excellence (19–23). We present results for the same replication indicators that were used in the RPP (19). As our first indicator of replication, we used a “significant effect in the same direction as in the original study” [Gelman and Stern (20) discuss the challenges of comparing significance levels across experiments].

The results of the replications are shown in Fig. 1A and table S1. We found a significant effect in the same direction as in the original study for 11 replications (61.1%). This is considerably lower than the replication rate of 92% (mean power) that would be expected if all original effects were true and accurately estimated (one-sample binomial test, P < 0.001).
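
This comparison can be reproduced from the published counts with a one-sample binomial test (11 successes out of 18 against the 92% mean power); the sketch below is not the authors' code, and the one-sided alternative is an assumption.

```python
# Minimal sketch: is 11 successful replications out of 18 consistent with the
# 92% rate expected if all original effects were true and accurately estimated?
from scipy.stats import binomtest  # requires SciPy >= 1.7

result = binomtest(k=11, n=18, p=0.92, alternative="less")  # one-sided alternative assumed
print(f"observed rate = {11 / 18:.1%}, P = {result.pvalue:.5f}")
```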

Fig. 1 Replication results.

(A) Plotted are 95% CIs of replication effect sizes (standardized to correlation coefficients). The standardized effect sizes are normalized so that 1 equals the original effect size (fig. S1 shows a nonnormalized version). Eleven replications have a significant effect in the same direction as in the original study [61.1%; 95% CI = (36.2%, 86.1%)]. The 95% CI of the replication effect size includes the original effect size for 12 replications [66.7%; 95% CI = (42.5%, 90.8%)]; if we also include the study in which the entire 95% CI exceeds the original effect size, this increases to 13 replications [72.2%; 95% CI = (49.3%, 95.1%)]. AER denotes the American Economic Review and QJE denotes the Quarterly Journal of Economics. (B) Meta-analytic estimates of effect sizes, combining the original and replication studies. Plotted are 95% CIs of combined effect sizes (standardized to correlation coefficients). The standardized effect sizes are normalized as in (A) (fig. S1 shows a nonnormalized version). Fourteen studies have a significant effect in the same direction as the original study in the meta-analysis [77.8%; 95% CI = (56.5%, 99.1%)].

A complementary method for assessing replicability is to test whether the 95% confidence interval (CI) of the replication effect size includes the original effect size (19) [Cumming (21) discusses the interpretation of CIs for replications]. This is the case in 12 of our replications (66.7%). If we also include the study in which the entire 95% CI exceeds the original effect size, the number of replicable studies increases to 13 (72.2%). An alternative measure, which acknowledges sampling error in both the original study and the replications, is to count how many replicated effects lie in a 95% “prediction interval” (24). This count is higher (83.3%) and increases to 88.9% if we also include the replication whose effect size exceeds the upper bound of the prediction interval (fig. S2 and supplementary materials, section 2).
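
For intuition, the coverage check can be sketched with a Fisher-z confidence interval for the replication correlation. The numbers below are placeholders, and the authors' exact CI construction is described in the supplementary materials.

```python
# Minimal sketch (assumed CI construction, placeholder numbers): does the 95%
# CI of a replication correlation cover the original effect size?
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, level=0.95):
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    half = norm.ppf(0.5 + level / 2) * se
    return np.tanh(z - half), np.tanh(z + half)

r_original, r_replication, n_replication = 0.45, 0.30, 210  # placeholders only
lo, hi = fisher_ci(r_replication, n_replication)
print(f"replication 95% CI = ({lo:.3f}, {hi:.3f}); covers original: {lo <= r_original <= hi}")
```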

The mean standardized effect size (correlation coefficient, r) of the replications is 0.279, compared with 0.474 in the original studies (fig. S3). This difference is significant [Wilcoxon signed-rank test; z = –2.98, P = 0.003, n = 18]. The replicated effect sizes tend to be of the same sign as the original ones but not as large. The mean relative effect size of the replications is 65.9%.
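
The paired comparison is a Wilcoxon signed-rank test on the 18 pairs of standardized effect sizes; the sketch below uses placeholder values (the actual effect sizes are tabulated in table S1), and the ratio-based relative effect size shown is one possible definition.

```python
# Minimal sketch (not the authors' code): paired comparison of original and
# replication effect sizes with a Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
r_original = rng.uniform(0.2, 0.7, size=18)                  # placeholder values only;
r_replication = r_original * rng.uniform(0.3, 1.1, size=18)  # real values are in table S1

stat, p = wilcoxon(r_original, r_replication)
print(f"Wilcoxon signed-rank: W = {stat:.1f}, P = {p:.3f}")
print(f"mean relative effect size = {np.mean(r_replication / r_original):.1%}")
```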

The original and replication studies can also be combined in a meta-analytic estimate of the effect size (19). As shown in Fig. 1B, in the meta-analysis, 14 studies (77.8%) have a significant effect in the same direction as in the original study. These results should be interpreted cautiously, because the estimates assume that the results of the original studies do not have publication or reporting biases.
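
One standard way to form such a combined estimate, sketched below, is a fixed-effect model on Fisher-z-transformed correlations with inverse-variance weights; the model choice is an assumption here, the inputs are hypothetical, and the authors' procedure is documented in the supplementary materials.

```python
# Minimal sketch (assumed fixed-effect model, hypothetical inputs): combining an
# original and a replication correlation into one meta-analytic estimate.
import numpy as np
from scipy.stats import norm

def combine(r_orig, n_orig, r_rep, n_rep):
    z = np.arctanh([r_orig, r_rep])            # Fisher z of both estimates
    w = np.array([n_orig - 3.0, n_rep - 3.0])  # inverse-variance weights
    z_comb = np.sum(w * z) / np.sum(w)
    se_comb = 1.0 / np.sqrt(np.sum(w))
    p = 2 * norm.sf(abs(z_comb) / se_comb)     # two-sided P value
    return np.tanh(z_comb), p

r_meta, p_meta = combine(r_orig=0.45, n_orig=120, r_rep=0.30, n_rep=210)
print(f"combined r = {r_meta:.3f}, P = {p_meta:.4f}")
```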

To measure peer beliefs about the replicability of original results, we set up prediction markets before the 18 replications were performed (25). Dreber et al. (26), in a recent study that presented evidence for a subset of the replications in the RPP, proposed the use of prediction markets as an additional replicability indicator. In the prediction market for a particular target study, peers who were likely to be familiar with experimental methods in economics could buy or sell shares whose monetary value depended on whether the target study was replicated (fig. S4 and tables S1 and S2). The prediction markets produce a collective market probability of replication (27) that can be interpreted as a replicability indicator (26). The traders’ (n = 97) survey beliefs about replicability were also collected before market trading as an additional measure of peer beliefs.

The average prediction market belief is a replication rate of 75.2%, and the average survey belief is 71.1% (Fig. 2, fig. S5, and tables S3 and S4). Both are higher than the observed replication rate of 61.1%, but neither difference is significant (supplementary materials, section 5). The prediction market beliefs and the survey beliefs are highly correlated, and both are positively correlated with the ranked degree of replication success, although the correlation does not reach significance for the prediction market beliefs (Fig. 2 and fig. S6). Contrary to the findings of Dreber et al. (26), prediction market beliefs are not a more accurate indicator of replicability than survey beliefs.

Fig. 2 Prediction market and survey beliefs.

A plot of prediction market beliefs and survey beliefs, in relation to whether the original result was replicated with P < 0.05 in the original direction. The mean prediction market belief in a successful replication is 75.2% [range, 59% to 94%; 95% CI = (69.7%, 80.6%)], and the mean survey belief is 71.1% [range, 54% to 86%; 95% CI = (66.4%, 75.8%)]. The prediction market beliefs and survey beliefs are highly correlated (Spearman correlation coefficient = 0.79, P < 0.001, n = 18). Both the prediction market beliefs (Spearman correlation coefficient = 0.30, P = 0.232, n = 18) and the survey beliefs (Spearman correlation coefficient = 0.52, P = 0.028, n = 18) are positively correlated with the ranked degree of replication success.

We also tested whether replicability is correlated with two observable characteristics of published studies: the P value and the sample size (number of participants) of the original study. These two characteristics are likely to be correlated with each other, which is the case for our 18 studies (Spearman correlation coefficient = –0.61, P = 0.007, n = 18). We expected replicability to be negatively correlated with the original P value and positively correlated with the sample size, because the risk of false positives increases with the original P value and decreases with the original sample size (statistical power) (6, 11). The correlations are presented in Fig. 3 and table S5, and the results are in line with our expectations. The correlations are typically around 0.5 in the expected direction and significant. Only one of the eight studies with an original P value < 0.01 was not replicable at the 5% level in the original direction.
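
These are ordinary Spearman rank correlations between characteristics of the original studies and the replication outcomes; the sketch below uses placeholder data only (the real values are in tables S1 and S5) and is not the authors' code.

```python
# Minimal sketch (not the authors' code, placeholder data): rank correlations
# between original-study characteristics and a binary replication indicator.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
original_p = rng.uniform(0.001, 0.05, size=18)         # placeholder P values
original_n = rng.integers(60, 400, size=18)            # placeholder sample sizes
replicated = (rng.uniform(size=18) < 0.6).astype(int)  # 1 = replicated at P < 0.05

rho_p, pval_p = spearmanr(original_p, replicated)
rho_n, pval_n = spearmanr(original_n, replicated)
print(f"original P value vs. replication: rho = {rho_p:.2f} (P = {pval_p:.3f})")
print(f"original sample size vs. replication: rho = {rho_n:.2f} (P = {pval_n:.3f})")
```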

Fig. 3 Correlations between P values and sample sizes in original studies and replicability indicators.

(A) The original P value is negatively correlated with all six replicability indicators, and five of these correlations are significant. (B) The original sample size is positively correlated with all six replicability indicators, and five of these correlations are significant. Spearman correlation coefficients are shown on the vertical axes. *P < 0.05; **P < 0.01.

We report the first systematic replications of laboratory experiments in economics, with the aim of contributing much-needed data to the larger question of the replicability of empirical findings in all areas of science. The results provide provisional answers to two questions: (i) Are laboratory experiments in economics generally replicable, and (ii) do statistical measures of research quality, including peer beliefs about replicability, help predict which studies will be replicable?

The provisional answer to the first question is that, based on this sample of experiments, replication is generally possible, although there is room for improvement. Eleven out of 18 (61.1%) studies were replicable with P < 0.05 in the original direction, and three more studies were relatively close to being replicated (all have significant effects in the meta-analysis). Four replications (22.2%) had effect sizes close to zero, somewhat more than the 1.4 replication failures expected by pure chance (given the mean power of 92%). Moreover, the original effect sizes in the studies that we replicated could have been inflated, a phenomenon that could stem from publication bias (28). If there is publication bias, our prospective power analyses will have overestimated the replication power.
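
For reference, the 1.4 figure is simply the number of false-negative failures expected if every original effect were true and each replication had roughly the 92% mean power:

```python
# Minimal sketch: expected replication failures under the 92% mean power,
# assuming every original effect is true.
n_studies, mean_power = 18, 0.92
print(f"expected failures = {n_studies * (1 - mean_power):.1f}")  # 1.4
```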

The answer to the second question is that peer surveys and market beliefs did contain some information about which experiments were more likely to replicate, but sample sizes and P values in the original studies were even more strongly correlated with replicability (Fig. 3).

To learn from successes and failures in different scientific fields, it is useful to compare our results with recent results from studies of robustness in experimental psychology and empirical economics. Our results can be compared with those of the recent RPP in the psychological sciences (19), which was also accompanied by prediction market beliefs and survey beliefs (26). All measures of replication success are somewhat higher for the economics experiments than for the sampled psychology experiments (Fig. 4). Peer beliefs in our study are also significantly higher than in the RPP study (Fig. 4). Acknowledging the limits of this two-study comparison, and particularly our small sample of 18 replications, there appears to be some difference in replication success between these fields. However, it is premature to draw strong conclusions about disciplinary differences; other methodological factors could potentially explain why the replication rates differed. For example, in the RPP replications, interaction effects were less likely to be replicable than main or simple effects (19).

Fig. 4 A comparison of replicability indicators in experimental economics (this study) and psychological sciences (RPP).

The graph shows means ± SE for replicability indicators. All six replicability indicators are higher for experimental economics; this difference is significant for three of the replicability indicators. The average difference in replicability across the six indicators is 19 percentage points. Details about the statistical tests are included in the supplementary materials. *P < 0.05; **P < 0.01.

In economics, several studies have shown that statistical findings from nonexperimental data are not always easy to replicate (29). Two studies of macroeconomic findings, reported in the Journal of Money, Credit and Banking in 1986 and 2006, respectively, found that only 13% and 23% of original results were replicable, even when the data and code were easily accessible (30, 31). An analysis of 50,000 P values reported between 2005 and 2011 in three widely cited general economics journals found that P values between 0.10 and 0.25 were less common than might be expected (32). However, the frequency of these “missing” P values is smaller in laboratory and field experiments. Taken together, these analyses and our replication sample suggest that laboratory experiments are at least as robust as, and perhaps more robust than, other kinds of empirical economics.

Two methodological research practices in laboratory experimental economics may contribute to relatively high replication success. First, experimental economists have strong norms about motivating subjects with substantial financial incentives and avoiding the use of deception. These norms make subjects more responsive and may reduce variability in how experiments are performed across different research teams, thereby improving replicability. Second, pioneering experimental economists were eager for others to adopt their methods; to this end, they persuaded journals to print instructions and even original data. These editorial practices created norms of transparency and have made replication and reanalysis relatively easy.

There is every reason to be optimistic that science in general, and social science in particular, will emerge much improved after the current period of critical self-reflection. Our study suggests that laboratory experiments published in top economic journals have relatively high rates of replicability. Challenges still remain: For example, executing replications can be laborious, even when scientific journals require online posting of data and computer code to make things easier. This is a reminder that as scientists, we should design and document our methods to anticipate replication and make it easy to do. Our results also show that there is some information in post-publication peer beliefs (revealed in both markets and surveys), and perhaps even more information in simple statistics from published results, about whether studies are likely to be replicable. All of these developments suggest that the cultivation of good professional norms, discouragement of bad norms, policing of disclosure requirements by journals, and simple evidence-based editorial policies can improve scientific replicability, perhaps very quickly.

Supplementary Materials

www.sciencemag.org/content/351/6280/1433/suppl/DC1

Materials and Methods

Figs. S1 to S6

Tables S1 to S5

References (51–66)

References and Notes

ACKNOWLEDGMENTS: For financial support, we thank the Austrian Science Fund (START grant Y617-G11), the Austrian National Bank (grant OeNB 14953), the Behavioral and Neuroeconomics Discovery Fund (grant to C.F.C.), the Jan Wallander and Tom Hedelius Foundation (grants P2015-0001:1 and P2013-0156:1), the Knut and Alice Wallenberg Foundation (Wallenberg Academy Fellows grant to A.D.), the Swedish Foundation for Humanities and Social Sciences (grant NHS14-1719:1), and the Sloan Foundation (grant G-2015-13929). We thank the following experimental laboratories for kindly allowing us to use them for replication experiments: the Center for Behavioral Economics at the National University of Singapore, the Center for Neuroeconomics Studies at Claremont Graduate University, the Frankfurt Laboratory for Experimental Economic Research, the Harvard Decision Science Laboratory, the Innsbruck EconLab, and the Nuffield Centre for Experimental Social Sciences. We thank the following persons for assistance with the experiments: J. Barraza, A. Berge, R. Bhui, A. Born, N. Cohodes, H. K. Dat, C. Dohmen, Z. Faiayd, M. Heissel, A. Henderson, G. Mansur, J. Preussler, L. Schultze, G. Thoelen, and E. Warner. The data reported in this paper are tabulated in tables S1, S3, and S4, and the replication reports, analyses code, and the data from the replications are available at www.experimentaleconreplications.com and at Open Science Framework (osf.io/bzm54). The authors report no potential conflicts of interest. No material transfer agreements, patents, or patent applications apply to methods or data in the paper. C.F.C., A.D., J.H., T.-H.H., M.J., and M.K. designed the research; C.F.C., A.D., E.F., J.H., T.-H.H., M.J., and M.K. wrote the paper; E.F., J.A., T.C., T.-H.H., and T.P. helped design the prediction market part; E.F., F.H., J.H., M.K., M.R., T.P., and H.W. analyzed data; A.A., E.H., F.H., T.I., S.I., G.N., M.R., and H.W. carried out the replications (including re-estimating the original estimate with the replication data); and all authors approved the final manuscript.