Publication bias in the social sciences: Unlocking the file drawer


Science  19 Sep 2014:
Vol. 345, Issue 6203, pp. 1502-1505
DOI: 10.1126/science.1255484

The file drawer is full. Should we worry?

Experiments that produce null results face a higher barrier to publication than those that yield statistically significant differences. Whether this is a problem depends on how many null but otherwise valid results might be trapped in the file drawer. Franco et al. use the Time-sharing Experiments in the Social Sciences archive of nearly 250 peer-reviewed proposals for social science experiments conducted on nationally representative samples. They find that only 10 out of 48 null results were published, whereas 56 out of 91 studies with strongly significant results made it into a journal.
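The headline gap follows directly from the counts quoted above. A minimal sketch (the counts are taken from this summary; the variable names are illustrative):

```python
# Publication rates from the counts quoted above:
# 10 of 48 null-result studies published vs. 56 of 91 strong-result studies.
null_published, null_total = 10, 48
strong_published, strong_total = 56, 91

null_rate = null_published / null_total        # ~0.21
strong_rate = strong_published / strong_total  # ~0.62

# Gap in percentage points, matching the ~40-point difference reported below.
gap_pp = (strong_rate - null_rate) * 100
```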

Science, this issue p. 1502


We studied publication bias in the social sciences by analyzing a known population of conducted studies—221 in total—in which there is a full accounting of what is published and unpublished. We leveraged Time-sharing Experiments in the Social Sciences (TESS), a National Science Foundation–sponsored program in which researchers propose survey-based experiments to be run on representative samples of American adults. Because TESS proposals undergo rigorous peer review, the studies in the sample all exceed a substantial quality threshold. Strong results are 40 percentage points more likely to be published than are null results and 60 percentage points more likely to be written up. We provide direct evidence of publication bias and identify the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.

Publication bias occurs when “publication of study results is based on the direction or significance of the findings” (1). One pernicious form of publication bias is the greater likelihood of statistically significant results being published than statistically insignificant results, holding fixed research quality. Selective reporting of scientific findings is often referred to as the “file drawer” problem (2). Such a selection process increases the likelihood that published results reflect type I errors rather than true population parameters, biasing effect sizes upwards. Further, it constrains efforts to assess the state of knowledge in a field or on a particular topic because null results are largely unobservable to the scholarly community.

Publication bias has been documented in various disciplines within the biomedical (3–9) and social sciences (10–17). One common method of detecting publication bias is to replicate a meta-analysis with and without unpublished literature (18). This approach is limited because much of what is unpublished is unobserved. Other methods solely examine the published literature and rely on assumptions about the distribution of unpublished research by, for example, comparing the precision and magnitude of effect sizes among a group of studies. In the presence of publication bias, smaller studies report larger effects in order to exceed arbitrary statistical significance thresholds (19, 20). However, these visualization-based approaches are sensitive to using different measures of precision (21, 22) and also assume that outcome variables and effect sizes are comparable across studies (23). Last, methods that compare published studies to “gray” literatures (such as dissertations, working papers, conference papers, or human subjects registries) may confound strength of results with research quality (7). These techniques are also unable to determine whether publication bias occurs at the editorial stage or during the writing stage. Editors and reviewers may prefer statistically significant results and reject sound studies that fail to reject the null hypothesis. Anticipating this, authors may not write up and submit papers that have null findings. Or, authors may have their own preferences to not pursue the publication of null results.

A different approach involves examining the publication outcomes of a cohort of studies, either prospectively or retrospectively (24, 25). Analyses of clinical registries and abstracts submitted to medical conferences consistently find little to no editorial bias against studies with null findings (26–31). Instead, failure to publish appears to be most strongly related to authors’ perceptions that negative or null results are uninteresting and not worthy of further analysis or publication (32–35). One analysis of all institutional review board–approved studies at a single university over 2 years found that a majority of conducted research was never submitted for publication or peer review (36).

Surprisingly, similar cohort analyses are much rarer in the social sciences. There are two main reasons for this lacuna. First, there is no process in the social sciences of preregistering studies comparable with the clinical trials registry in the biomedical sciences. Second, even if some unpublished studies could be identified, there are likely to be substantial quality differences between published and unpublished studies that make them difficult to compare. As noted, previous research attempted to identify unpublished results by examining conference papers and dissertations (37) and human subjects registries of single institutions (36). However, such techniques may produce unrepresentative samples of unpublished research, and the strength of the results may be confounded with research quality. Conference papers, for example, do not undergo a similar process of peer review as do journal articles in the social sciences and therefore cannot be used as a comparison set. This work is distinctive in the study of publication bias in the social sciences in that we analyzed a known population of conducted studies, and all studies in the population exceed a substantial quality threshold.

We leveraged TESS (Time-sharing Experiments in the Social Sciences), a National Science Foundation–sponsored program established in 2002 in which researchers propose survey-based experiments to be run on nationally representative samples. These experiments typically embed some randomized manipulation (such as visual stimulus or question wording difference) within a survey questionnaire. Researchers apply to TESS, which then submits the proposals to peer review and distributes grants on a competitive basis (38). Our basic approach is to compare the statistical results of TESS experiments that eventually got published with the results of those that remain unpublished.

This analytic strategy has many advantages. First, we have a known population of conducted studies and therefore have a full accounting of what is published and unpublished. Second, TESS proposals undergo rigorous peer review, meaning that even unpublished studies exceed a substantial quality threshold before they are conducted. Third, nearly all of the survey experiments were conducted by the same high-quality survey research firm (Knowledge Networks, now known as GfK Custom Research), which assembles probability samples of Internet panelists by recruiting participants via random digit dialing and address-based sampling. Thus, there is remarkable similarity across studies with respect to how they were administered, allowing for comparability. Fourth, TESS requires that studies have requisite statistical power, meaning that the failure to obtain statistically significant results is not simply due to insufficient sample size.

One potential concern is that TESS studies may be unrepresentative of social science research, especially scholarship based on nonexperimental data. Although TESS studies are clearly not a random sample of the research conducted in the social sciences, it is unlikely that publication bias is less severe than what is reported here. The baseline probability of publishing experimental findings based on representative samples is likely higher than that of observational studies using “off-the-shelf” data sets or experiments conducted on convenience samples in which there is lower “sunk cost” involved in obtaining the data. Because the TESS data were collected at considerable expense—in terms of time to obtain the grant—authors should, if anything, be more motivated to attempt to publish null results.

The initial sample consisted of the entire online archive of TESS studies as of 1 January 2014 (39). We analyzed studies conducted between 2002 and 2012. We did not track studies conducted in 2013 because there had not been enough time for the authors to analyze the data and proceed through the publication process. The 249 studies represent a wide range of social science disciplines (Table 1). Our analysis was restricted to 221 studies, 89% of the initial sample. We excluded seven studies published in book chapters and 21 studies for which we were unable to determine the publication status and/or the strength of experimental findings (40). The full sample of studies is presented in Table 2; the bolded entries represent the analyzed subsample of studies.

Table 1 Distribution of studies across years and disciplines.

Field coded based on the affiliation of the first author. “Other” category includes Business, Computer Science, Criminology, Education, Environmental Studies, Journalism, Law, and Survey Methodology.

Table 2 Cross-tabulation between statistical results of TESS studies and their publication status.

Entries are counts of studies by publication status and results. Bolded entries indicate observations included in the final sample for analysis (40). Results are robust to the inclusion of book chapters (table S7).


The outcome of interest is the publication status of each TESS experiment. We took numerous approaches to determine whether the results from each TESS experiment appeared in a peer-reviewed journal, book, or book chapter. We first conducted a thorough online search for published and unpublished manuscripts and read every manuscript in order to verify that it relied on data collected through TESS and that it reported experimental results (40). We then e-mailed the authors of more than 100 studies for which we were unable to find any trace of the study and asked what happened to their studies. We also asked authors who did not provide a publication or working paper to summarize the results of their experiments.

The outcome variable distinguishes between two types of unpublished experiments: those prepared for submission to a conference or journal, and those never written up in the first place. It is also possible that papers with null results may be excluded from the very top journals but still find their way into the published literature. Thus, we disaggregated published experiments according to their placement in top-tier or non–top-tier journals (40) (a list of journal classifications is provided in table S1). The results from the majority of TESS studies in our analysis sample have been written up (80%), whereas less than half (48%) have been published in academic journals.

We also ascertained whether the results of each experiment are described as statistically significant by their authors. We did not analyze the data ourselves to determine whether the findings were statistically significant for two main reasons. First, it is often very difficult to discern the exact analyses the researchers intended. The proposals that authors submit to TESS are not a matter of public record, and many experiments have complex experimental designs with numerous treatment conditions, outcome variables, and moderators. Second, what is most important is whether the authors themselves consider their results to be statistically significant because this influences how they present their results to editors and reviewers, as well as whether they decide to write a paper. Studies were classified into three categories of results: strong (all or most of hypotheses were supported by the statistical tests), null (all or most hypotheses were not supported), and mixed (remainder of studies) (40). Approximately 41% of the studies in our analysis sample reported strong evidence in favor of the stated hypotheses, 37% reported mixed results, and 22% reported null results.

There is a strong relationship between the results of a study and whether it was published, a pattern indicative of publication bias. The main findings are presented in Table 3, which is a cross-tabulation of publication status against strength of results. The null hypothesis of a Pearson χ2 test of independence is easily rejected [χ2(6) = 80.3, P < 0.001], implying that there are clear differences in the statistical results between published and unpublished studies. Although around half of the total studies in our sample were published, only 20% of those with null results appeared in print. In contrast, ~60% of studies with strong results and 50% of those with mixed results were published. Although more than 20% of the studies in our sample had null findings, <10% of published articles based on TESS experiments report such results. Although the direction of these results may not be surprising, the observed magnitude (an ~40 percentage point increase in the probability of publication from moving from null to strong results) is remarkably large.

Table 3 Cross-tabulation between statistical results of TESS studies and their publication status (column percentages reported).

Pearson χ2 test of independence: χ2 (6) = 80.3, P < 0.001.

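A χ2 test of this form can be sketched as follows. The exact cell counts of Table 3 are not reproduced here; the counts below are hypothetical, chosen only to be consistent with the marginals reported in the text (221 studies; 48 null, 82 mixed, 91 strong; four publication-status categories, hence 6 degrees of freedom):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 4x3 contingency table. Counts are hypothetical, chosen only to
# be consistent with the marginals reported in the text, not the actual Table 3.
#                        null  mixed  strong
table = np.array([
    [ 0, 10, 12],   # published, top-tier journal
    [10, 31, 44],   # published, non-top-tier journal
    [ 7, 32, 31],   # written up but unpublished
    [31,  9,  4],   # never written up
])

chi2, p, dof, expected = chi2_contingency(table)
# dof = (4 - 1) * (3 - 1) = 6, matching the chi^2(6) statistic in the text;
# a strongly skewed table of this size yields P far below 0.001.
```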

However, what is perhaps most striking in Table 3 is not that so few null results are published, but that so many of them are never even written up (65%). The failure to write up null results is problematic for two reasons. First, researchers might waste effort and resources in conducting studies that have already been executed in which the treatments were not efficacious. Second, and more troubling, if future researchers conduct similar studies and obtain statistically significant results by chance, then the published literature on the topic will erroneously suggest stronger effects. Hence, even if null results are characterized by treatments that “did not work” and strong results are characterized by efficacious treatments, authors’ failures to write up null findings still adversely affect the universe of knowledge. Once we condition on studies that were written up, there is no statistically significant relationship between strength of results and publication status (table S2).

A series of additional analyses demonstrate the robustness of our results. Estimates from multinomial probit regression models show that studies with null findings are statistically significantly less likely to be written up even after controlling for researcher quality (using the highest-quality researcher’s cumulative h-index and number of publications at the time the study was run), discipline of the lead author, and the date the study was conducted (supplementary text and table S3). Further, the relationship between strength of results and publication status does not vary across levels of these covariates (supplementary text and tables S4 and S5). Another potential concern is that our coding of the statistical strength of results is based on author self-reports, introducing the possibility of measurement error and misclassification. A sensitivity analysis shows that our findings are robust to even dramatic and unrealistic rates of misclassification (supplementary text and fig. S1).
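The supplement's sensitivity analysis is not reproduced here, but the flavor of such a robustness check can be sketched: randomly reclassify a fraction of the result codes and verify that the independence test still rejects. The per-study data below are hypothetical (built from the same illustrative marginals as above), and the 10% misclassification rate is an arbitrary choice:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical per-study data consistent with the marginals in the text:
# status codes 0-3 (top-tier, non-top-tier, written/unpublished, unwritten),
# result codes 0-2 (null, mixed, strong).
counts = np.array([[0, 10, 12], [10, 31, 44], [7, 32, 31], [31, 9, 4]])
status, result = [], []
for s in range(4):
    for r in range(3):
        status += [s] * counts[s, r]
        result += [r] * counts[s, r]
status, result = np.array(status), np.array(result)

def chi2_p(status, result):
    """P-value of the independence test on the rebuilt contingency table."""
    table = np.zeros((4, 3), dtype=int)
    np.add.at(table, (status, result), 1)
    return chi2_contingency(table)[1]

# Randomly reassign ~10% of result codes and re-run the test many times.
pvals = []
for _ in range(200):
    noisy = result.copy()
    flip = rng.random(len(noisy)) < 0.10
    noisy[flip] = rng.integers(0, 3, flip.sum())
    pvals.append(chi2_p(status, noisy))
pvals = np.array(pvals)
share_significant = (pvals < 0.05).mean()  # share of runs still rejecting
```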

Why do some researchers choose not to write up null results? To provide some initial explanations, we classified 26 detailed e-mail responses we received from researchers whose studies yielded null results and did not write a paper (table S6). Fifteen of these authors reported that they abandoned the project because they believed that null results have no publication potential even if they found the results interesting personally (for example, “I think this is an interesting null finding, but given the discipline’s strong preference for P < 0.05, I haven’t moved forward with it”). Nine of these authors reacted to null findings by reducing the priority of writing up the TESS study and focusing on other projects (for example, “There was no paper unfortunately. There still may be in future. The findings were pretty inconclusive”). Two authors whose studies “didn’t work out” eventually published papers supporting their initial hypotheses using findings obtained from smaller convenience samples.

How can the social science community combat publication bias of this sort? On the basis of communications with the authors of many experiments that resulted in null findings, we found that some researchers anticipate the rejection of such papers but also that many of them simply lose interest in “unsuccessful” projects. These findings show that a vital part of developing institutional solutions to improve scientific transparency would be to understand better the motivations of researchers who choose to pursue projects as a function of results.

Few null findings ever make it to the review process. Hence, proposed solutions such as two-stage review (the first stage for the design and the second for the results), pre-analysis plans (41), and requirements to preregister studies (16) should be complemented by incentives not to bury statistically insignificant results in file drawers. Creating high-status publication outlets for these studies could provide such incentives. The movement toward open-access journals may provide space for such articles. Further, the pre-analysis plans and registries themselves will increase researcher access to null results. Alternatively, funding agencies could impose costs on investigators who do not write up the results of funded studies. Last, resources should be deployed for replications of published studies if they are unrepresentative of conducted studies and more likely to report large effects.

Supplementary Materials

Materials and Methods

Supplementary Text

Fig. S1

Tables S1 to S7

Reference (42)

References and Notes

  38. The rate at which research-initiated proposals are approved by the peer reviewers engaged by TESS is provided in the supplementary materials.
  40. Materials and methods are available as supplementary materials on Science Online.

Acknowledgments: Data and replication code are available on GitHub (DOI: 10.5281/zenodo.11300). All authors contributed equally to all aspects of the research. No funding was required for this article. The authors declare no conflicts of interest. We thank seminar participants at the 2014 Annual Meeting of the Midwest Political Science Association, the 2014 Annual Meeting of the Society for Political Methodology, the 2014 West Coast Experiments Conference, Stanford University, and University of California, San Diego. We thank C. McConnell and S. Liu for valuable research assistance.