In Depth: COVID-19

New tools aim to tame pandemic paper tsunami


Science  29 May 2020:
Vol. 368, Issue 6494, pp. 924-925
DOI: 10.1126/science.368.6494.924

Science's COVID-19 coverage is supported by the Pulitzer Center.


Timothy Sheahan, a virologist studying COVID-19, wishes he could keep pace with the growing torrent of new scientific papers related to the pandemic. But there have just been too many—more than 5000 papers a week. “I'm not keeping up,” says Sheahan, who works at the University of North Carolina, Chapel Hill. “It's impossible.”

A loose-knit army of data scientists and software developers is pressing hard to change that. They are creating digital collections of papers and building search tools powered by artificial intelligence (AI) that could help researchers quickly find the information they seek. The urgency is growing: The COVID-19 literature has grown to more than 31,000 papers since January and by one estimate is on pace to hit more than 52,000 by mid-June—among the biggest explosions of scientific literature ever.

The volume of information “is like what you would get in a medical conference that used to happen yearly. Now, that's happening daily,” says Sherry Chou, a neurologist at the University of Pittsburgh Medical Center who is studying COVID-19's neurologic effects.

“People don't have time to read through entire articles and figure out what is the value added … and what are the limitations,” says Kate Grabowski, an epidemiologist at Johns Hopkins University's School of Medicine who leads an effort to create a curated set of pandemic papers.

It's not clear, however, whether the emerging efforts will tame the tsunami. Despite a global effort to persuade publishers to make all papers relevant to COVID-19 immediately free, as many as 20% of new papers are still behind paywalls, a recent study found, off-limits to some readers and AI analysis. Some of the new search tools, meanwhile, aren't very user-friendly or are little known. And many researchers are skeptical that the tools will tell them what they really want to know: What is the work's quality? “People tend to oversell and put up papers with data that do not support their conclusions,” Sheahan says. “It's a mess.”

One line of work got a boost on 16 March, when the White House Office of Science and Technology Policy announced the launch of the COVID-19 Open Research Dataset (CORD-19), a trove that now includes more than 128,000 peer-reviewed articles and preprints, including studies of virology and coronaviruses dating back decades. To create the archive, some of the largest groups active in machine learning—including Google, the Chan Zuckerberg Initiative, and the Allen Institute for AI—collaborated with the National Institutes of Health and others to use search methods, such as natural language processing, to scan the scientific literature for relevant terms. The team also converted PDF files into a form readable by machine learning algorithms so other researchers could analyze the papers.

CORD-19's creation was “amazing work,” says Giovanni Colavizza, a bibliometrics researcher at the University of Amsterdam. But analyses he and colleagues conducted have found potential shortcomings. For example, as of 17 April, about 75% of the papers don't mention search terms used by CORD-19's creators, such as “coronavirus,” in their titles, abstracts, or keywords, the researchers reported in a preprint posted on bioRxiv. That means these articles might only be tangentially related to COVID-19, he says. What's more, fewer than half the papers provided full text, necessary for comprehensive data mining by AI programs.

A growing number of papers are also not freely available to human readers. In response to calls from major science funders, including some governments, most major publishers have pledged to make their COVID-19–related papers free. But the number of paywalled publications is growing faster than the number of free ones, according to a study led by Nicolas Robinson-Garcia of the Delft University of Technology and posted as a preprint on 26 April on bioRxiv. By 1 June, nearly half of all COVID-19 papers could be behind paywalls, the researchers estimate, which also limits data mining and AI-enhanced searching.

Despite these limitations, many teams are turning to advanced computational tools to mine databases such as CORD-19. Data scientists, for example, have launched more than 1500 projects in response to a White House call to build tools that use CORD-19 to help answer 10 high-priority, pandemic-related research questions identified by the U.S. National Academy of Sciences and the World Health Organization.

One fruit of these efforts, which are listed on the Kaggle online hub, is an “AI-powered literature review.” It used algorithms to harvest data points from more than 830 papers in CORD-19 on 17 topics, and presents a web page for each topic that displays data tables and links to more information. But the algorithms don't always extract the data correctly, so medical students and other volunteers idled by the pandemic have been manually checking the tool's accuracy.

Another challenge is making some tools user friendly. A team at the Allen Institute for AI recently unveiled SciSight, which helps those searching the CORD-19 database by automatically suggesting similar papers and drawing browsable maps of related papers.

Grabowski's team at Johns Hopkins decided to emphasize human judgment over automated approaches. To create their 2019 Novel Coronavirus Research Compendium, which debuted on 17 April, more than 50 people are combing through the literature. So far, they have selected and summarized more than 120 papers on eight topics, including vaccines and treatments. The team excluded most of the articles it examined because they contained only commentaries, protocols, poor-quality models, or no original findings, Grabowski says. The effort is focused on studies in humans, and the intended readers include health care workers and policymakers, Grabowski says. “We are trying to fill a void that we saw. … [T]here is just so much information, but a lot of the studies are not conducted very well.”

It's too soon to measure the quality of pandemic papers based on citations or retractions, specialists say. But if quality is suffering, it's not because, as often feared, large numbers of studies are bypassing peer review and appearing only as preprints. Preprints make up only a minority of the COVID-19 gusher, according to Robinson-Garcia's team. As of 14 April, some 80% of the more than 11,000 COVID-19 manuscripts it examined had appeared in refereed journals. (Some were preprints originally.)

Despite the many efforts to help them navigate the literature, Sheahan and other scientists say they have not heard of new tools released in recent weeks or have had little time to try them. And persuading researchers to adopt them could be difficult, says Jevin West, a data scientist at the University of Washington, Seattle. “It's going to take some time to get people to change their habits,” he says. Making that pitch during a pandemic is “like going into an emergency room and giving the doctors a different scalpel and saying: ‘This is actually better.’”

In the meantime, many researchers say they are falling back on time-tested ways to keep up with new results, including reading bulletins from scientific societies and a few leading journals, as well as relying on word of mouth—including tweets—from trusted colleagues.
