News of the Week: DNA DATA

Proposal to 'Wikify' GenBank Meets Stiff Resistance


Science  21 Mar 2008:
Vol. 319, Issue 5870, pp. 1598-1599
DOI: 10.1126/science.319.5870.1598

When Thomas Bruns turns to GenBank, the U.S. public archive of sequence data, to identify a fungus based on its DNA sequence, he does so with some trepidation. As many as 20% of his queries return incorrect or incomplete information, says Bruns, a mycologist at the University of California, Berkeley. In a letter on page 1616, Bruns, Martin Bidartondo of Imperial College London, and 250 colleagues who work on fungi urge GenBank to allow researchers who discover inaccuracies in the database to append corrections. GenBank, however, says such a fix would cause more problems than it solves.

The letter comes from a relatively small research community concerned primarily with making sure that the species from which a sequence came is correctly identified. But “the problem extends far beyond fungi, to much bigger—and [more] recognizable—creatures,” says James Hanken, director of the Museum of Comparative Zoology at Harvard University. Other sorts of errors—such as inaccurate information on a gene's structure or on what its proteins do—also plague the database.

Incorrect data are more than just an inconvenience. Analyses of new data depend in large part on comparisons with what's already in GenBank, whether right or wrong. Computers predict gene function, for example, based in part on similarities with known genes. And Bruns and others ferret out species' identities, often of organisms otherwise indistinguishable, by looking for matches to named GenBank entries. Under the current setup, “error propagation is all too likely,” says Thomas Kuyper, a mycologist at Wageningen University in the Netherlands.

Tangled mess.

The fungal threads (white fluff) on these pine roots require GenBank comparisons to identify.

CREDIT: THOMAS BRUNS

What the mycologists are asking for is a scheme like those used in herbaria and museums, where specimens often carry multiple annotations, with original and new entries listed side by side. It would be a community operation, like Wikipedia, in which the users themselves update and add information, but not anonymously.
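One way to picture the herbarium-style scheme is as an append-only record: the submitter's original entry is never overwritten, and each later annotation is signed and dated. The sketch below is purely illustrative. The class names, fields, and the accession string are hypothetical, and nothing here reflects GenBank's actual record format or any NCBI interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Annotation:
    """One signed, dated note attached to a record (hypothetical structure)."""
    author: str       # annotations are never anonymous
    added_on: date
    species: str      # the annotator's proposed identification
    comment: str = ""


@dataclass
class SequenceRecord:
    """A deposited sequence entry plus every annotation added since (hypothetical)."""
    accession: str
    submitter: str
    original_species: str
    annotations: list[Annotation] = field(default_factory=list)

    def annotate(self, author: str, species: str, comment: str = "",
                 added_on: date | None = None) -> None:
        """Append a signed annotation; the original entry is left untouched."""
        if not author.strip():
            raise ValueError("annotations must be signed")
        self.annotations.append(
            Annotation(author, added_on or date.today(), species, comment)
        )

    def history(self) -> list[str]:
        """Original entry first, then each annotation in the order it was added."""
        lines = [f"{self.original_species} (original, {self.submitter})"]
        lines += [
            f"{a.species} ({a.author}, {a.added_on.isoformat()})"
            for a in self.annotations
        ]
        return lines


# Made-up example data: one record, one later re-identification.
record = SequenceRecord("EXAMPLE0001", "Original Submitter", "Species A")
record.annotate("A. Mycologist", "Species B",
                "re-identified from voucher specimen", date(2008, 3, 21))
for line in record.history():
    print(line)
```

Because annotations are only ever appended, a reader sees the original identification and every later opinion side by side, much as on a museum specimen label.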

Up and up.

Critics fear that GenBank's rapid growth is leading to error propagation.

CREDIT: GENBANK

GenBank's managers are dead set against letting users into GenBank's files, however. They say there already are procedures to deal with errors in the database, and researchers themselves have created secondary databases that improve on what GenBank has to offer. “That we would wholesale start changing people's records goes against our idea of an archive,” says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. “It would be chaos.”

The standoff over the quality of GenBank's data is in part a product of the database's success—and the issues are only going to get more intense. Researchers have been contributing genes, gene fragments, even whole genomes to GenBank since 1982, making it an incredibly valuable resource for many thousands of investigators worldwide. Today, GenBank provides 194.4 publicly accessible gigabases, a number that will double in 18 months, thanks in part to cheaper, faster sequencing technologies and a rise in “environmental” sequencing: mass sequencing of all the DNA in soil, skin, or other samples.

From early on, researchers recognized that errors would be inevitable (Science, 15 October 1999, p. 447), and although GenBank runs some quality-control checks on incoming sequences, it cannot catch many mistakes. GenBank has just one mycologist on staff, for example, but 150,000 fungal sequences were deposited this past year. “That's not something that a single person can curate,” says Lipman.

GenBank's creators consider the database a “library” of sequence records that, like books or journal articles, belong to the authors and therefore can be changed only by the submitters of that data. A note indicates when a record has been updated and points to the archived original. Stephen O'Brien, who does comparative genomics at the National Cancer Institute in Frederick, Maryland, argues that author privilege is necessary. “One of the reasons GenBank is so doggone useful and comprehensive is that nobody edits or micromanages it except the authors,” he says. “This makes for downstream errors but almost universal buy-in.”

Lipman says authors do take the time to make corrections. GenBank gets about 30 such messages a day, he points out. But others disagree, citing case after case in which problems were not fixed. Often the submitters have moved on to other projects and never get around to making the changes, says Steven Salzberg, a bioinformaticist at the University of Maryland, College Park. And, he adds, the big sequencing centers—which churn out genome after genome with preliminary annotation—are the worst offenders: “They won't let anybody touch their GenBank sequences, and they won't change it, for whatever reason.”

Lipman points out that other researchers improve on GenBank's data in a variety of ways. NCBI, for example, curates genes, along with other interesting DNA and RNA sequences, and puts them in a database called RefSeq that is updated as new information about these sequences comes along. And researchers focused on particular groups of organisms have set up their own secondary databases, such as FlyBase for the fruit fly genomes and TAIR for Arabidopsis, that offer cleaned-up GenBank data, along with other genomic information and tools for analyzing them. And, Lipman notes, NCBI even offers a way for researchers to do third-party annotation. But it's not the third-party annotation scheme the mycologists want.

For starters, GenBank has set a high bar for accepting changes: Entries must be backed by a publication. Annotations concerning a gene's function, for example, require published experimental data about that gene's protein or a related one. This discourages legitimate improvements, says Carol Bult, a geneticist at the Jackson Laboratory in Bar Harbor, Maine, because often a proposed correction doesn't justify an entire publication. Furthermore, an indication that additional annotation exists is deeply buried in the original sequence record.

Although he's adamant that NCBI is not going to “wikify” GenBank, Lipman says he's eager to work with mycologists to come up with a solution, possibly through RefSeq. Salzberg thinks NCBI will eventually come up with a way to maintain GenBank as an archive while allowing greater community involvement in annotation. “I think it will be solved eventually,” he says. “But it's not clear how it will be solved.”
