Technical Comments

Dealing with Database Explosion: A Cautionary Note


Science  13 Jun 1997:
Vol. 276, Issue 5319, pp. 1724-1725
DOI: 10.1126/science.276.5319.1724

Carol J. Bult et al. (1) report the first complete archaeal genome sequence, that of Methanococcus jannaschii (Mja). Because the initial gene assignments were conservative (1, 2), we anticipated that much interesting biological information would be missing. We searched the database for additional open reading frames (ORFs) and found 15: four within intergenic regions (M1 through M4, Table 1); five that overlap previously identified ORFs (1, 2) but are read in a different frame (M5 through M9, Table 1); and six that are extended or truncated as a result of potential frameshifts (M10 through M15, Table 2).

Table 1

New ORFs in M. jannaschii (Mja) identified on the basis of similarity. ORFs were identified after removing the protein-coding regions reported for the organism (1); the remaining sequence was searched using BLASTX against the combined SwissProt+PIR+GenBank translations database through the NCBI Network BLAST server with a score cutoff of 60, as described previously (6). For each ORF, the table gives the matching protein; the matching species [Methanococcus vannielii (Mva), Bacillus subtilis (Bsu), or Haemophilus influenzae (Hin)]; the 5′ start position; the strand (+ or −); the length of the ORF in amino acids (AA); the 5′ and 3′ flanking ORFs; and the Poisson probability estimate. Other details available at

Table 2

Identification of potential frameshift(s) by similarity. Highly significant BLAST matches to similar genes in alternative coding frames were classified as frameshifts, manually assembled, and confirmed. The effect of each frameshift (extension or truncation) and the length of the resulting ORF are also provided. M10 through M14 each show a single frameshift event, whereas M15 has apparently undergone a second frameshift. Other details available at


Although the potential frameshifts we describe might be bona fide, we cannot rule out that they are sequencing artifacts. Erroneous sequences in public databases are a substantial problem; their frequency has been estimated at 0.37 to 2.9 errors per 1000 nucleotides (3), which sometimes makes data interpretation difficult. This is especially true, for example, in studies that use protein and DNA sequence information to estimate evolutionary distances (4). It is not known how the error rate in this study (1) compares with error rates in the database, but a previous study suggests that error rates generally vary between 1 in 5000 and 1 in 10,000 nucleotides (5).
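To illustrate why a single miscalled base can extend or truncate an ORF, the following minimal sketch counts the codons read before the first stop codon in a toy sequence, with and without one deleted base. The sequences and the helper names (`codons_before_stop`, `delete_base`) are hypothetical and chosen only for illustration; they are not taken from the M. jannaschii genome or from the analysis described above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_before_stop(seq, frame=0):
    """Count codons read in the given frame before the first stop codon.

    Returns (count, stopped). stopped is False when the sequence ends
    before any stop codon is seen, i.e. the ORF would run on downstream.
    """
    count = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return count, True
        count += 1
    return count, False

def delete_base(seq, index):
    """Simulate a sequencing artifact in which one base is missing."""
    return seq[:index] + seq[index + 1:]

# Toy sequences (hypothetical). Both encode four codons before a stop.
extended_case = "ATGGCAAAAGGTTGACCTTAA"   # ATG GCA AAA GGT TGA ...
truncated_case = "ATGCTGAAAGGTTAA"        # ATG CTG AAA GGT TAA

# Losing the base at index 3 shifts every downstream codon by one position.
extension = codons_before_stop(delete_base(extended_case, 3))
truncation = codons_before_stop(delete_base(truncated_case, 3))

print(codons_before_stop(extended_case), "->", extension)
print(codons_before_stop(truncated_case), "->", truncation)
```

In the first toy case the original stop codon is read through after the deletion (an apparent extension); in the second, a new stop codon appears immediately after the start codon (an apparent truncation). Sequence alone cannot tell whether such a shift is biological or an artifact.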

The issue of sequencing artifacts is important and is expected to remain a problem, given the surge of genome sequencing projects for model organisms as well as the human genome sequencing initiative.


Response: Bhatia et al. express concern about erroneous sequences in public databases, which sometimes make the interpretation of sequence data difficult. We share this concern because faulty entries in public databases, especially sequence annotations, often complicate our research efforts. We therefore dedicate considerable resources to maintaining a curated in-house database and to carefully checking the sequences and annotations that we provide to the public. The challenge is to find a suitable compromise between the quick release of newly sequenced genomes and responsible sequence quality and annotation. We estimate our error rate at the time of release to be 1 base in 5000 to 10,000 (1), which is about the quality requested for the Human Genome Project. For the 1.7-Mbp M. jannaschii genome (1), this would amount to about 250 putative errors, which would mainly result in frameshifts in ORFs that as yet have no recognizable homologs in any database. Bhatia et al. specify 15 regions in this genome where they suspect ORFs or frameshift problems resulting from sequencing artifacts. We encourage the input of the scientific community in ongoing efforts to further elucidate the wealth of biological information still hidden in this genome; however, without access to the original electropherograms that were used to generate the final genome sequence data, it is not always possible to determine definitively whether a presumed frameshift reflects an error in the DNA sequence (Table 1). For example, ORF M11 in table 2 of the comment suggests that we truncated a transposase gene by a frameshift, but this “ORF” is a vestigial gene that is missing a significant portion of the central part of its homologs. The nucleotide necessary for a correction of the frameshift, A-276,294, is absent in all 12 sequences covering this area of the genome.
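The figure of about 250 putative errors can be checked with simple arithmetic. The sketch below uses only the genome size and error-rate range stated in the response; taking the midpoint of the range is our assumption about how "about 250" might be reached, and the function name `expected_errors` is hypothetical.

```python
GENOME_SIZE_BP = 1_700_000  # 1.7 Mbp, as stated in the response

def expected_errors(genome_size_bp, bases_per_error):
    """Expected number of erroneous bases at one error per bases_per_error."""
    return genome_size_bp / bases_per_error

low = expected_errors(GENOME_SIZE_BP, 10_000)   # 1 error in 10,000 bases
high = expected_errors(GENOME_SIZE_BP, 5_000)   # 1 error in 5000 bases
midpoint = (low + high) / 2                     # assumption: simple average

print(f"{low:.0f} to {high:.0f} expected errors, midpoint {midpoint:.0f}")
```

The stated range corresponds to roughly 170 to 340 errors across the genome, which is consistent with the estimate of about 250.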

No automated computer system will discover some of the treasures (and some of the errors) still hidden in the genome of M. jannaschii. We are therefore grateful to colleagues who, after the release of the M. jannaschii genome sequence, contacted us to provide their biological, biochemical, and genetic experience and expertise, which has resulted in quick updates and corrections of our freely accessible database at

