Staying Afloat on the Seas of Data

Science  11 Dec 1998:
Vol. 282, Issue 5396, pp. 1989
DOI: 10.1126/science.282.5396.1989

This issue of Science contains a milestone in the life sciences: the first complete genome of a multicellular eukaryote, the nematode Caenorhabditis elegans. Those who worked so exhaustively to collect this mass of sequence information deserve the praise of the entire scientific community. Not only did they develop an efficient and accurate strategy for analyzing large genomes, they also assembled the data in an organized manner that will now permit many other scientists to enjoy and extend the fruits of their labor. And determining the sequence of this genome is only the first stage of a heroic endeavor to understand the meaning of the genes and the interactions of their gene products. Truly, these researchers have generated a massive sea of data.

However, scientists regard data with mixed emotions. They avidly collect, validate, and interpret their own observations to produce the scientific publications on which their careers will be built. Yet overloaded scientists focused on their own work might skeptically view genome-sized bursts of data as inaccessible or too massive to study usefully. In fact, this view is false. Because the data are provided in an organized form suitable for the powerful, publicly available search, alignment, and similarity-seeking tools, any scholar can now examine them.

The databases that have accrued through the several microbial genome projects, together with the yeast and C. elegans genomes, now begin to offer ways to examine the still-incomplete genomes of other species for conservation of the best-defined gene products. All of these genome databases can synergize constructively with intermediate-scale databases on types of molecular function (receptors, transductive mechanisms, DNA transcriptional regulators, and classes of enzymes) or on structural motifs representing consensus protein functions. Whole-genome databases are a major advance over the molecule-by-molecule methods that expanded the DNA and protein databases over the past quarter-century. The next decade will bring a host of other complete genome sequences from other organisms and from humans. In short, the scientific community can now not only gain sequence precision from the completion of the C. elegans genome but also contribute data of its own to hasten the task of providing functional annotations that will clarify the role of these gene products in the physiology of the nematode.

Moreover, as important as these whole-genome efforts will be, other technological advances for rapidly analyzing whole-organism gene expression, such as the DNA chip array, will likely yield even larger data sets. Comprehensive determination of the genetic responses of specified cells under defined experimental challenges will likely reveal new regulatory systems, especially as tools emerge to refine analysis of their temporal, spatial, and quantitative differences across cells in other organs and in other species.

The growth of databases poses many challenges for scholars and journals. Traditionally, scientific databases have been both public and private. Although many issues of standardization remain to be resolved for the multiple public databases, corrections and annotations require that all databases be constantly updated. Should databases from which observations have been extracted for a scientific report be frozen and made accessible for others to examine for confirmation or alternative insights? Should they be maintained by the scholar or the journal, and for how long and in what form? Would such data sets be an archived laundry list of information or could they provide a new dimension to guide future experimentation?

Science believes that when an experimental database provides a generalized understanding for a broad community and is crucial to the evaluation of insights presented in a paper, it will benefit scholarship for that database to be reviewed, publicly archived, and made accessible in easily imported file structures. Moreover, we look forward to the challenge of reviewing, evaluating, and properly presenting papers based on such large data sets, and to the extended insights that the community may then be able to glean from accessing them. We await your submittals.
