Selective Dissemination and Indexing of Scientific Information

Science  23 Jul 1971:
Vol. 173, Issue 3994, pp. 300-308
DOI: 10.1126/science.173.3994.300


Selective dissemination of information to individuals provides a new and promising method for keeping abreast of current scientific information. Since SDI services are directed to the information needs of each individual, they are a significant step beyond group-oriented services and products, which require considerable expenditure of effort by each user as he sorts useful information from trash. However, SDI systems do require a high degree of precision in matching scientists against documents. They must operate more efficiently and economically than many current systems which occasionally provide a useful item of information to users. To meet these stringent requirements for quality, precision, efficiency, and economy, more research must be devoted to comparing and improving indexing methods, which are the basic component of all information storage and retrieval systems.

It is incredible that so much money has been spent on the development and operation of scientific information systems before basic data on the comparative performance of various indexing methods have been gathered, analyzed, and confirmed by multiple investigators. The design of an effective information system would seem to require this type of basic knowledge, just as basic properties of alternative materials must be known before an engineer can design a building, bridge, or factory. Yet, except for the few studies mentioned in the previous section, research on indexing methods has been greatly neglected. Bourne's comment about studies of indexing languages is still an appropriate description of the situation: "In almost all the experimental reports, the investigator worked with an indexing language different than that of other experimenters. Consequently, no one has ever had his test results verified, or expanded, or made more precise by another experimenter" (47).

Most existing information systems are based on keyword indexing, with concepts broken into isolated terms during input operations and recombined to synthesize the original concept during search and retrieval. Such systems tend to involve imprecise indexing, with a high level of "noise" in retrieved documents, difficult search strategy involving extensive post-coordination, and lengthy, complex computer manipulations. This situation reflects the fact that many producers of indexed data originally focused the design of their systems on the production of a published product with entries printed under short, concise index headings. Production of magnetic tapes as a by-product of the publication process, and their use for retrospective searching or for SDI services, was a much later development, almost an afterthought. Yet use of these tapes is growing so rapidly that it may be time to redesign the tape-producing systems, with ease of tape use for SDI services and retrospective searching as the primary consideration, and with publication of abstract and index bulletins or title listings relegated to secondary importance (49).
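
The keyword approach described above can be illustrated with a minimal sketch. It is not drawn from any system discussed in the article: the documents, keywords, and function names are hypothetical, chosen only to show how post-coordination of isolated terms can retrieve a document in which the terms were never related.

```python
from collections import defaultdict

# Hypothetical toy collection: each document is reduced to isolated keywords.
DOCUMENTS = {
    1: {"virus", "leukemia", "mouse"},            # viral leukemia in mice
    2: {"virus", "vaccine", "leukemia", "drug"},  # a virus vaccine AND an unrelated leukemia drug
    3: {"leukemia", "drug", "human"},             # drug therapy of human leukemia
}

def build_inverted_index(documents):
    """Map each keyword to the set of documents indexed under it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def coordinate(index, query_terms):
    """Post-coordinate at search time: intersect the posting sets of all query terms."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)

index = build_inverted_index(DOCUMENTS)
hits = coordinate(index, ["virus", "leukemia"])
print(sorted(hits))
```

Under these assumptions the query for viral leukemia returns documents 1 and 2. Document 2 carries both keywords, but they describe unrelated topics, so it is "noise" of the kind the paragraph describes: the original concept was broken apart at input and cannot be reliably resynthesized at search time.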

The use of keywords to index documents creates a high degree of disorganization in information search and retrieval operations: Information is scattered under the many different terms that can be used to index different aspects of a concept. If the large-scale, comprehensive abstracting and indexing services were based on enumerative classifications with assignment of documents to logical hierarchical categories at the time of initial indexing, then many of the specialized information centers (50) and the 1300 abstracting and indexing services (3) would be unnecessary, and much of the reindexing and reprocessing of documents, the repackaging and reworking of abstracts and index data, and the resulting overlap and duplication characteristic of current information processing could be terminated.

Partly because of the disorganization resulting from keyword indexing, the cost of a 5-year retrospective search of information on just one data base on magnetic tapes is a major investment (16). The effort and cost required to find a few items of useful information scattered among 1,285,000 abstracts indexed on 116 full reels of magnetic tape (11 million characters per reel) which will be needed for the 5-year Eighth Collective Index to Chemical Abstracts (1967-1971) (51) stagger the imagination.
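
The scale implied by these figures can be worked out directly. The short calculation below uses only the numbers quoted above and assumes that every reel holds exactly 11 million characters; the per-abstract figure is a derived average, not a number reported in the article.

```python
# Figures quoted in the text; treating each "full reel" as exactly 11 million characters is an assumption.
REELS = 116
CHARS_PER_REEL = 11_000_000
ABSTRACTS = 1_285_000

total_chars = REELS * CHARS_PER_REEL        # total characters in the file
chars_per_abstract = total_chars / ABSTRACTS  # average index data per abstract

print(f"{total_chars:,} characters in total")
print(f"about {round(chars_per_abstract)} characters per abstract")
```

On these assumptions a search that scans the whole file must read roughly 1.28 billion characters, or close to a thousand characters of index data for each abstract.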

In contrast, when HICLASS systems based on enumerative hierarchical classifications are used, concepts that might be useful for later retrieval are identified and related items of information are grouped together during the indexing process. These enumerative classifications, with single-hit matching, make it possible to index and retrieve ideas as intact units and to perform simple sequential searches of the very small segment of a file that deals with a given topic (31). The experiments at both the Science Information Exchange and the National Cancer Institute, as described in this article, demonstrate that automated HICLASS systems are feasible and can operate at a very satisfactory level of performance.
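
As a rough illustration of single-hit matching against a hierarchical classification, consider the sketch below. It is not the HICLASS implementation: the dotted class codes, the records, and the function name are hypothetical. It assumes that each document is assigned one intact category code at indexing time, that codes consist only of uppercase letters and dots, and that the file is kept sorted by code so that a topic occupies one contiguous segment.

```python
from bisect import bisect_left

# Hypothetical file of (class_code, document_id) records, kept sorted by code.
# Each dotted level narrows the topic, e.g. oncology > leukemia > viral.
RECORDS = sorted([
    ("ONC.LEUK.VIRAL", 1),
    ("ONC.LEUK.CHEMO", 3),
    ("ONC.LEUK.VIRAL", 4),
    ("IMM.VACC.VIRAL", 2),
    ("ONC.SARC.VIRAL", 5),
])

def search_category(records, code):
    """Return ids of documents filed under `code` or any of its subcategories.

    One lookup finds the start of the segment; a short sequential scan reads it.
    The early exit relies on codes containing only uppercase letters and dots,
    so that a category's subcategories sort immediately after it.
    """
    start = bisect_left(records, (code,))
    hits = []
    for class_code, doc_id in records[start:]:
        if class_code == code or class_code.startswith(code + "."):
            hits.append(doc_id)
        else:
            break  # left the contiguous segment for this topic
    return hits

print(search_category(RECORDS, "ONC.LEUK.VIRAL"))
print(search_category(RECORDS, "ONC.LEUK"))
```

In this sketch a request for viral leukemia matches one category code and reads only the two records filed under it, while a broader request for all leukemia work reads the adjacent segment. No terms are intersected at search time, because the concept was kept intact when the document was indexed.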

Although considerable effort may be required for the development and constant updating of detailed enumerative classifications, HICLASS categories may facilitate organization of data at the time of input, improve the precision of matching documents with users, and greatly simplify search logic and computer manipulations. If so, then output savings and performance would more than justify input costs, and the development and use of enumerative classifications would be a better solution to information problems than the current keyword-and-coordination approach.

It is time to think beyond the ease of the single input step in information systems and to take a hard look at ways of easing retrieval problems for the multitude of information systems that process the indexed data (52). Indexing effort is expended only once, whereas search and retrieval effort is required by every user of a system. If information were better analyzed and organized during input operations, if more basic research were devoted to the effect of indexing methods on the performance of information systems, and if more emphasis were placed on the quality and usefulness of retrieved information, then the magnitude of problems related to the storage and retrieval of scientific information might be considerably reduced.