Report

A Physical Map of 30,000 Human Genes


Science  23 Oct 1998:
Vol. 282, Issue 5389, pp. 744-746
DOI: 10.1126/science.282.5389.744

Abstract

A map of 30,181 human gene–based markers was assembled and integrated with the current genetic map by radiation hybrid mapping. The new gene map contains nearly twice as many genes as the previous release, includes most genes that encode proteins of known function, and is twofold to threefold more accurate than the previous version. A redesigned, more informative and functional World Wide Web site (www.ncbi.nlm.nih.gov/genemap) provides the mapping information and associated data and annotations. This resource constitutes an important infrastructure and tool for the study of complex genetic traits, the positional cloning of disease genes, and the cross-referencing of mammalian genomes, and it provides validated human transcribed sequences for large-scale studies of gene expression.

The ultimate gene map for an organism is the complete sequence of its genome, annotated with the beginning and ending coordinates of every gene. Construction of such sequence maps has become routine for simpler organisms with relatively small genome sizes (for example, 1 to 20 Mb), and public databases now contain 18 examples of such complete genomic sequences (1). For more complex organisms, such as mice and humans, with genome sizes in the 3-Gb range, complete and accurate genome sequences are still 5 to 10 years away (2, 3). However, large quantities of preliminary data (“shotgun assemblies”) are already available (4) and expected to grow rapidly (5). Both of these factors necessitate the construction of gene maps to support basic and applied research in mammalian biology and medicine, as well as to aid in the analysis and interpretation of “unfinished” genome sequence data. Extensive libraries of expressed gene sequences (6, 7), combined with physical mapping with radiation hybrid (RH) panels (8–10), have provided the information, infrastructure, and technology to produce such maps in an efficient and economical manner.

In 1994, an international consortium was formed to construct a human gene map in which cDNA-based sequence-tagged site (STS) markers were physically mapped and then integrated with the genetic map of polymorphic microsatellite markers (11). The initial report of this consortium in 1996 described a map of ∼16,000 genes (12). A new map, reported here, represents a nearly 100% increase in gene density and map accuracy and may contain up to half of all human protein-coding genes. This map should be a valuable resource for the positional candidate cloning of complex (polygenic) disease loci, the construction of complete physical maps of chromosomes for genome sequencing, and comparative analysis of mammalian chromosome structure and evolution. Furthermore, sequence validation that occurs in the process of STS design and mapping creates a quality-assured gene sequence resource for “functional genomics” applications (13) such as the design and construction of large-scale gene expression arrays.

This new gene map consists of data from 41,664 STSs (Table 1). As in the previous map (12), they are based on 3′ untranslated regions of cDNAs. These STSs represent 30,181 unique genes. Markers were typed on the Genebridge4 (GB4) RH panel (39,886 cDNAs, 1641 microsatellite markers, and 13 telomeric markers), on the G3 RH panel (5013 cDNAs and 2091 microsatellites), or on both panels (1102 microsatellites). All GB4 data (Table 1) were, for the first time, merged into a single map and aligned with the G3 RH map and the genetic map (11) with the 1102 microsatellite markers that are common to all three maps. The integrated map is available at www.ncbi.nlm.nih.gov/genemap. In addition, two Web servers [one for each RH panel (14)] permit anyone to map a new marker relative to this map.

Table 1

Number of markers in the current and previous gene maps by contributor.


This new map is twofold to threefold more accurate than the 1996 gene map by several criteria. Some markers were mapped in duplicate to make it possible to detect discrepancies between independent experimental results. The error rate in assignment of the same marker to different chromosomes was 0.52% (compared with 1% in the 1996 map). The error rate in chromosome assignment was also assessed, with the e-PCR program (15), by matching STSs to 122 Mb of human genomic sequence data present in GenBank as of April 1998. Twenty-three of the 2134 STSs tested matched a genomic sequence from a different chromosome from that determined by the RH mapping, corresponding to an error rate of 1.08% (3.76% for the 1996 map). To assess the accuracy of marker placements along the chromosomes, we converted positions of markers on the GB4 and G3 maps (in cR3000 and cR10000 coordinates, respectively) to centimorgan (cM) coordinates on the genetic map for direct comparison. Only 1.35% of markers were discrepant by 10 cM or more (2.5% for the 1996 map), and 1.78% of markers had positions differing by 5 cM. This substantial improvement in quality of the new map is due primarily to retyping or removal (or both) of markers suspected of being in error on the basis of analysis of the previous map.
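The accuracy figures above can be checked with simple arithmetic. The following is a minimal sketch in Python, not part of the original analysis: it recomputes the e-PCR error rate from the reported counts (23 of 2134 STSs) and compares each reported error rate with its 1996 counterpart. The function and variable names are illustrative assumptions.

```python
def error_rate_percent(errors: int, tested: int) -> float:
    """Return errors / tested expressed as a percentage."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return 100.0 * errors / tested


# Counts reported in the text: 23 of the 2134 STSs tested by e-PCR matched
# genomic sequence from a chromosome other than the RH-assigned one.
epcr_rate = error_rate_percent(23, 2134)
print(f"e-PCR chromosome-assignment error rate: {epcr_rate:.2f}%")  # 1.08%

# (1996 map rate, current map rate) in percent, as reported in the text.
reported_rates = {
    "duplicate typing": (1.0, 0.52),
    "e-PCR check": (3.76, epcr_rate),
    "placement >= 10 cM": (2.5, 1.35),
}

# Fold improvement = old error rate / new error rate.
fold_improvement = {
    name: old / new for name, (old, new) in reported_rates.items()
}
for name, fold in fold_improvement.items():
    print(f"{name}: {fold:.1f}-fold lower error rate")
```

Under these numbers the three criteria give roughly 1.9-fold, 3.5-fold, and 1.9-fold reductions, broadly in line with the twofold to threefold summary.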

The chromosomal distribution of 30,075 distinct gene-based markers (excluding those with conflicting chromosome assignments) is given in Table 2. The ratio of observed versus expected genes per chromosome [based on the physical length of each chromosome (16)] indicates a significantly higher gene density for chromosomes 1, 11, 17, 19, and 22 and a significantly lower gene density for chromosomes 4, 5, 8, 13, 18, and X.
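The observed-versus-expected comparison can be sketched as follows, assuming that the expected count for a chromosome is the total marker count scaled by that chromosome's share of total physical length. The text does not state which significance test was applied, so none is shown here. The chromosome names, counts, and lengths are hypothetical toy values, not the figures from Table 2.

```python
def observed_expected_ratios(observed, lengths_mb):
    """Return observed/expected marker ratios per chromosome.

    Expected counts assume markers are spread in proportion to physical
    length (an assumption made for this sketch).
    """
    total_markers = sum(observed.values())
    total_length = sum(lengths_mb.values())
    ratios = {}
    for chrom, count in observed.items():
        expected = total_markers * lengths_mb[chrom] / total_length
        ratios[chrom] = count / expected
    return ratios


# Hypothetical toy data: three equally long chromosomes with unequal counts.
toy_counts = {"chrA": 300, "chrB": 100, "chrC": 200}
toy_lengths_mb = {"chrA": 100.0, "chrB": 100.0, "chrC": 100.0}

ratios = observed_expected_ratios(toy_counts, toy_lengths_mb)
for chrom, ratio in ratios.items():
    print(f"{chrom}: observed/expected = {ratio:.2f}")
```

A ratio above 1 indicates a chromosome that is richer in mapped genes than its length alone would predict, and a ratio below 1 indicates the opposite.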

Table 2

Chromosome distribution of distinct gene-based STS markers.


The total number of human genes has been estimated at 60,000 to 70,000 (17). Therefore, this map contains transcript markers approaching half of all human genes. The map includes 18,703 of the 46,045 entries in UniGene (7) and 4684 (78%) of the about 6000 human genes of known function (18). Work is continuing not only to map all remaining unmapped cDNAs but also to redevelop and retype markers for cDNAs that failed initial mapping attempts. New efforts by the community to convert expressed sequence tags (ESTs) into more accurate and complete cDNA clones and sequences (3) will aid this process enormously.
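As a quick check of the coverage figures in this paragraph, the short Python sketch below recomputes each share from the counts given in the text. The variable names are illustrative, and 6000 is used for "about 6000" genes of known function.

```python
def percent(part: float, whole: float) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


MAPPED_GENES = 30_181                  # unique genes on the current map
GENE_ESTIMATE_RANGE = (60_000, 70_000)  # estimated total human genes (17)

# Share of all human genes, using the high and low total estimates.
share_low = percent(MAPPED_GENES, GENE_ESTIMATE_RANGE[1])
share_high = percent(MAPPED_GENES, GENE_ESTIMATE_RANGE[0])
print(f"Share of all genes: {share_low:.1f}% to {share_high:.1f}%")  # 43.1% to 50.3%

# Share of UniGene entries and of genes with known function.
unigene_share = percent(18_703, 46_045)
known_function_share = percent(4_684, 6_000)
print(f"UniGene entries mapped: {unigene_share:.1f}%")                # 40.6%
print(f"Known-function genes mapped: {known_function_share:.1f}%")    # 78.1%
```

The 43% to 50% range is what supports the statement that the map approaches half of all human genes.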

The main practical value of having a dense and integrated genetic-physical map of genes is to accelerate the discovery, by positional and positional candidate cloning (19), of human disease genes. In the calendar year after publication of the 16,000-gene map (12), isolation of 16 genes by positional approaches was reported (20). Retrospective analysis shows that 44% (7 out of 16) of these genes had already been isolated as ESTs and mapped at the time of their cloning. This fraction increases to 69% (11 out of 16) when the data from the current map are considered.

Comparative analysis has a long and fruitful history in biology, and detailed comparative maps of mammalian genomes have shed light on chromosome evolution. The identification and cross-referencing of genes allow insights into similarities and differences of physiology and development as well as candidates for transgenesis and gene knockout experiments. Thus, it was of interest to determine the extent to which genes on the current human map could be related to orthologous genes in other mammals. Makalowski and Boguski (21) have assembled a set of 1880 human genes along with their rat or mouse (or both) orthologs. When these genes were analyzed for overlap with the 30,181 mapped human genes in the current study, we found that 70% of these human genes with rodent counterparts are present. This data set therefore provides an excellent index for cross-referencing the human map with emerging gene-based physical maps of the mouse and rat genomes (22).

Genome-scale expression monitoring or profiling (23), a rapidly expanding area of functional genomics, relies on the availability of large catalogs of cDNA sequences or arrays of clones (or both). The problems posed by sequence redundancy and inaccuracy are as critical for gene expression applications as they have been for transcript mapping. Furthermore, additional problems in these catalogs have become apparent, necessitating the authentication of sequences and clone reagents. Our collection of nearly 42,000 successfully mapped, gene-based STSs, representing ∼30,000 unique human transcripts, provides a large, validated set of human sequences that can be used to design gene-specific oligonucleotides or select cDNA-derived polymerase chain reaction products for populating gene expression arrays (or both). Use of this set could lead to a very useful confluence of mapping and expression information for human genes.

We have produced a map containing perhaps half of all human genes. This map and subsequent versions will ultimately be replaced by the complete sequence of the human genome. Until then, this reference resource should contribute substantially to the advancement of structural and functional genomics, to comparative biology, and to the isolation of human disease genes, particularly those underlying complex traits.

