Research Articles

A Comparison of Whole-Genome Shotgun-Derived Mouse Chromosome 16 and the Human Genome


Science  31 May 2002:
Vol. 296, Issue 5573, pp. 1661-1671
DOI: 10.1126/science.1069193

This article has corrections.

Abstract

The high degree of similarity between the mouse and human genomes is demonstrated through analysis of the sequence of mouse chromosome 16 (Mmu 16), which was obtained as part of a whole-genome shotgun assembly of the mouse genome. The mouse genome is about 10% smaller than the human genome, owing to a lower repetitive DNA content. Comparison of the structure and protein-coding potential of Mmu 16 with that of the homologous segments of the human genome identifies regions of conserved synteny with human chromosomes (Hsa) 3, 8, 12, 16, 21, and 22. Gene content and order are highly conserved between Mmu 16 and the syntenic blocks of the human genome. Of the 731 predicted genes on Mmu 16, 509 align with orthologs on the corresponding portions of the human genome, 44 are likely paralogous to these genes, and 164 genes have homologs elsewhere in the human genome; there are 14 genes for which we could find no human counterpart.

The laboratory mouse is an invaluable model for helping us understand human biology and disease. The mouse genome sequence, in combination with the recently reported human genome sequence (1, 2), offers the opportunity to rapidly improve our understanding of the relevance and importance of mouse models of human disease and the molecular bases for the similarities and differences between them and the corresponding human conditions (3). In addition, comparison of the complete DNA sequence of mice and humans will provide insights into the organization and evolution of the mammalian genome. To illustrate the utility of our whole-genome shotgun assembly of the mouse genome, we present the detailed architecture of one particular chromosome (Mmu 16) and compare it with the human genome. This comparison illustrates the power of comparative genomics based on nearly complete large-scale DNA sequence information.

The nearly complete sequence of a typical mouse chromosome (Mmu 16), presented here, was determined as part of the whole-genome shotgun sequencing and assembly of the mouse genome, a strategy that was used successfully for the Drosophila (4–6) and human genomes (1). Mmu 16 was chosen from this assembly for analysis because it shares a large region of synteny, about 25 megabase pairs (Mbp), with human chromosome 21 (Hsa 21), which has been extensively characterized (7). This assembly was carried out with DNA sequence derived from four strains of laboratory mice [A/J, DBA/2J, 129X1/SvJ, and 129S1/SvImJ (8a)] that were chosen partly because they complement the C57BL/6J strain, which is being sequenced by a separate mouse genome sequencing effort (8b). In addition, these four strains belong to distinct lineages of the laboratory mouse (9, 10), and they differ in numerous traits of biological and medical interest.

Chromosome 16 Sequence

Procedures for DNA extraction, library construction, and DNA sequencing are modified from those described in (1, 11). The resulting data set consisted of 27.4 million sequencing reads, sufficient to cover the genome 5.3 times. These sequences came from both ends of stringently size-selected 2-, 10-, and 50-kbp clones derived from randomly sheared mouse genomic DNA. We estimate that the combined lengths of these clones cover the genome 44 times. Over 80% of all sequencing reads could be associated as pairs coming from the ends of any given clone.

This data set, generated solely at Celera, was then analyzed with the whole-genome assembler previously used to produce the sequence of the Drosophila and human genomes (1, 4). To put the chromosome 16 statistics in context, we note that this whole-genome assembly resulted in 19,788 scaffolds (contigs that are ordered and oriented with information from paired reads) spanning 2446 Mbp of the mouse genome. In this assembly, 50% of the bases are in scaffolds of at least 4.476 Mbp (the so-called N50 statistic), and the N50 contig size is 14,559 bp (12).
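The N50 statistic quoted above can be illustrated with a short calculation. The following is a minimal sketch, not the assembler's own code; the function name and the toy scaffold lengths are hypothetical and chosen only to make the definition concrete.

```python
def n50(lengths):
    """Return the largest length L such that pieces of length >= L
    together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest piece to the shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths (arbitrary units), not data from the assembly.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Applied to the real scaffold and contig length lists, the same procedure would yield the scaffold N50 and contig N50 reported in the text.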

We mapped whole-genome assembly scaffolds to the chromosomes by pairing the locations of known markers on the scaffold to the locations of the same markers on public genetic and radiation hybrid (RH) maps (1, 13a, 13b). These maps contained a total of 545 unique sequence-tagged site (STS) markers on chromosome 16, of which 510 (93.6%) were found on scaffolds in our mouse assembly, 9 (1.7%) were found on scaffolds that are composed of conserved repeat sequences, 13 (2.4%) were found on unassembled fragments, and 13 (2.4%) were not present in our mouse sequence. Of the 510 STSs found on scaffolds, 498 were on scaffolds that ultimately mapped to chromosome 16 (Tables 1 and 2). The remaining 12 (2.2%) STSs were found on scaffolds where the preponderance of evidence placed them elsewhere in the mouse genome. We excluded the possibility that these scaffolds were chimeric (i.e., containing pieces of chromosome 16 and some other chromosome) by analyzing the clone coverage of these “mixed” scaffolds (1). The results indicated that all the suspect regions have strong clone and fragment coverage, suggesting that assembly was unlikely to be the source of the unexpected placement (12). We then used the bin assignments of STS markers from Whitehead genetic and RH maps to order and orient the scaffolds. One additional scaffold of 105 kbp was recruited to chromosome 16 because it contained end sequences from two bacterial artificial chromosome (BAC) clones, which were mapped to Mmu 16 (14).

Table 1

Scaffold statistics for mouse chromosome 16.

Table 2

Scaffold mapping of Mmu 16 by STS markers. Scaffolds were ordered and oriented along the chromosome by analysis of their STS content. At least two STS markers of consistent order were required to consider a scaffold “ordered and oriented.” If the order or orientation of a scaffold was ambiguous, usually because of a small number of matching STSs, the scaffold was deemed “bounded and unoriented.”


Twenty scaffolds (92 Mbp) mapped to chromosome 16, 14 of which are >1 Mbp in size, with the largest being 15.65 Mbp (Table 2). These scaffolds contain 8635 contigs that cover 87,008,971 bp; the longest contig is 117,734 bp. The smallest six scaffolds cover only 434,340 bp, so that more than 99% of the bases in the mapped scaffolds are in scaffolds of >1 Mbp in length. Of the 19,788 total scaffolds, 2260 (95% of total scaffold length) could be assigned to chromosome locations. Therefore, based on the unmapped portion of the assembled mouse genome (53.9 Mbp, excluding the Y chromosome) (12), we estimate that an additional 2 to 3 Mbp of DNA maps to Mmu 16. The sequence that is missing in the gaps between contigs and scaffolds is largely composed of (i) short regions that lacked any read coverage due to random sampling and (ii) repeats that could not be entirely filled with mate pairs. Based on our experience with the human genome, we expect that some large (>20 kbp), nearly identical duplicated regions of the genome might be underrepresented in the scaffolds and contribute to the number of interscaffold gaps.

Perhaps the best measure of assembly accuracy is comparison with independently sequenced clones from the same genome. Seven BACs from the Cat eye syndrome region on chromosome 16 (15) have been sequenced, and the structure and sequence of the Celera scaffold corresponding to this region are in good agreement with the structure and sequence of each BAC clone (Fig. 1).

Figure 1

Dot-plot analysis of a portion of Celera (CRA) scaffold GA_x5J8B7W4YGF and BAC sequences. Each dot represents a match of at least 98% sequence identity and is plotted at 200 bp per pixel. GenBank accession numbers from left to right: AC018559, AC007844, AC012397, AC009192, AC006447, AC006404, AC006945. Small discontinuities or “chatter” in the plot reflect gaps between contigs in the scaffold assembly (which are expected) and do not indicate problems with contig order and/or orientation.

The sequence of Mmu 16 has been deposited at DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank as a whole-genome shotgun project under project accession number AAAD00000000. The version described here is the first version, accession number AAAD01000000. Additional information is available at www.celera.com/mouse16.

Gene Annotation

The methods for computational sequence analysis and automated gene annotation for the mouse genome used the same basic tools described for annotation of the human genome (1, 16). The autoannotation pipeline predicted 1055 genes with high to medium confidence (17) on Mmu 16. The gene predictions with weak confidence (unique ab initio gene predictions supported by only one type of homology evidence) are not included in this report or in the subsequent analysis. We manually examined the 1055 high- and medium-confidence gene predictions to facilitate comparison of the inventory of genes on Mmu 16 with the homologous genes in the human genome. On closer examination, taking into account genes that might have been split or merged in the autoannotation process, and after discounting pseudogenes and viral-related sequences (such as endogenous retroviruses), we reduced these 1055 predictions to 731 genes. The RefSeq database (18) reports 130 genes that have been mapped to Mmu 16; of these, 129 correspond to genes we have predicted on chromosome 16, and one of the RefSeq genes maps to Mmu 14 in our assembly. Genes on Mmu 16 cover, on average, 23,219 bases of chromosomal DNA, compared with 27,894 bases for genes in the human genome (1). This estimate is based on the average span covered by RefSeq transcripts, which represent the highest confidence set of gene predictions. A diagram of the gene content and other physical features of chromosome 16 is shown in Fig. 2.

Figure 2

Identification of Regions of Conserved Synteny

Despite the considerable evolutionary time since the ancestors of human and mouse lineages diverged, we expected that many segments in both genomes would be sufficiently conserved to be identified by similarity search between the two sequences while being substantially unique within each genome. Such segments would permit inference of orthologous regions (similarity by virtue of shared ancestry) between the two genomes. Orthologous mouse and human exons are frequently >80% sequence identical (19), and significant and unique matches between human and mouse have been noted in noncoding sequences, presumably as a result of selection pressures (structural RNAs, regulatory regions, and so forth) (20, 21) and perhaps also as a result of chance preservation of the sequence of the common ancestor. Conserved sequences in upstream putatively regulatory sequences have been noted between different species in the genus Caenorhabditis and between Drosophila melanogaster and D. virilis. Identification of such sequences provides a valuable complement to sets of orthologous protein sequences for several purposes, e.g., phylogenetic footprinting in regulatory regions (22), identification of novel gene candidates where no transcripts had previously been predicted, and finer-grained inference of conserved synteny and mapping of syntenic breakpoints. Consequently, paired segments of human and mouse sequence that are putatively orthologous were identified by comparison of the predicted proteins from each genome as well as by comparison of the genomic DNA sequence.

Synteny based on DNA comparisons. We use the term “syntenic anchors” to describe conserved locations in the two genomes that are identified by significant DNA sequence similarity and that constitute a bidirectionally unique match (two segments are designated as syntenic anchors if their alignment is the only significant match either segment shows to the other genome) (23). A total of 11,822 syntenic anchors mapped to chromosome 16 with a mean length of 198 bp and a mean identity of 88.1%; 11,496 chromosome 16 anchors matched human scaffolds that had been mapped to a specific chromosome, with an additional 326 anchors mapping to a human scaffold whose location in the genome is unknown (Table 3). This corresponds to an anchor about every 8 kbp. As expected, anchors are not uniformly spaced. Indeed, the largest “gap” between two anchors is 3,461,469 bp for human autosomes (24), and 20% of the (human) genome is such that the distance from one anchor to the next is at least 100,000 bp. In the mouse genome, the largest gap between anchors is 2,347,210 bp; for Mmu 16, it is 707,282 bp (12).
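The bidirectional-uniqueness criterion for syntenic anchors can be sketched in a few lines. This is an illustrative sketch only: it assumes the list of significant mouse-human matches has already been produced by some alignment step, and the function name and segment identifiers are hypothetical rather than taken from the analysis pipeline.

```python
from collections import Counter


def syntenic_anchors(matches):
    """Keep only matches where each segment has exactly one significant
    match in the other genome (a bidirectionally unique match).

    matches: list of (mouse_segment, human_segment) pairs.
    """
    mouse_hits = Counter(m for m, _ in matches)
    human_hits = Counter(h for _, h in matches)
    return [(m, h) for m, h in matches
            if mouse_hits[m] == 1 and human_hits[h] == 1]


# Hypothetical significant matches between mouse and human segments.
matches = [
    ("m1", "h1"),                # unique in both directions: an anchor
    ("m2", "h2"), ("m2", "h3"),  # m2 matches two human segments: rejected
    ("m4", "h5"), ("m6", "h5"),  # h5 matches two mouse segments: rejected
    ("m7", "h7"),                # unique in both directions: an anchor
]
anchors = syntenic_anchors(matches)
print(anchors)
```

Requiring uniqueness in both directions discards repeats and recently duplicated segments, which is why anchors defined this way can serve as unambiguous landmarks in both genomes.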

Table 3

Features of regions of synteny shared between Mmu 16 and various regions of the human genome.


We found that not only was there a substantially nonrandom distribution of anchors relative to genes, but also that anchors are not simply conserved exons; only 34% of all anchors overlapped annotated mouse exons, and 56% were found within the boundaries of annotated mouse genes. The remainder (44%) were in intergenic regions. Interestingly, the sizes of anchors did not differ among those found in genes and those found in intergenic regions. Moreover, as described below, the density of anchors appears to be much less affected by gene density than expected, which makes anchors an important complement to protein-based markers of conserved synteny.

Syntenic anchors between mouse and human are remarkably consistent. Over 50% of syntenic anchors on Mmu 16 are in runs of at least 128 in a row with the same order and orientation in each of the genomes and with no additional (intervening) anchors. In addition, 50% of the (human) genome is in such uninterrupted, ordered, and oriented blocks, defined solely by anchors, that are at least 994 kbp (12) long. These values are somewhat lower than would be expected given perfect, complete assemblies, because errors or unmapped sequences in one genome lead to breakpoints in the correspondence between the two genomes. Many of these breakpoints involve more than local inversion or transposition. We observed a number of small local rearrangements, inversions averaging 7.2 kbp and transpositions averaging 2.5 kbp, involving short runs of anchors (25). Of 193 breakpoints identified on Mmu 16 mapped scaffolds that were not coincident with either mouse or human scaffold boundaries, 64 were due to 32 local inversions and 10 were due to five local transpositions. By excluding breakpoints that resulted from single inconsistent anchors, we found 48 breakpoints on Mmu 16 not at scaffold boundaries, of which 22 are caused by 11 inversions and 2 are caused by a single transposition. The separation between adjacent anchors was consistently shorter in mouse than in human for pairs of anchors contained within the same scaffolds on both assemblies (mean 7167 bp for human, 6188 bp for mouse) and for pairs contained within the same contigs (2429 bp for human, 2034 bp for mouse), which is consistent with a smaller genome size for mouse than for human, as discussed more extensively below.
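The notion of a run of anchors with the same order and orientation, and of the breakpoints between runs, can be made concrete with a small sketch. The representation below is an assumption made for illustration: each anchor is given in mouse order as a tuple of human scaffold name, rank of the anchor along that human scaffold, and relative strand (+1 or -1). The function name and toy data are hypothetical.

```python
def collinear_runs(anchors):
    """Split anchors (listed in mouse order) into maximal runs that keep
    the same human scaffold, the same relative strand, and consecutive
    human ranks with no intervening anchor.

    anchors: list of (human_scaffold, human_rank, strand), strand in {+1, -1}.
    """
    runs, current = [], []
    for anchor in anchors:
        if current:
            prev = current[-1]
            consistent = (
                anchor[0] == prev[0]                   # same human scaffold
                and anchor[2] == prev[2]               # same orientation
                and anchor[1] - prev[1] == anchor[2]   # next rank along the strand
            )
            if not consistent:
                runs.append(current)
                current = []
        current.append(anchor)
    if current:
        runs.append(current)
    return runs


# Toy example: a two-anchor local inversion inside an otherwise collinear
# stretch, followed by a jump to a different human scaffold.
toy = [("hs21", 10, 1), ("hs21", 11, 1), ("hs21", 12, 1),
       ("hs21", 14, -1), ("hs21", 13, -1),
       ("hs21", 15, 1),
       ("hs3", 40, 1)]
runs = collinear_runs(toy)
run_lengths = [len(r) for r in runs]
breakpoints = len(runs) - 1
print(run_lengths, breakpoints)
```

In this toy case the isolated inversion is flanked by a breakpoint on each side, which matches the two-breakpoints-per-inversion accounting used in the text (for example, 22 breakpoints caused by 11 inversions).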

Mmu 16 shares regions of conserved synteny with six human chromosomes: 3, 8, 12, 16, 21, and 22 (Fig. 3) (26–31). The pattern of anchor distribution, in terms of lengths of runs of consistent anchors, is similar to that described above. Table 3 presents the total number of anchors between the regions of conserved synteny involving Mmu 16 and the corresponding portions of human chromosomes. Inconsistent anchors—i.e., anchors that are not consistently placed relative to their neighbors on both genomes—are also enumerated, and they account for only about 1.14% of all anchors between the regions of shared synteny.

Figure 3

Regions of conserved synteny between Mmu 16 and the human genome. The analysis was done at the protein level with the MUMmer program; each line represents a pair of orthologous genes present in the mouse and human. Cyto, cytogenetic markers; SCF, scaffold distribution; and Ori, orientation of scaffolds.

In general, regions of conserved synteny tend to be longer on the human chromosome than on the corresponding mouse chromosome (Fig. 3). The sizes (in kbp) of each of these chromosome segments in humans, relative to their mouse counterparts, are shown in Table 3. In every case, the human genomic block is larger than that of the corresponding segment in the mouse genome, even though it harbors a similar genetic content. In total, the mapped portion of Mmu 16 consists of 92 Mbp of DNA, whereas the sum of the corresponding human blocks is 108 Mbp. In the regions of conserved synteny with Mmu 16, for example, short interspersed nuclear elements (SINEs) account for 31.6% of the bases in the human genomic segments, whereas SINEs account for only 21.7% of the bases on Mmu 16. For the same regions, long interspersed nuclear elements (LINEs) account for 16.4% of bases in human and only 12.3% of bases in mouse. Genomewide, SINEs plus LINEs account for about 46% of bases in the human and only about 36% of bases in the mouse genome. This difference faithfully mirrors the genomewide disparity in the sizes of the mouse and human euchromatic regions.

Synteny based on protein comparisons. Another method used to identify regions of conserved synteny between the human genome and Mmu 16 was based on protein comparisons and originally was devised to identify human intragenomic duplications (1). This method identifies clusters of predicted proteins and their homologs that have conserved relative order on two different chromosomes. Identification of similar proteins is determined by the suffix-tree comparison method MUMmer (32) or, alternatively, by matches that have mutual best BLAST scores. Briefly, when at least three proteins (33) within a small interval along a chromosome can be aligned with three similar proteins along a target chromosome, this correlation of clusters is the basis for assertion of a “conserved synteny.” As applied to find syntenic stretches between the mouse and human genomes, this method found only one human chromosomal stretch for each Mmu block (34). These syntenic blocks consisted of dozens to hundreds of identically ordered genes on the mouse and human chromosomes (Fig. 3). Because similar syntenic blocks were delineated regardless of whether the criteria for protein matches were MUMmer identities between proteins or best BLASTP scores in the target genome, results for the two analyses were merged. At the criterion used (34), 99% of chromosome 16 could be mapped to single, unique homologous human chromosome segments.
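The cluster criterion described above (at least three similar proteins within a small interval on both chromosomes) can be sketched as follows. This is a simplified illustration, not the method of (1): the 1-Mbp gap threshold, the function name, and the coordinates are assumptions chosen for the example, and the actual criterion is the one given in (34).

```python
def synteny_clusters(pairs, max_gap=1_000_000, min_genes=3):
    """Group matched proteins into candidate blocks of conserved synteny.

    pairs: list of (mouse_pos, human_chrom, human_pos), one per mouse
    protein with a similar human protein. Consecutive pairs are joined
    when they hit the same human chromosome and lie within max_gap bp of
    each other in both genomes; groups with fewer than min_genes members
    are discarded. max_gap and min_genes are illustrative assumptions.
    """
    clusters, current = [], []
    for pair in sorted(pairs):  # sort by mouse position
        if current:
            prev = current[-1]
            close = (pair[1] == prev[1]
                     and pair[0] - prev[0] <= max_gap
                     and abs(pair[2] - prev[2]) <= max_gap)
            if not close:
                if len(current) >= min_genes:
                    clusters.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_genes:
        clusters.append(current)
    return clusters


# Hypothetical coordinates (bp); not taken from the Mmu 16 annotation.
toy_pairs = [
    (100_000, "Hsa 16", 5_000_000),
    (400_000, "Hsa 16", 5_300_000),
    (900_000, "Hsa 16", 5_900_000),
    (1_500_000, "Hsa 8", 70_000_000),   # isolated hit, too few to form a block
    (2_000_000, "Hsa 22", 20_000_000),
    (2_300_000, "Hsa 22", 19_800_000),  # human order reversed: an inverted block
    (2_700_000, "Hsa 22", 19_500_000),
    (3_000_000, "Hsa 22", 19_100_000),
]
clusters = synteny_clusters(toy_pairs)
cluster_sizes = [len(c) for c in clusters]
print(cluster_sizes)
```

The sketch uses the absolute human distance so that blocks inverted relative to the mouse are still recovered; it does not attempt to enforce strict gene order within a block.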

For Mmu 16, the 731 gene predictions ordered along the mouse chromosome matched genes from seven distinct syntenic blocks on six human chromosomes. These blocks, in the order of their alignment along Mmu 16 from telomere to telomere, are Hsa 16, 8, 12, 22, 3 (two separate blocks aligning to two different regions of Hsa 3), and 21 (Figs. 3 and 4; Table 3). The findings agree with previous partial descriptions of mouse-human conserved synteny, with the exception of a region on Hsa 12 that has not been previously described, and augment those findings with greatly increased resolution. Ninety-eight percent (717/731) of the high-confidence gene predictions have homologs in the human genome, and 76% (556/731) have homologs in the corresponding syntenic block in human. Ninety-two percent (509/556) of these homologs are likely to be orthologs (35) of the human genes, and the remaining 8% (44/556) represent unequal local expansions of these genes, that is, paralogs that have arisen since the mouse and human lineages diverged. As discussed below, the 509 pairs of homologous genes that map to regions of conserved synteny between the mouse and human genomes are very likely to be orthologs; this assertion is weaker for the 164 pairs of homologous genes that map elsewhere. Most of these pairings probably do not represent orthologous relationships, because the percentage of identical residues for the orthologs as asserted above is 86%, whereas these pairings have, on average, 69% identical residues and because these pairings are not reciprocal best matches between the human and mouse proteins. Figure 4 illustrates the differences in the distributions of expectation values for these two sets of homologous genes.
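The reciprocal-best-match test mentioned above, which the 164 homolog pairs outside the syntenic blocks fail, can be illustrated with a short sketch. The score tables, protein names, and function names below are hypothetical; in practice the scores would come from a similarity search such as BLASTP run in both directions.

```python
def best_hits(scores):
    """Return the best-scoring subject for each query.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(mouse_vs_human, human_vs_mouse):
    """Pairs (mouse, human) where each protein is the other's best hit."""
    forward = best_hits(mouse_vs_human)
    reverse = best_hits(human_vs_mouse)
    return sorted((m, h) for m, h in forward.items() if reverse.get(h) == m)


# Hypothetical similarity scores.
mouse_vs_human = {("mA", "hA"): 950, ("mA", "hB"): 400,
                  ("mB", "hA"): 600, ("mB", "hC"): 300}
human_vs_mouse = {("hA", "mA"): 940, ("hA", "mB"): 610,
                  ("hB", "mA"): 410, ("hC", "mB"): 310}
rbh = reciprocal_best_hits(mouse_vs_human, human_vs_mouse)
print(rbh)
```

Here mB has a human homolog (hA) but is not hA's best match in return, so the pair is a homolog pairing without support for orthology, analogous to the weaker assertion made for genes that map outside the syntenic blocks.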

Figure 4

Distribution of mouse-human orthologs and best hits mapping to syntenic and nonsyntenic regions based on sequence homology and expect value scores. A log expect value score of <1 × 10^−180 is represented as 0.

For 14 mouse genes, we could find no related human genes in either our assembly of the human genome (1) or in any other databases (see supplementary table 1 on Science Online at www.sciencemag.org/cgi/content/full/296/5573/1661/DC1). These mouse genes could either be specific to the mouse genome, or they could have functionally active, but extremely diverged (i.e., unrecognizable), counterparts in the appropriate region of conserved synteny, or elsewhere, in the human genome. In particular cases, we found evidence for remnants of small open reading frames in syntenic regions of the human genome that correspond to remnants of orthologous mouse genes. These sequences are noncoding and almost certainly decayed footprints (pseudogenes) of orthologous mouse-human pairs. Finally, we found an additional 33 predicted “genes” that are related to retroviruses and another 65 that are pseudogenes. The last two categories are not included in the 731 predicted protein coding genes assigned to Mmu 16.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 21. The region of conserved synteny between Mmu 16 and Hsa 21 corresponds to 22.37 Mbp of the mouse chromosome and 28.42 Mbp of the human chromosome. This region contains major determinants for Down syndrome, and trisomy of this region through a Robertsonian translocation (36) and segmental trisomy (37, 38) have provided mouse models for Down syndrome. We have been able to assign 111 orthologous gene pairs to this region, which extends from near the human STCH gene (microsomal stress adenosine triphosphatase core) to the ZNF295 gene (a Kelch-related transcription factor). The order of the 111 mouse genes in our assembly of Mmu 16 corresponds exactly to the order of published human genes in this region (7) except for one small, previously known inversion (39) near the Bace2 gene (position 93.2 Mb in Fig. 2).

Gene density within this syntenic region is not uniform. For example, there is a region on Hsa 21 with only 7 genes in 7.8 Mbp from PRSS7 (an enterokinase) to APP (the amyloid A4 precursor protein implicated in Alzheimer's disease). The corresponding region in the mouse is slightly smaller, about 6 Mbp, and has the same gene content with no addition or loss of transcription units. The sizes of the orthologous proteins are similar, indicating that there have been no major gains or losses of protein domains within the orthologs. The size and coding capacity of this syntenic semidesert have remained largely intact since the divergence of lineages leading to humans and mice. In addition, about 400 conserved syntenic anchors are evenly spread throughout this semidesert. These conserved regions do not correspond to any of the repetitive elements that occur throughout this semidesert, and the reasons for their conservation remain obscure.

To contrast a gene-poor region with a more populated one, we examined the 17-gene region on Hsa 21 extending from the human interferon α/β receptor (IFNAR2) to DSCR1 (the human Down syndrome candidate region protein). This region is ∼1.08 Mbp in length in mouse and 1.4 Mbp in human; it contains the human genes IFNAR2, IL10RB, IFNAR1, IFN, GART, RPS5L, ATP50, SON, ITSN, KCNE1, and DSCR1, most of which are implicated in predispositions to various human diseases and one of which, the interleukin-10 receptor β (IL10RB), is a known drug target. This portion of Mmu 16 has the same complement of genes in exactly the same order.

We also compared our automated annotation with a previously published annotation of the 4.5-Mbp region from Cbr (carbonyl reductase) to Tmprss2 (transmembrane protease, serine 2) (39). One mouse gene in this region, Itgb21 (integrin-β2-like), does not have an ortholog in the syntenic region of Hsa 21 between B3galt5 (β1,3-galactosyltransferase) and Pcp4 (Purkinje cell protein 4). We confirmed the absence of a detectable ortholog of Itgb21 and found an additional gene between Itgb21 and Pcp4.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 16. The region of conserved synteny corresponds to ∼10.32 Mbp of Mmu 16 and 12.33 Mbp of Hsa 16p. There are 87 orthologs that extend from near the ortholog of the zinc finger gene Znf174 to the gene for the multidrug resistance protein Abcc1. Of these 87 orthologous pairs, only one is in a noncolinear position between the two genomes.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 8. This region of conserved synteny covers about 1.29 Mbp of Mmu 16, which corresponds to a 1.39-Mbp region of Hsa 8. There are six orthologous gene pairs in this region.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 12. This particular region of synteny between the two genomes was previously unidentified. Based on its content of orthologous genes and anchors, we find that it constitutes a 0.363-Mbp region of Mmu 16 and a 0.470-Mbp segment of Hsa 12. The human segment consists of only three genes, all of which have orthologs in the mouse.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 22. The region of conserved synteny between Mmu 16 and Hsa 22 corresponds to 2.08 Mbp of the mouse chromosome and 2.27 Mbp of the human chromosome. The gene content is 30 orthologous loci that are in conserved order except for a block of eight genes, which is inverted in mouse relative to human. The entire region of conserved synteny extends from near the Top3b (DNA topoisomerase III β1) gene to near the HIRA (histone cell cycle regulation defective, Saccharomyces cerevisiae) homolog A gene. Previous reports (40) showed a segment of Mmu 16 distal to the portion corresponding to Hsa 22 as corresponding to a portion of Hsa 18. We find no syntenic region or even single genes related to those on Hsa 18. Indeed, the single reliable datum that had been used to establish this conserved synteny, the EIF4a gene, was found in both recent publications of the human genome sequence (1, 2) to map to Hsa 3q27.2. This places it within the syntenic block between Hsa 3q27–29 and Mmu 16 that follows the block of Mmu 16 syntenic to Hsa 22, and not within Hsa 18.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 3q27–29. The region of conserved synteny between Mmu 16 and Hsa 3q27–q29 corresponds to ∼13.56 Mbp in mice and 16.46 Mbp in humans. This region has 107 orthologous pairs of genes that have a conserved order between mouse and human.

Description of individual syntenic regions: the region of Mmu 16 corresponding to a block of Hsa 3q11.1–13.3. The region of conserved synteny between Mmu 16 and Hsa 3q11.1–13.3 corresponds to ∼41.66 Mbp in mice and 46.49 Mbp in humans. The gene order for mouse and human appears to be conserved across this region. Much of this large chromosome segment is gene-poor in both the mouse and human genomes. This segment contains 165 orthologous pairs of genes, or about 4.15 genes (with orthologs in the human genome) per Mbp on Mmu 16 versus an average of 8.25 genes per Mbp in the other region of conserved synteny with Hsa 3 described above. The gene density is even lower for this segment of Hsa 3 with 3.72 genes per Mbp contrasting with 6.8 genes per Mbp for the 3q27–29 interval.
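The block sizes quoted in the region descriptions above can be totalled as a consistency check against the overall figures of about 92 Mbp for mapped Mmu 16 and about 108 Mbp for the corresponding human blocks. The sketch below simply re-adds the sizes stated in the text; the variable names are arbitrary.

```python
# (mouse Mbp, human Mbp) for each block of conserved synteny, as quoted
# in the descriptions of the individual syntenic regions.
blocks = {
    "Hsa 16": (10.32, 12.33),
    "Hsa 8": (1.29, 1.39),
    "Hsa 12": (0.363, 0.470),
    "Hsa 22": (2.08, 2.27),
    "Hsa 3q27-29": (13.56, 16.46),
    "Hsa 3q11.1-13.3": (41.66, 46.49),
    "Hsa 21": (22.37, 28.42),
}

mouse_total = sum(mouse for mouse, _ in blocks.values())
human_total = sum(human for _, human in blocks.values())

# Per-block expansion of the human segment relative to the mouse segment.
ratios = {name: human / mouse for name, (mouse, human) in blocks.items()}

print(f"mouse total: {mouse_total:.2f} Mbp")
print(f"human total: {human_total:.2f} Mbp")
print(f"overall human/mouse ratio: {human_total / mouse_total:.2f}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Every per-block ratio exceeds 1, in line with the statement that the human block is larger than its mouse counterpart in every case.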

The Mosaic Nature of Mammalian Chromosomes

The mosaic patterns in the organization of various mammalian genomes must, in part, reflect the rearrangements of these genomes over their evolutionary history. The relation between various structural features of Mmu 16, including gene density, G+C content, expressed sequence tag (EST) density, SINE density, and LINE density, and how these features relate to the boundaries of regions of conserved synteny with the human genome is shown in Fig. 5. Three of the six syntenic boundaries (those between the regions of conserved synteny between Mmu 16 and Hsa 16 and Hsa 8, Hsa 12 and Hsa 22, and Hsa 22 and Hsa 3q27) show marked discontinuity for several of these features, namely, G+C content, SINE density, and LINE density (Fig. 5). Not only are there sharp discontinuities at these boundaries, but the density of the feature is often more uniform, and divergent from the flanking regions, in the syntenic block than over the local chromosomal neighborhood. This is probably best illustrated by the region of shared synteny between Mmu 16 and Hsa 22. G+C content on Mmu 16 immediately (2 Mbp) before the portion syntenic to Hsa 22 is about 40% and abruptly changes to nearly 50%, which is maintained across the entire 2 Mbp of the Hsa 22 syntenic region, and this drops to below 40% immediately following the Hsa 22 region (Fig. 5). This pattern holds for SINE and LINE density in this region. Here, the mosaic nature of this portion of Mmu 16 relative to human would seem to be explained by the breaking and joining of chromosomal regions, as they existed in an early ancestor, with very disparate properties. The implications of these discontinuities, and their absence at other boundaries (Fig. 5), for the mosaic pattern of mammalian chromosome evolution are discussed below.

Figure 5

Correlations of gene density (red), G+C content (green), EST density (blue), SINE density (purple), and LINE density (brown) between Mmu 16 and the regions of conserved synteny in the human genome. The syntenic boundaries are labeled A (Hsa 16 and Hsa 8), B (Hsa 8 and Hsa 12), C (Hsa 12 and Hsa 22), D (Hsa 22 and Hsa 3q27–29), E (Hsa 3q27–29 and Hsa 3q11.1–13.3), and F (Hsa 3q11.1–13.3 and Hsa 21).

Discussion

What are the salient features that emerge from an initial comparison of a mouse chromosome with the regions of conserved synteny in the human genome? The first is that large regions of this chromosome have been remarkably conserved during the more than 100 million years that have elapsed since the lineages leading to humans and mice diverged. One-third of Mmu 16, ∼32.8 Mbp, that has conserved synteny with Hsa 16 and Hsa 21, has preserved gene content and gene order, with only two exceptions. The remaining 60 Mbp correspond to four other human chromosomes in five conserved syntenic segments. Another measure of the remarkable conservation between the corresponding regions comes from the analysis of syntenic anchors. The order and orientation of these anchors are highly conserved; of the 11,496 sequence pairs that were identified between Mmu 16 and the corresponding regions of the human genome, only about 1% were not strictly conserved in order and orientation relative to their neighbors.

Of the 731 protein-coding genes that we have identified on Mmu 16, 509 have orthologs in the human genome based not only on their sequence similarity but also on conservation of their map position in regions of shared synteny, as determined by the local consistency of syntenic anchors. Although the assertion of orthology can be problematic (35), this combination of features is currently the strongest evidence available for such an assertion, which is important because it allows one to predict conservation of function with greater confidence, especially in those gene families in which inferences about functional relations between family members in the two species are difficult to interpret owing to large-scale expansion.

We found 14 putative genes on Mmu 16 for which we could find no counterpart in humans. Thus, the percentage of genes that are unique to the mouse lineage, based on what we have found from chromosome 16 (14 of 731 genes), is likely to be ∼2%. A similar figure, 2.9% (21/725), is found for genes that are unique to humans, based on the number of genes that are present in humans and absent in mice in the regions of conserved synteny with Mmu 16.
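The percentages quoted above follow directly from the reported counts. The short Python sketch below reproduces the arithmetic; the variable and function names are illustrative assumptions, and the four-way breakdown of the 731 predicted genes is taken from the abstract.

```python
# Gene counts for Mmu 16 as reported in the abstract of this article.
MMU16_GENES = {
    "ortholog_in_syntenic_region": 509,
    "likely_paralog": 44,
    "homolog_elsewhere_in_human": 164,
    "no_human_counterpart": 14,
}

def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# The four categories together account for all predicted genes on Mmu 16.
total_mouse = sum(MMU16_GENES.values())

# Genes on Mmu 16 with no human counterpart (14 of 731).
mouse_only = percent(MMU16_GENES["no_human_counterpart"], total_mouse)

# Genes in the human syntenic regions that are absent in mice (21 of 725).
human_only = percent(21, 725)

print(f"mouse-only: {mouse_only:.1f}%")
print(f"human-only: {human_only:.1f}%")
```

The mouse figure comes out slightly under the rounded ∼2% given in the text, and the human figure matches the quoted 2.9%.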

The differences in gene number in any given region of conserved synteny are mainly attributable to differences in the number of paralogous genes in one species compared with the size of the locally expanded family in the other species, although the differential use of exons sometimes produces orthologous proteins that differ by a protein domain and hence have functional consequences (41). At a more detailed level, our results are consistent with, and extend, a number of smaller scale efforts that have compared parts of the mouse and human genomes, such as the 4.5-Mbp region of Mmu 16 from Cbr1 to Tmprss2 (39), as well as an analysis in which Hsa 19 was compared with corresponding portions of the mouse genome (42).

Numerous authors have noted mosaic patterns in the organization of the mammalian genome (43, 44). The distributions of many features such as gene density, G+C content, and density of repetitive elements are not uniform across the genome. Besides varying greatly among chromosome segments, the distributions of these features are not smooth but are characterized by rather sharp discontinuities. The reasons for these patterns are unclear. Mammalian evolution has been accompanied by rearrangements of the ancestral karyotype, leading to a wide divergence of chromosome number, from 3 pairs in the muntjac deer to 67 pairs in the black rhinoceros (45). There can be very extensive rearrangements even between closely related species such as two species of muntjac deer that have 3 and 23 pairs of chromosomes, respectively. The genomic rearrangements in these deer species have been accompanied by minimal effects on morphology, and they can be interbred to yield viable and healthy hybrids (46).

The remarkable conservation that we have observed in the regions of shared synteny between portions of Mmu 16 and the human genome suggests that these regions preserve the basic character of the regions as they existed in the early mammalian ancestor of both the murine and primate lineages. A large descriptive and empirical literature, together with considerable analytical, theoretical, and computational work on this problem, is largely consistent with the “random breakage” model first proposed by Nadeau and Taylor (47) and with the most recent analysis (2).

Examining the various genomic features at the boundaries between syntenic regions (Fig. 4) provides an improved understanding of the mosaic patterns of the evolution of mammalian chromosomes. Notably, syntenic boundaries that do not show sharp transitions in these various features may provide evidence for conservation of the original (ancestral) pattern in the lineage that lacks such a transition. On Mmu 16, the boundary between the region syntenic to Hsa 3q11.1–13.3 and the region syntenic to Hsa 21 shows no sharp discontinuity for any of these features. Is there any evidence that the configuration seen in Mmu 16 represents an ancestral arrangement whereas the different arrangement in humans, where this region is split between Hsa 3 and Hsa 21, is derived? Chromosome 1 of the bovine genome shares 10 genes with Mmu 16, 5 of which are found on Hsa 3 and 5 of which are found on Hsa 21. This supports the Mmu 16 configuration as being ancestral. Given a monophyletic origin of mammals between 150 and 200 million years ago, it is remarkable that the signature of the ancestral genome structure is still preserved. As we extend our analysis of the mouse and other mammalian genomes, we may hope to reconstruct the ancestral mammalian karyotype, a problem that has been called “original synteny” (48).

The syntenic regions in the human genome are about 10% larger than those in the mouse, and this difference results from a larger proportion of repetitive sequences (SINE and LINE elements) in the human genome. Certain chromosomal regions such as the gene-poor desert in both Mmu 16 and Hsa 21 have largely retained their size since the evolutionary separation of the two lineages. Whether this is due to functional constraints or simply to the slow rate of DNA loss or addition is not known. Previous studies of small-scale sequenced chromosomal regions of similar genic content between the human and mouse genomes have revealed that, in general, a given human segment is larger than the corresponding mouse segment (15, 42, 49). Reassociation data originally suggested the existence of a fraction of human dispersed repetitive DNA that was lacking from the mouse genome (50, 51). A major contributor to this difference is the substantially larger fraction of SINE elements in the human genome as compared with that of the mouse, an observation that others have made for smaller sets of comparative data (42).

The insertion of retroviral sequences is one of the more dynamic processes in genome evolution. This phenomenon has been widely studied (52) and accounts for many of the differences between the mouse and human genomes. The distribution of LINE elements, the general class of retrotransposons, in mammals is radically different between mouse and human (12). Comparative analysis yields the following summary: 33 retroviral-related sequences were found on Mmu 16, compared with 19 in the corresponding syntenic regions of the human genome. A full review of this topic is beyond the scope of this study.

Only with genome assemblies that are robust and have extensive long-range contiguity can the sorts of analyses reported here be obtained. Examples such as those highlighted here illustrate the power and importance of comparative genomics to clarify the relationships between the genomes of various organisms and to understand where they are sufficiently conserved that they specify similar biology. The availability of sequenced mouse and human genomes holds the promise that we shall be able to more rapidly identify the genes and associated regulatory elements that are critical to any biological phenomenon and to filter and validate those that are relevant to intervention in disease. Although the work presented here represents only a small beginning in analysis of the mouse genome, it is clear that the tools are in place for comprehensive studies of the evolutionary and functional relationship between the mouse and human genomes (53a).

  • * To whom correspondence should be addressed. E-mail: richard.mural@celera.com

  • Present address: TIGR Center for the Advancement of Genomics, 1901 Research Boulevard, Suite 600, Rockville, MD 20850, USA.
