Review

Conservation and Novelty in the Evolution of Cell Adhesion and Extracellular Matrix Genes


Science  11 Feb 2000:
Vol. 287, Issue 5455, pp. 989-994
DOI: 10.1126/science.287.5455.989

Abstract

New proteins and modules have been invented throughout evolution. Gene “birth dates” in Caenorhabditis elegans range from the origins of cellular life through adaptation to a soil habitat. Possibly half are “metazoan” genes, having arisen sometime between the yeast-metazoan and nematode-chordate separations. These include basement membrane and cell adhesion molecules implicated in tissue organization. By contrast, epithelial surfaces facing the environment have specialized components invented within the nematode lineage. Moreover, interstitial matrices were likely elaborated within the vertebrate lineage. A strategy for concerted evolution of new gene families, as well as conservation of adaptive genes, may underlie the differences between heterochromatin and euchromatin.

The genome of the nematode Caenorhabditis elegans, now fully sequenced, affords remarkable insights into the origin and nature of multicellular life (1). Moreover, it raises challenging, often unforeseen, questions about the molecular processes and evolutionary consequences of genome change. Some 20% of C. elegans genes have orthologs in the budding yeast Saccharomyces cerevisiae (2) that function in cellular processes common to all eukarya. Beyond those shared with yeast, about 30% of C. elegans genes have known orthologs in insects or vertebrates that are involved in developmental and physiological processes common to all higher animals (3–5). The remaining genes are thus far found only in nematodes (6). About half are single-copy genes and could represent ancient genes not yet discovered in other phyla. If so, as many as 50% of all C. elegans genes arose sometime between the radiations of cellular eukarya [about 2 gigayears ago (Gya)] and metazoa (about 0.8 Gya), and are therefore expected to be found in all higher animals. C. elegans is an excellent experimental model for studying conserved functions of these inherently metazoan genes. Finally, comparison of the C. elegans genome with other nematodes, and with itself, reveals robust, ongoing processes of gene invention (7–9).

We examined the evolution of extracellular matrix and cell adhesion molecules, protein classes that frequently overlap in structure or interact molecularly. To identify candidate genes, we used representative insect and mammalian proteins, or their fragments, as queries for BLAST searches of Wormpep (10, 11). From these initial hits, we performed reciprocal BLAST searches to identify potential insect or mammalian orthologs in GenBank, and to expand the sample of nematode proteins. Direct searches against Wormpep allowed identification of nematode-specific protein domains and families (12). For all protein domains summarized in Web table 1 (13) and discussed below (12, 14), this search cycle proved a sensitive means of detection with no false negatives to the best of our knowledge. To confirm known protein domains and to count tandem repeats, we used Pfam profiling with the hidden Markov model algorithm, HMMer (15); profile parameters were set to their most sensitive value, allowing for module fragments due, for example, to imperfect GENEFINDER predictions. We used manual sequence alignment assisted by CLUSTAL to define new protein domains. Potential signal sequences, transmembrane helices, or glycosyl-phosphatidylinositol (GPI)-anchoring signals were identified by PSORTII (16). Finally, these genes were sorted by chromosome position with known genetic loci. Proteins with orthologs in all eukarya, for example, ribosomal proteins, histones, and tubulins, were included for comparison. Genes with known mutations were identified from a C. elegans database (ACEDB); cDNA matches were identified from BLASTN searches, or counted from online lists, correcting for duplicate entries of 5′ and 3′ expressed sequence tags (ESTs) from a single cDNA clone (17). Genes are identified below by their Wormpep accession numbers (11). In addition, where available, protein and gene names are appended to these Wormpep accession numbers using colons and parentheses, respectively. Where a single protein apparently comprises two or more separate database entries, we list NH2-terminal fragments first, e.g., ZK944.4/ZK944.3. Further information regarding our analysis is described online at www.mpimf-heidelberg.mpg.de/ewgdn/genome_paper/.
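The reciprocal search cycle described above can be illustrated with a minimal sketch. It assumes a "reciprocal best hit" criterion, a common simplification that the text itself does not specify, and it operates on toy hit tables rather than real BLAST output. The function names, sequence identifiers, and scores are all hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples, e.g. parsed
    from tabular search output (the parsing step is not shown here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: insect queries against worm proteins, and back.
forward = [
    ("fly_1", "worm_A", 250.0),
    ("fly_1", "worm_B", 90.0),
    ("fly_2", "worm_B", 120.0),
]
reverse = [
    ("worm_A", "fly_1", 240.0),
    ("worm_B", "fly_3", 130.0),
    ("worm_B", "fly_2", 110.0),
]

pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
```

In this toy data only fly_1 and worm_A select each other, so only that pair is reported as a candidate ortholog. The worm_B hit for fly_2 is not reciprocated and would need manual inspection.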

Basement Membrane Proteins and Receptors

Basement membranes are polymeric sheets of laminin, collagen IV, and associated proteins found on the basal surfaces of epithelia and condensed mesenchyma that provide a substratum for attachment and present a barrier to cell mixing during development (18). Basement membrane components are among the oldest and most conserved extracellular matrix proteins (19). In C. elegans, two distinct laminin molecules, designated αAβγ and αBβγ, arise from four laminin chain genes, αA::T22A3.8, αB::K08C7.3 (epi-1), β::W03F8.5 (lam-1), and γ::C54D1.5 (Fig. 1). Comparison of these four genes suggests that exchange between two genes of protolaminin (with subunit composition δɛɛ) resulted in two parental, δNδC and ɛNɛC, and two recombinant, δNɛC and ɛNδC, chains, seen in nematodes today. Laminins have duplicated further within the vertebrate lineage. Thus, αA and αB branches split into α1/α2 and α3/α4/α5 chains, respectively (20).

Figure 1

Organization of selected basement membrane molecules. Lettered boxes represent identified protein motifs; numerals indicate the lengths of uncharacterized sequences (not drawn to scale). Extracellular modules and catalytic domains are described online (56). Other features, including intracellular domains, are abbreviated: C4 (collagen IV COOH-terminus), C18 (collagen XV/XVIII COOH-terminus), Ca++ (calcium-binding domain), CC (coiled-coil domain), and 7S (collagen IV NH2-terminus). The suffix “#n” indicates the number of genes with this same organization. Cell adhesion molecules are available in Web figure 1 (13).

Basement membrane collagens are encoded by three genes, α1(IV)::K04H4.1 (emb-9), α2(IV)::F01G12.5 (let-2), and α1(XV/XVIII)::F39H11.4. Other basement membrane proteins have unique representatives, for example, agrin::F41G3.8, fibulin::F56H11.1, Kallmann-syndrome protein::K03D10.1, nidogen::F54F3.1 (nid-1), osteonectin::C44B12.2 (ost-1), and perlecan::ZC101.2 (unc-52). Several, thus far novel, matrix proteins are required for cellular attachments of mechanosensory neurons and other tissues. This category includes hemicentin::F15G9.4 (him-4), MEC-5::E03G2.3, and MEC-9::C50H2.3a (21). Adamalysin, astacin, matrixin, and neprilysin, which are among the families of metalloproteases observed, are implicated in extracellular matrix interactions or neuropeptide processing (22).

Integrins couple assembly of various extracellular matrix and cytoskeletal polymers (23). In C. elegans, there are two α chains, INA-1::F54F2.1 and PAT-2::F54G8.3, representing the laminin-binding and RGD-binding branches of this family, respectively, and two β chains, INB-1::C05D9.3 and PAT-3::ZK1058.2, allowing four possible heterodimers. The patterning of these receptors, together with the two laminins, may specify where apposing basement membranes should fuse together or remain separate.
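The count of four possible heterodimers follows from pairing each α chain with each β chain. The short sketch below only enumerates these combinations; the variable names are illustrative, and the listing says nothing about which pairs actually assemble in vivo.

```python
from itertools import product

# Chain names as given in the text; one alpha pairs with one beta.
alpha_chains = ["INA-1", "PAT-2"]
beta_chains = ["INB-1", "PAT-3"]

heterodimers = [f"{a}/{b}" for a, b in product(alpha_chains, beta_chains)]
print(len(heterodimers), heterodimers)
```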

The dystroglycan-dystrophin complex couples laminin and agrin to the membrane cytoskeleton in skeletal muscle and other tissues (24). In humans, mutations in various components of this complex, including dystrophin and sarcoglycans, as well as laminin α2 itself, cause muscular dystrophies. Components of the dystroglycan complex have conserved C. elegans orthologs, for example, dystroglycan::T21B6.1, α/ɛ-sarcoglycan::H22K11.4, β-sarcoglycan::K01A2.1 and γ/δ-sarcoglycan::F07H5.2, dystrophin::F15D3.9/F32B4.3 (dys-1), and various syntrophins. Family T07D3.4 encodes homologs of the secreted protein fukutin, implicated in Fukuyama-type congenital muscular dystrophy (25).

What extracellular matrix proteins are not found in nematodes? Like connective tissue itself, interstitial matrix polymers, such as elastin, fibrillar collagen, and fibronectin, are generally absent in C. elegans, suggesting that these genes were elaborated within the vertebrate lineage. Fibrillar collagen genes occur in other invertebrates, however, indicating that this and perhaps other matrix components have been lost in the nematode lineage (26).

Cell Adhesion Molecules

We classified cell adhesion molecules and other receptors into superfamilies on the basis of their NH2-terminal domains, which are largely responsible for ligand recognition. The five largest superfamilies (27) begin with CA, EG, IG, LA, or LR repeats [(14), Web table 1 and Web figure 1 (13)]. For convenience, we refer to these proteins collectively as cadherins, EgfCAMs, IgCAMs, LdlCAMs, and LrrCAMs, respectively. Finally, several well-known cell adhesion molecules that defy easy classification have apparent orthologs, for example, chondroitin sulfate proteoglycan NG2::C48E7.6, polycystins::ZK945.10/.9 (lov-1) and Y73F8A.B, selectin::C54G4.4, and syndecan::F57C7.3.

IgCAMs. Twenty-six genes encode predicted transmembrane or GPI-anchored proteins with extracellular IG modules (Web figure 1). With the exception of Lrr(Ig)CAMs (discussed below), the IG domain is located at the NH2-terminus. Thirteen proteins, Ig(F3)CAMs, have one or more fibronectin III (F3) modules after their IG domain, whereas one protein, UNC-5::B0273.4, has thrombospondin 1 (T1) modules instead. Two neuregulin-like proteins, F28E10.2 and F48C5.1, contain an EG module after their IG domain. The intracellular domains of the transmembrane IgCAMs have characteristic enzymatic and binding functions. Six additional genes encode extracellular matrix proteins with IG domains.

The IgCAMs are principal mediators of cell recognition and adhesion in the developing nervous system and other tissues (28). These proteins act combinatorially among themselves, and with receptors from other structural families, to pattern cell movements and attachment. In particular, GPI-anchored, or small soluble IgCAMs (discussed below), might directly modulate the ligand specificities of transmembrane IgCAMs. Whereas IG and F3 motifs are found in a few yeast proteins (2), the combination of these motifs and their recruitment for cell adhesion occurred within the metazoan lineage. The immunoglobulin superfamily has remained static in the nematode lineage. Nine of the thirteen Ig(F3)CAMs in C. elegans have known orthologs in insects and vertebrates. We predict that the various new IgCAMs identified in this species have as yet undiscovered orthologs in other animals.

IgCAMs and their effectors, an ancient mechanism evolved for adhesive recognition, were adapted for antigen recognition in the vertebrate immune system. In lymphocytes, antigen receptors signal through fyn and related src family protein tyrosine kinases to elicit cellular responses. In neurons, L1 and neural cell adhesion molecule (NCAM) can signal through these same nonreceptor tyrosine kinases (29). Covalent association of antibody heavy and light chains contributes to the specificity and versatility of antigen recognition. Reminiscent of these light chains, C. elegans has seven secreted proteins comprising just two tandem IG modules. Conceivably, these chains complex with membrane-associated IgCAMs to form functional receptors.

Semaphorins and plexins. Five genes encode proteins with a semaphorin domain at their NH2-terminus including three semaphorins, semaphorin I::Y54E5B.1, semaphorin II::Y71G12A_05.G and D1037.2, and two plexins, K04B12.1 and Y55F3B_45.A (Web figure 1). Semaphorins are guidance cues in the developing nervous system, whereas plexins, acting as semaphorin receptors, mediate growth cone collapse (30).

LrrCAMs. Twenty-three genes encode proteins with an extracellular LR-repeat domain at their NH2-terminus including apparent orthologs of slit::C26G2.C/F40E10.4, peroxidasin::K09C8.5 and ZK944.4/.3, chaoptin::C56E6.6, 18-wheeler::T05A1.3, and FSH/TSH-receptor::C50H2.1 (Fig. 1). Three of these proteins, designated Lrr(Ig)CAMs, have one or more IG modules following their LR-repeat domain, for example, GAC1::F20D1.7 and LIG-1::T21D12.9. Several LrrCAMs have been implicated in adhesive recognition in the nervous system, including regulation of synapse formation (31). Additional genes encode proteins with intracellular LR-repeat domains, which are possibly unrelated motifs that converged onto a similar protein fold (32).

Cadherins, latrophilins, and neurexins. Ten genes encode classical cadherins with a CA repeat domain at their NH2-terminus; three more genes encode FAT-related cadherins where a crumbs-like region of alternating EG and laminin G (LG) modules follows the CA domain. Classical and FAT-related cadherins are implicated in both general adhesion between cells and specialized junctions, for example, adherens, desmosomes, and synapses (33). One additional cadherin, CELSR1::F15B9.7, has a FAT-related NH2-terminus followed by laminin epidermal growth factor–like (LE) repeats and a latrophilin-related COOH-terminus (Web figure 1); a mammalian ortholog is expressed in the nervous system (34).

Latrophilin and neurexin are presynaptic membrane proteins identified as receptors for latrotoxin, a neurotoxin from black widow spider venom that triggers massive, unregulated exocytosis of synaptic vesicles from nerve terminals (35). Like CELSR1, latrophilins are members of the secretin receptor family, an ancient branch of serpentine receptors implicated in secretory coupling. Two genes encode latrophilins, B0286.2 and B0457.1, and three more genes encode secretin receptor–related proteins without large extracellular domains. These proteins could play roles in synapse formation, triggering exocytosis in response to potential synaptic targets, or synapse maintenance (36). Finally, five genes, including axotactin::W03D8.6, crumbs::F11C7.4, neurexin I/II/III::C29A12.4 and neurexin IV::F20B10.1, encode crumbs-like receptors with alternating EG and LG repeats.

New Proteins Combine Novel Modules and Select Old Parts

New genes arise from specific, often novel, feedstock, which changes over time, suggesting that not all regions of eukaryotic genomes are equally available for gene invention. Some motifs used for gene invention in early metazoans seem inert today, whereas once minor or entirely novel sequences have become important within the nematode or chordate lineages. We identified more than 40 ancient protein motifs [(14) and Web table 1], clearly predating the metazoan radiation, found in the extracellular domains of C. elegans proteins (27). By far the most promiscuous extracellular module, epidermal growth factor (EG) appears in 30 distinct structural contexts. At the other extreme, the LN motif, implicated in polymerization, occurs at the NH2-terminus of laminin chains and netrin but nowhere else. Remarkably, most ancient motifs occur in a stable set of contexts from nematodes through chordates. However, some have been used for new “gene shuffling” within specific lineages. For example, CK, FC, FS, KR, SR, and VD motifs are more promiscuous in vertebrates than nematodes (37). By parsimony, those contexts shared with nematodes likely reflect the ancestral functions of these domains. Conversely, CL and KU modules have been recruited for many novel contexts in nematodes (Web figure 1).

Some extracellular motifs present in vertebrates are apparently absent in C. elegans, and vice versa, suggesting new protein modules have been invented more or less continuously throughout eukaryote evolution (2, 6, 38). Twenty nematode-specific protein motifs were found in the extracellular domains of C. elegans proteins (12). Most of these motifs are present in only one or two structural contexts and may have duplicated quite recently. However, DC, SX, and CT modules are more promiscuous and presumably expanded early within the nematode lineage. The DC module, a 45-residue motif with six conserved cysteines, occurs in more than 60 secreted or membrane proteins representing nine distinct structural contexts (Web figure 1). It is interesting that these proteins contain various ancient modules, i.e., EG, IG, F3, KU, TY, and WA, intermixed with apparently nematode-specific motifs. Although these observations suggest a relatively ancient origin, the DC module is not currently represented in any human gene or EST sequence. By inference, this module, rare or absent in our common nematode-chordate ancestors, expanded greatly in early nematodes. Secreted proteins with SX (SXC) modules include nematode surface coat components and several enzymes possibly involved in cuticle maturation (6). Finally, the CT (cuticulin) module occurs in proteins found at the apical surface of nematode epidermis and mucosa, as well as a transmembrane protein from Drosophila epidermis (39). Expression and phenotype studies suggest a role in epithelial morphogenesis.

New Genes Arise in Specific Regions

Many new genes have arisen within the C. elegans lineage since the metazoan radiation. Some arose through duplication of known genes; others were apparently invented within the nematode lineage itself. We examined gene families and superfamilies of various ages to learn whether new genes arise evenly throughout the genome, and to gain insight into possible mechanisms. The immunoglobulin superfamily, which has remained remarkably static within the nematode lineage, is dispersed throughout the genome as single genes, or rarely, pairs, in regions overall enriched in adaptive, often highly expressed, genes (Fig. 2). The younger superfamilies DC and CT, which we suggest expanded comparatively early within the nematode lineage, have a similar genomic organization. The SX superfamily comprises both dispersed genes, mostly encoding proteins with catalytic domains, and several local gene clusters (discussed below) encoding simpler proteins with SX modules alone (6).

Figure 2

Genomic regions of selected gene families and superfamilies illustrating an inverse relation between gene expression and clustering. Regions from gene families and superfamilies representative of varying ages were compared by using a window of 13 genes centered on the target gene in 5′ to 3′ orientation. Intracellular members of the IG superfamily are not included. Plots for cuticle collagens, as well as chitinases outside the local gene clusters (41), are available in (57) and Web figure 2 (13), with the mean numbers of cDNAs in the current EST databases (± SD) for all gene families and superfamilies. For each family, bars on the upper panel indicate the percentage of genes at relative positions “–6” to “+6” matching the target gene at position “0” by structural family or orientation, respectively. Bars on the lower panel indicate the percentage of genes with one or more cDNAs in the current EST databases (17), or known phenotypic alleles, respectively. The error symbols above each bar indicate the SD for binomial sampling. For IG superfamily, n = 45; for DC superfamily, n = 50; for the C01B7.7 family, n = 61; for the chitinases (clusters), n = 25.
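The windowing and error estimate in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: genes are taken as an ordered list along a chromosome, the SD of a percentage is computed with the standard binomial formula sqrt(p(1 − p)/n), and the count of 30 matching genes is a made-up value. Only n = 45 comes from the caption. The helper names are hypothetical, and reorienting the window relative to the target gene's strand is omitted.

```python
import math


def window(genes, index, half_width=6):
    """Return {relative_position: gene} for positions -6..+6 around a target.

    Positions that fall off either end of the gene list are skipped.
    """
    return {
        offset: genes[index + offset]
        for offset in range(-half_width, half_width + 1)
        if 0 <= index + offset < len(genes)
    }


def binomial_sd_percent(k, n):
    """SD, in percentage points, of an observed proportion k/n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    p = k / n
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Toy chromosome of 16 genes; the target sits at index 7.
toy_genes = list("abcdefghijklmnop")
neighbors = window(toy_genes, 7)

# Hypothetical count: 30 of n = 45 windows have an EST match at a position.
sd = binomial_sd_percent(30, 45)
print(len(neighbors), f"{100 * 30 / 45:.1f}% +/- {sd:.1f}%")
```

With the small sample sizes quoted in the caption (n = 25 to 61), this SD amounts to several percentage points, which is worth keeping in mind when comparing neighboring bars.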

Remarkably, a majority of potentially nematode-specific genes occur in large families, some with over 200 members in C. elegans (1, 39). Several gene families are implicated in structures and processes important to all nematodes, for example, collagenous cuticle. Although these families clearly expanded within the nematode lineage, until other complete genomes are available for comparison, it remains possible that their founding members originated earlier. Indeed, several of the largest families in C. elegans are evidently recent expansions of individual members of more ancient families, for example, chitinase, glutathione S-transferase, nuclear receptor, SCP (TPX), serpentine receptor, and UDP-glucuronyl transferase.

Cuticle collagens form one of the largest, and possibly oldest, nematode-specific gene families (40). Most of these 160 genes are represented in C. elegans EST databases and many have known mutations affecting body morphology. The family is dispersed throughout the genome at 128 sites, with no cluster larger than four genes. The flanking regions are enriched in highly expressed genes and mutant loci (Fig. 2). Why so many genes? Many isoforms are expressed in characteristic order during each molt cycle to create a layered cuticle; others provide stage- or region-specific modifications. Requirements for rapid, synchronous synthesis of large amounts of mRNA may select for increased copy number. Why are the gene products so similar? Requirements of triple-helix formation and polymerization may impose structural constraints on these chains that belie their true age. We suggest that the cuticle collagen gene family, like the DC and CT superfamilies, expanded early in nematodes and is now maintained largely by independent selection on each member.

Most multigene families in C. elegans are strongly clustered within the genome. Selecting highly similar gene pairs, Semple and Wolfe (9) compared the relative spacing and orientation for 2929 duplicated genes representing 655 families in Wormpep release 12. Local gene clusters with mixing and inversion (discussed below), not pure tandem repeats or unlinked duplications, dominate the aggregate distribution of gene families in this large sample. Using dot-matrix and BLAST comparisons, we examined several large gene clusters in detail, finding frequent examples of recent gene duplication or conversion. Two representative gene families, C01B7.7 and M176.8 (chitinase), are summarized in Figs. 2 and 3. Compared with cuticle collagens and various superfamilies, these apparently younger families are less evenly dispersed through the genome, occurring in a few large clusters, often with some isolated members (41). Within a cluster, repeated genes tend to have common orientation and regular spacing, but frequently, this pattern is disrupted by partial gene duplications and inversions. Sequence comparisons indicate these families expanded primarily through some mechanism of local duplication, i.e., adjacent genes were generally more similar than distant pairs, but sequences sometimes move to farther sites or even separate clusters. Often two or more unrelated gene families are intermixed within a single cluster. Examples of very recent duplications or conversion suggest genes but not intergenic regions were moved. Robertson (8) found a very similar pattern of gene expansion and movement, supported by comparisons with Caenorhabditis briggsae, in a study of serpentine receptor families.

Figure 3

Chitinase gene clusters. (A and B) In C. elegans, two local gene clusters on chromosome II, separated by 320 kilobase pairs, contain 25 chitinase genes or pseudogenes of the M176.8 family (red arrows), intermixed with seven members of the kin-15 protein tyrosine kinase family (green arrows). Arrows show gene orientation and extent of the predicted protein coding sequence; gene names and number of reported cDNAs in EST databases are shown below and above these arrows, respectively. The 3′ portions of R09D1.6 and R09D1.8 differ at just one nucleotide among 1497 base pairs, suggesting local sequence movement, possibly gene conversion, is an ongoing evolutionary process within these clusters. (B) The unc-4 egl-43 region, presumptive euchromatin immediately downstream of the chitinase clusters, encodes three acid phosphatases (blue arrows) and other, structurally unrelated genes (black arrows), many of which are highly expressed. (C) C. briggsae orthologs of unc-4, egl-43, and nine other genes from this region, lettered for purpose of comparison, are contained in the genomic contig G04A16::G17K04::C07B16::G47M11. Despite a large inversion, the order, orientation and spacing of genes labeled “c to i” and “o to r” have been conserved between these species (large arrows).
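To put the single difference among 1497 base pairs in perspective, the short calculation below converts it to percent identity. It assumes an ungapped comparison over the full 1497 positions, which the caption implies but does not state. The function name is illustrative.

```python
def percent_identity(matches, aligned_length):
    """Percent identity for an ungapped alignment of the given length."""
    if aligned_length <= 0 or not 0 <= matches <= aligned_length:
        raise ValueError("require 0 <= matches <= aligned_length, length > 0")
    return 100.0 * matches / aligned_length


aligned_length = 1497  # base pairs compared, from the caption
mismatches = 1         # single nucleotide difference, from the caption

identity = percent_identity(aligned_length - mismatches, aligned_length)
print(f"{identity:.2f}% identical")
```

An identity this close to 100% over roughly 1.5 kilobases is what makes very recent duplication or gene conversion the natural interpretation.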

What molecular processes and selective forces shape the evolution of gene families? Contrary to previous belief, random gene duplication followed by independent, divergent evolution of the copies cannot explain the distribution of gene families and superfamilies in C. elegans. Because older gene families have had more chances for duplication and divergence, this model predicts that they should tend to be larger, more divergent in structure and function, and more dispersed in the genome than younger families. Eventually, such processes would produce protein superfamilies sharing only limited regions of homology. Contrary to these predictions, many old superfamilies appear relatively static, whereas large gene families are often young and dynamic.

What mechanisms are responsible for clustering of young gene families? Unequal crossovers and sequence drift could create tandem duplications where adjacent repeats are more similar than distant sequences (42). Occasional duplication or conversion to distant sites might drive concerted evolution of an entire family (43). We favor a role for mRNA intermediates. Gene clusters would expand through integration of cDNAs made from nascent transcripts in the same region. Similar chromosome-associated reactions, or RNA-mediated integration, have been proposed for retrotransposition of non–long terminal repeat (non-LTR) retrotransposable elements, short interspersed nuclear elements (SINEs) and processed pseudogenes in other eukaryotes (44). Frequent precise loss of individual introns during gene duplication (8) could be explained by limited processing of mRNAs before reverse transcription. Intermixing of gene families, inversions, and movements to farther sites might occur if the coupling of transcription and integration sites were relaxed. Indeed, non-LTR retrotransposons can mediate gene movement and exon shuffling in cultured somatic cells (45). Finally, our model, which explains why gene duplications and conversions rarely extend into intergenic regions (9), suggests that transcribed and regulatory sequences generally have independent origins (46).

Is It Heterochromatin?

Eukaryote genomes are generally packaged into euchromatin and heterochromatin where the former regions are enriched in expressed genes and contain most known mutant loci (47). Can we identify bona fide euchromatin in the C. elegans genome sequence? Most members of the immunoglobulin superfamily have a single representative in C. elegans. In many cases, these proteins have been shown to be adaptive (conferring increased fitness) in insects or vertebrates, if not nematodes themselves. By inference, the C. elegans orthologs must be located in transcriptionally active regions of the genome, presumably euchromatin. Inspection of the regions flanking these genes reveals an assortment of structurally unrelated, often unique, genes of comparatively ancient origin (48). Consistent with the notion that these regions represent euchromatin, many of the flanking genes are themselves highly expressed, or else known through mutation to be adaptive (Fig. 2). Extrapolating to similar regions, most cuticle collagen genes are likely contained in euchromatin, and similarly for the DC superfamily, although no mutants have been found in the latter.

The fraction of predicted protein genes on each chromosome with known visible mutations correlates strongly, but negatively, with the fraction of genes in multigene families (9, 49). Inspection of these families reveals that clustered genes, which are found rarely, if at all, among characterized C. elegans mutants, account for this bias (Fig. 2). Moreover, they are highly underrepresented in the EST databases (1, 8, 17); this effect is not absolute as nearly all clusters examined have occasional EST hits (Fig. 3). The simplest explanation for these observations is that most local gene clusters are transcriptionally silent, but these data do not preclude significant levels of gene expression in a few cell types (50), or under unusual conditions, combined with functional redundancy among the gene products. Regardless, expression and selection of genes evolving in clusters must be qualitatively different from “typical” adaptive genes as described above.

Heterochromatin was first described cytologically as regions of late replicating DNA that remain condensed during interphase (47). Genetic studies indicated these same regions were impoverished in adaptive genes and undergo little recombination during meiosis. Early in situ hybridization studies revealed that heterochromatin is often enriched in simple repeated sequences, or “satellite” DNA. These observations led to hypotheses that all heterochromatin might have a common, rather simple, sequence organization, and moreover, specific sequence repeats might themselves direct heterochromatin formation. However, subsequent studies, including recent analyses of long, representative genomic sequences, have shown that heterochromatic regions are highly dynamic and remarkably heterogeneous in sequence. Several classes of transposable elements, including non-LTR retrotransposons and related SINEs, occur preferentially in heterochromatin (51). Moreover, clusters of recently duplicated genes or pseudogenes have been found in pericentromeric and subtelomeric heterochromatin of human chromosomes (52); it is unclear whether these duplicated genes are generally expressed or adaptive. Finally, juxtaposition or insertion into heterochromatin can silence otherwise active genes. In both insects and mammals, local duplication of transgenes or endogenous chromosomal sequences can itself cause heterochromatin formation and gene silencing (53).

Interphase nuclei in C. elegans have numerous regions of condensed heterochromatin, but little is known about their chromosomal arrangement. Does the genome sequence provide clues to chromatin organization at this level? In this species, spindle microtubules tether along the length of the chromosome during mitosis, rather than at a localized kinetochore (54); this distribution of kinetochore function could reflect a dispersal of centromeric heterochromatin along the chromosome. Unexpectedly, local gene clusters have several characteristics better ascribed to heterochromatin than euchromatin. Unlike dispersed gene families and superfamilies, most clustered genes predicted by genomic sequencing potentially fail two important criteria for adaptive genes, namely, expression of RNA products and observable phenotypes. Averaging only 2 to 3 kb in length, these repeated sequences generally form complex mixed arrays at multiple chromosomal sites (Figs. 2 and 3). Nonhomologous exchanges between inverted or unlinked duplications common in local gene clusters are a potential source of chromosome rearrangement. Although no studies have measured recombination rates within gene clusters, the stability of C. elegans chromosomes might be explained by the suppression of all recombination within these regions.

Like heterochromatin, local gene clusters are dynamic with frequent sequence movement through duplication or conversion (8). By contrast, the local arrangement of genes appears relatively stable in other regions of the C. elegans genome (55). Regions of synteny with C. briggsae, which separated 10 to 100 million years ago (6), have genome organization we ascribe to euchromatin, i.e., an assortment of structurally unrelated genes of comparatively ancient origin, many of which are highly expressed, or known through mutation to be adaptive. The unc-4 egl-43 region, shown in Fig. 3, illustrates these features. Whereas this region has undergone large rearrangements, including inversion and translocation, there has been little local movement or duplication of sequences.

Conclusion

The C. elegans genome contains both ancient regions enriched in adaptive genes and more dynamic regions associated with emerging gene families. The expansion and collapse of complex gene clusters could reflect an ancient evolutionary process for the invention of new, potentially adaptive genes. Can concerted evolution speed acquisition and fixation of adaptive alleles or the elimination of useless members? Despite considerable interest, the molecular processes and selective forces underlying concerted evolution remain uncertain (43). The C. briggsae genome sequence, when completed, should help the interpretation of recent genome changes, including mutational mechanisms. We must also learn more about gene expression and function to understand the selective forces on genes evolving in families. Only selection on expressed sequences, whether direct or indirect, allows conservation of genes and exons.

We propose a simple, testable model for gene invention, namely, that heterochromatin is continually expanding through incorporation of cDNAs, creating local gene clusters, tandem protein repeats, and sometimes new exon combinations. At the euchromatin boundary, these sequences must succeed as adaptive genes, or more often, disappear completely. If heterochromatin is the primary site of gene invention today, was it always an organelle of chromosome growth? An attractive hypothesis is that heterochromatin arose during the transition from RNA- to DNA-based life as a mechanism for incorporating cDNA into chromosomes (46). In primitive chromosomes, the chromatin structure and enzymatic activities needed for converting RNA-to-DNA and incorporating the product were concentrated at specialized regions that persist today, near telomeres and centromeres, as heterochromatin (56).

  • * To whom correspondence should be addressed. E-mail: hutter{at}mpimf-heidelberg.mpg.de

  • Present address: Laboratory of Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA.

REFERENCES AND NOTES
