Special Reviews

Zinc Fingers in Caenorhabditis elegans: Finding Families and Probing Pathways

Science  11 Dec 1998:
Vol. 282, Issue 5396, pp. 2018-2022
DOI: 10.1126/science.282.5396.2018

Abstract

More than 3 percent of the protein sequences inferred from the Caenorhabditis elegans genome contain sequence motifs characteristic of zinc-binding structural domains, and of these more than half are believed to be sequence-specific DNA-binding proteins. The distribution of these zinc-binding domains among the genomes of various organisms offers insights into the role of zinc-binding proteins in evolution. In addition, the complete genome sequence of C. elegans provides an opportunity to analyze, and perhaps predict, pathways of transcriptional regulation.

Less than 15 years ago, it was suggested that repeated sequences found in transcription factor IIIA (TFIIIA) of Xenopus might fold into structural domains stabilized by the binding of zinc to conserved cysteine and histidine residues (1–3). Klug and co-workers further noted that “it would not be surprising if the same 30 residue units were found to occur in varying numbers in other related gene control proteins” (1). This proposal proved remarkably prescient: Caenorhabditis elegans, for example, turns out to have more than 100 such proteins, and the number of domains per protein varies from one to perhaps as many as fourteen. Unanticipated at the time, though, was the fact that the zinc-binding motif found in TFIIIA is just one of many small zinc-binding domains, a number of which are involved in gene regulation. The properties of a few of these domains have been summarized recently (4).

Eukaryotes contain a much greater number of proteins with well-characterized zinc-binding motifs than do bacterial and archaeal organisms (Table 1). The complete genome of Caenorhabditis elegans (a metazoan), in conjunction with that of Saccharomyces cerevisiae (a yeast), presents a special opportunity to examine the range and diversity of these gene families in eukaryotes. Furthermore, because some of these zinc-binding motifs occur in sequence-specific DNA-binding proteins, the availability of nearly complete sequence information also permits a preliminary analysis of the distribution of potential binding sites within the entire genome. Such analyses may prove to be of value in deducing developmental control pathways and in more fully defining the characteristics of eukaryotic promoters.

Table 1

Zinc-binding domains were identified with HMMs and the HMMER program package version 1.8.4 (21, 38). Only motifs involved in DNA binding or in protein-protein interactions are considered here; enzymes that use catalytically active zinc sites are ubiquitous but were not examined. The C3H and DM HMMs were constructed from published sequence alignments, with the addition of C. elegans ORF K08B12.2 to the DM alignment (20, 39). All other HMMs were from the Pfam database (38). A threshold of 10 bits was used as the criterion for significance for all database hits reported here. The database of C. elegans ORFs was current as of 10 July 1998 and is available with other supplementary information at www.sciencemag.org/feature/data/985286.shl. The database of S. cerevisiae ORFs (orf_trans.fasta) was obtained from genome-ftp.stanford.edu/pub/yeast/yeast_ORFs. Databases for Escherichia coli (ecoli.faa) and Methanococcus jannaschii (mjan.faa) were obtained from ncbi.nlm.nih.gov/genbank/genomes/bacteria in their respective subdirectories. A general overview of the data sets and analysis is available at www.sciencemag.org/feature/data/c-elegans.shl.

The Cys2His2 Family

The zinc-stabilized domains of TFIIIA are known as “zinc fingers” or Cys2His2 domains. The consensus sequence for this family is (Phe, Tyr)-X-Cys-X2-4-Cys-X3-Phe-X5-Leu-X2-His-X3-5-His (5–7). In both C. elegans and the yeast S. cerevisiae, roughly 0.7% of all proteins contain one or more Cys2His2 zinc finger domains (Table 1). However, the distribution of these domains within proteins is rather different in the two organisms. In yeast, the majority of zinc finger proteins contain exactly two domains, and only a few (∼10%) have more than two. In contrast, there are more zinc finger proteins in C. elegans that have three or more Cys2His2 domains than there are proteins that have exactly two (Fig. 1) (8). On the basis of the sequences of mammalian and Drosophila zinc finger proteins, it appears that the distribution of Cys2His2 domains among C. elegans proteins is typical of multicellular organisms.
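The consensus sequence above can be read as a simple pattern over one-letter amino acid codes. The following Python sketch is only an illustration of that reading: the counts in Table 1 and Fig. 1 were obtained with HMMs, not with a regular expression, and a literal consensus match will miss divergent fingers. The function name and the test sequence are hypothetical.

```python
import re

# (Phe, Tyr)-X-Cys-X2-4-Cys-X3-Phe-X5-Leu-X2-His-X3-5-His written as a regex
# over one-letter amino acid codes; "." stands for any residue X.
CYS2HIS2 = re.compile(r"[FY].C.{2,4}C.{3}F.{5}L.{2}H.{3,5}H")


def count_fingers(protein_seq):
    """Count non-overlapping literal matches to the Cys2His2 consensus."""
    return sum(1 for _ in CYS2HIS2.finditer(protein_seq.upper()))


# Made-up sequence with two finger-like repeats joined by a short linker.
example = "RPYACPVESCDRRFSRSDELTRHIRIHTGQKPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
print(count_fingers(example))
```

A profile HMM scores each position probabilistically and tolerates substitutions at the conserved Phe and Leu positions, which is one reason it is preferred over a strict pattern for family-wide searches.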

Figure 1

Distribution of finger domains among Cys2His2 zinc finger proteins in C. elegans and S. cerevisiae.

The GATA, LIM, and Hormone Receptor Families: Implications for Metazoan Evolution

The GATA domain, the LIM domain, and the DNA-binding domains from nuclear hormone receptors each include a four-cysteine zinc-binding domain that can be clustered into the same structural superfamily, and it is possible that they share a common evolutionary origin (Fig. 2) (9, 10). In addition to the Cys4 superfamily domain, LIM domains contain a similar LIM-specific Cys2HisCys zinc motif, whereas the hormone receptors have a second and distinct Cys4 domain. GATA proteins frequently contain a pair of Cys4 superfamily domains.

Figure 2

Schematic views of the zinc-binding regions from the GATA, LIM, and hormone receptor families.

Normalized to the number of genes in their respective genomes, the number of GATA and LIM domain homologs is similar in C. elegans and S. cerevisiae. In striking contrast, the hormone receptor family is completely absent in yeast but is the largest single family of zinc-binding domains in C. elegans. In fact, with over 200 family members, the hormone receptors make up nearly 1.5% of the entire coding sequence of C. elegans. The differences in the distribution of nuclear hormone receptors in C. elegans and S. cerevisiae may be relevant to the evolution of multicellular animals. As has been noted before, the evolution of hormone receptors may have been a key event in the development of cell-cell communication and the origins of multicellularity in the metazoa (11).

The ligand-binding domains of the hormone receptors have diverged considerably more than the DNA-binding domains. Applying the same criterion for significance to both the DNA- and ligand-binding domains of the hormone receptor family, only about 10% of the open reading frames (ORFs) that have a DNA-binding domain appear to have a ligand-binding domain. However, among genes containing hormone receptor DNA-binding domains, the scores for potential ligand-binding domains are typically higher than those seen in ORFs that do not have the DNA-binding domain. For example, over 40% of the DNA-binding domain ORFs have ligand-binding domain scores that exceed by more than 2 SD the mean score for ORFs that lack the DNA-binding domain. Furthermore, when we used a hidden Markov model (HMM) constructed from some of these top-scoring worm domains, over 90% of the DNA-binding domain ORFs (most of which were not used in constructing the HMM) now had ligand-binding domain scores that exceeded those for unrelated genes by more than 3 SD. We believe, therefore, that most of the hormone receptor homologs in C. elegans do have sequences related to the ligand-binding domain.
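The comparison described above amounts to asking what fraction of one set of scores lies more than k standard deviations above the mean of a background set. A minimal Python sketch of that calculation follows; the function name and the example scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def fraction_above_background(candidate_scores, background_scores, k=2.0):
    """Fraction of candidate scores exceeding background mean + k * SD.

    Uses the sample standard deviation (an assumption; needs at least
    two background scores).
    """
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    threshold = mean + k * sd
    above = sum(1 for score in candidate_scores if score > threshold)
    return above / len(candidate_scores)


# Invented scores, for illustration only.
background = [0.0, 2.0, 4.0]      # ORFs lacking the DNA-binding domain
candidates = [7.0, 5.0, 10.0, 1.0]  # ORFs with the DNA-binding domain
print(fraction_above_background(candidates, background, k=2.0))
```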

Identification of Genes That May Be Regulated by the TRA-1A Zinc Finger Protein

Several of the common zinc-binding motifs function as sequence-specific DNA-binding domains, including the Cys2His2 zinc fingers. With a complete genome sequence in hand, a comprehensive analysis of potential binding sites becomes possible; this, in turn, raises the possibility that certain aspects of transcriptional regulation might be predictable on the basis of genomic sequence analysis. As a test case, we conducted a preliminary analysis of potential TRA-1A–binding sites in the C. elegans genome. TRA-1A, which is a product of the tra-1 gene, was chosen for this analysis because its binding specificity has been well characterized (12) and because it belongs to a subfamily of Cys2His2 proteins that is of exceptional biological interest. TRA-1A is a close homolog of Drosophila cubitus interruptus (segment polarity gene) and of human GLI (oncogene) and GLI3 (cranio-facial development) (13). Furthermore, a crystal structure has been determined for the zinc finger region from GLI bound to a DNA site (14).

In C. elegans, tra-1 activity is necessary for animals to develop into females or hermaphrodites (15–18). As the last in a line of global regulators of sexual development, tra-1 controls the specialized pathways that lead to sex-specific development in different tissues. One way tra-1 activity could lead to female animals is by repressing genes whose expression would otherwise lead to the development of male-specific features. Expression of mab-3, for example, leads to the development of male-specific peripheral sense organs in males, but in females and hermaphrodites tra-1 activity blocks this pathway by reducing mab-3 mRNA levels (19, 20). This reduction in the steady-state level of mab-3 transcripts could be due to direct repression by TRA-1A.

Using DNA sequences that were selected in vitro for tight binding to TRA-1A, the binding-competent gene product of tra-1, we constructed an HMM for TRA-1A–binding sites (12, 21). HMMs provide a probabilistic definition of binding sites on the basis of the nucleotide frequencies observed experimentally at each position and are presumably a more realistic predictor of in vivo binding sites than are simple consensus sequences. We used the TRA-1A HMM to identify about 1300 potential binding sites in the C. elegans genome (22). The distribution of these sites within 5′ extragenic regions differs from random distributions in the existence of five genes that have three or more upstream TRA-1A sites (Fig. 3A). Strikingly, mab-3 is among this very small subset of genes, which supports the idea that mab-3 transcription is repressed directly by the binding of TRA-1A.
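The idea of scoring candidate sites from per-position nucleotide frequencies can be illustrated with a position weight matrix, which is a simplified stand-in for the HMM used here (it has no insert or delete states). The Python sketch below is not the HMMER procedure; the function names, the pseudocount, the uniform background, the threshold, and the four-base training sites are all assumptions made for illustration. It scans only the forward strand and expects uppercase A, C, G, and T.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (bits) from equal-length selected sites."""
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in "ACGT"}
        )
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Invented training sites and target sequence, for illustration only.
selected = ["GACC", "GACC", "GACC", "GACA"]
pwm = build_log_odds(selected)
print(scan("TTGACCTT", pwm, threshold=4.0))
```

Raising or lowering the threshold trades sensitivity against the number of spurious sites, which is why the genome-wide site counts need to be compared with a random control.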

Figure 3

(A) Distribution of potential TRA-1A–binding sites. Of the 1299 TRA-1A sites in C. elegans (22), 561 are in intergenic regions and no more than 4 kb from the first predicted exon in the gene. The number of genes that have 0, 1, … , 5 upstream TRA-1A sites is indicated by the black bars. As a control, 1299 random sites were picked within the genome, and their distribution with respect to the ORFs was determined by the same criteria. This random distribution was generated 100 times, and the mean and standard deviation for the number of genes having a given number of sites were calculated. The stippled bars show the mean random value, and the error bars indicate the standard deviation. Five genes had three or more upstream TRA-1A sites, which is highly significant according to the randomized distributions. The five genes are C03C11.2, Y95B8A_75.a, Y53C12B.5a (mab-3), K10G6.1 (lin-31), and F08F3.9. Gene names and predicted exons are from genome feature files received 23 June 1998 from J. Spieth of the C. elegans Genome Center, Washington University, St. Louis, Missouri. (B) Distribution of potential MAB-3–binding sites based on site selection experiments (25). The random distributions were calculated as described for TRA-1A but with the number of potential MAB-3–binding sites found, 1346. The gene with three upstream sites is F13D11.2. BLAST searches with this ORF indicated high sequence similarity to Hunchback homologs, and a reciprocal search of the C. elegans genome with the Drosophila Hunchback sequence showed that this is the only ORF that is strikingly similar (29).
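The random control described in the legend can be sketched as a small Monte Carlo procedure: place the same number of sites at random, tabulate how many genes receive 0, 1, 2, … sites in their upstream windows, and repeat to obtain a mean and standard deviation for each bin. The Python sketch below makes several simplifying assumptions that are not from the article: the genome is treated as a single linear coordinate, upstream windows are given as half-open (start, end) intervals, strand is ignored, and all names are hypothetical.

```python
import random
import statistics
from collections import Counter


def genes_by_site_count(site_positions, upstream_windows):
    """Map number-of-sites -> number of genes with that many upstream sites."""
    per_gene = [
        sum(1 for p in site_positions if start <= p < end)
        for start, end in upstream_windows
    ]
    return Counter(per_gene)


def random_control(n_sites, genome_length, upstream_windows, n_trials=100, seed=0):
    """Mean and sample SD of genes per site-count bin over random placements."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        positions = [rng.randrange(genome_length) for _ in range(n_sites)]
        trials.append(genes_by_site_count(positions, upstream_windows))
    max_bin = max(max(trial) for trial in trials)
    summary = {}
    for k in range(max_bin + 1):
        values = [trial.get(k, 0) for trial in trials]
        summary[k] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Toy coordinates, for illustration only.
windows = [(0, 10), (40, 60), (100, 110)]
print(genes_by_site_count([5, 6, 50], windows))
print(random_control(n_sites=3, genome_length=200, upstream_windows=windows, n_trials=20))
```

An observed bin count is then judged against the corresponding random mean and standard deviation, as the black and stippled bars are compared in the figure.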

Some of the other genes besides mab-3 that have a large number of potential TRA-1A sites might also be regulated by TRA-1A. The most interesting of these other genes is lin-31 (Fig. 3A) (23). Like mab-3, lin-31 is required for development of sex-specific tissues and is a putative transcription factor. Unlike mab-3, though, lin-31 is required for development of a lineage that is female- and hermaphrodite-specific rather than male-specific (23). Thus, if tra-1 does regulate lin-31 (which remains to be shown experimentally), it might well be expected to activate transcription rather than repress it. The remaining three genes identified in the upstream binding site analysis are not related to sexual development in any obvious way. However, one is a TATA-binding protein associated factor (TAF) and another is homologous to a protein with antiproliferative activity (24). Whether expression of these genes is affected by tra-1 is unknown at present.

The mab-3 gene product is itself a putative transcription factor containing a novel zinc-binding motif (20). Because our analysis of the C. elegans genome “predicted” the regulation of mab-3 by TRA-1A, we extended this analysis by attempting to predict genes that might be regulated in turn by mab-3. Data from unpublished binding site selection experiments (25) were used to construct an HMM that was then used to search the C. elegans genome with a cutoff score chosen to yield a number of sites similar to that found in the TRA-1A analysis. As shown in Fig. 3B, the distribution of these binding sites upstream of C. elegans genes indicates that fewer genes have significantly large numbers of sites than was the case with TRA-1A. Nevertheless, there is an excess of genes with two sites over the number expected from a random distribution, and there is one gene with three upstream sites. Intriguingly, the gene with three upstream sites is the C. elegans ortholog of hunchback, a Drosophila gene that encodes a Cys2His2 zinc finger transcription factor important in development.

A few genomic sequences to which Drosophila Hunchback binds in vitro have been identified (26, 27). On the basis of these data, we could perhaps extend our predictions for this regulatory pathway one more step by assuming that C. elegans Hunchback recognizes the same binding sites as Drosophila Hunchback. However, our experience with MAB-3 and its close Drosophila homolog DSX indicates that such predictions should await experimental determination of binding site specificity for the C. elegans protein. The in vivo binding specificity of Drosophila DSXM (the male-specific product of the doublesex gene) must be fairly similar to that of MAB-3, because ectopically expressed DSXM can functionally replace mab-3 to some extent (20). Furthermore, there are sequences to which both proteins will bind in vitro with reasonable affinity (25). Nevertheless, the distribution of binding sites obtained by in vitro selection experiments is quite different for the two homologs (25, 28), and use of Drosophila DSX binding site data (instead of the MAB-3 data) gives a distribution of predicted binding sites in the C. elegans genome that is not substantially different from a random distribution (29).

Potential Autoregulation by the GATA Homolog ELT-1

As a final example of how binding site distributions can be used to assess regulatory issues in a complete genome, we considered the C. elegans GATA family member elt-1 (30). Spieth et al. suggested previously that the elt-1 gene may be autoregulated because there are multiple matches to a consensus GATA-binding site [(A/T)GATA(G/A)] within a few hundred base pairs upstream of its initiation codon (30). However, because there are more than 200,000 matches to the consensus GATA site in the C. elegans genome, the question arises as to whether the number of GATA sites upstream of elt-1 is unusually large. In an analysis similar to that described above for the tra-1 and mab-3 gene products, we searched the C. elegans genome for matches to the canonical GATA recognition site and determined the number of these sites within 500 base pairs of the first predicted exon for each gene. This distribution was then compared with a set of random distributions. As shown in Fig. 4, the number of sites associated with elt-1 does, in fact, place it among a set of 25 genes that have an unusually large number of such sequences.
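Counting matches to the consensus (A/T)GATA(G/A) in an upstream window is straightforward to sketch. The Python example below is an illustration under stated assumptions rather than the procedure used for Fig. 4: it counts overlapping matches, it counts both strands (the text does not say whether both were searched), it assumes a forward-strand gene whose first exon starts at a 0-based coordinate, and the function names and test sequences are hypothetical.

```python
import re

# Lookaheads make the matches zero-width, so overlapping sites are all counted.
FORWARD = re.compile(r"(?=[AT]GATA[GA])")   # (A/T)GATA(G/A) on the given strand
REVERSE = re.compile(r"(?=[CT]TATC[AT])")   # its reverse complement


def upstream_window(chrom_seq, first_exon_start, width=500):
    """Return up to `width` bases immediately 5' of a forward-strand first exon."""
    begin = max(0, first_exon_start - width)
    return chrom_seq[begin:first_exon_start]


def count_gata_sites(seq):
    """Count consensus GATA sites on both strands of `seq`."""
    seq = seq.upper()
    return len(FORWARD.findall(seq)) + len(REVERSE.findall(seq))


# Invented sequence: one forward-strand site and one reverse-strand site.
print(count_gata_sites("AGATAGCCCTATCT"))
```

Because a six-base consensus occurs so often by chance, the per-gene counts are only informative when compared with counts from randomly placed sites, as in the analysis described above.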

Figure 4

Distribution of ELT-1 (GATA)–binding sites (black bars). These data were obtained in a manner analogous to that described for TRA-1A (22) (Fig. 3) except that only sites within 500 bp 5′ of the first exon were considered. Genes with between one and six GATA sites have been grouped together because the number of genes with N GATA sites does not become significantly larger than the number of genes with N randomly distributed sites until N ≥ 7. The mean values from 15 random distributions are indicated by the stippled bars; error bars indicate the standard deviations.

The Challenges of Regulatory Pathway Prediction

Counting the number of upstream sites that exceed some threshold for similarity to a binding sequence is a rather simple-minded approach to predicting transcriptional regulation and one that will undoubtedly lead to some incorrect predictions. Among the complicating factors that are not captured by simple enumeration of binding sites are the spacing and orientation of binding sites, cooperative interactions among different proteins, and competition for binding by proteins with similar or overlapping DNA-binding sites. Despite these caveats, much of what we have inferred about the regulation of most genes is based precisely on this kind of simple identification of upstream sequence elements. At the very least, the existence of complete genomic information provides, for the first time, the means to evaluate the statistical significance of such sites without having to make assumptions about the composition and gene distribution of the genome. Furthermore, the example of mab-3 regulation by tra-1 offers some encouragement that even this simple approach to the problem could prove fruitful in some cases.

As important as the prospects for prediction is the use of the genome sequence in understanding the complexities of transcription initiation control and in interpreting genome-wide transcription studies (31). If we are to really understand how the transcriptional regulation of nearly 20,000 genes is coordinated in C. elegans, as opposed to simply cataloging genes and the proteins that affect their expression, then computational analysis of the genome will be an indispensable adjunct to experimental studies.

The modular nature of Cys2His2 zinc finger proteins and the relatively simple way in which some members of the family bind DNA had previously led to the idea that simple rules might be found for predicting the sequence specificity of zinc finger proteins (32). Indeed, a few rules have been developed and have proven useful in designing proteins to recognize particular DNA sequences (32, 33). However, natural zinc finger proteins are too diverse in terms of both their presumptive DNA-contact residues and the length and sequence of the linkers that connect the fingers for these rules to be usefully applied in a general way in the prediction of specificity.

The probing of regulatory pathways will clearly require careful experimental determination of binding site preferences for all classes of DNA-binding proteins. However, with the acquisition of more (and better) binding data and with the availability of high-throughput technologies to measure transcript levels of essentially all the genes in an organism, the computational analysis of transcriptional regulation is sure to progress rapidly.

Conclusions

Zinc-binding units such as the Cys2His2 zinc finger domains are present in a large number of gene products, representing some of the largest protein families in the C. elegans genome. Although bacteria and archaea do contain some proteins that bind zinc, they appear to lack the large families of zinc-binding domains like those found in yeast, worms, and other eukaryotes. This suggests that these zinc-binding domains may not be truly ancient units but instead evolved later as genome size and cell sophistication increased. Of particular importance may have been the evolution of efficient mechanisms for zinc homeostasis. Yeast and other eukaryotes have recently been shown to contain proteins for importing and exporting zinc as well as other potential components of such a system (34, 35). If bacteria and archaea did not evolve systems for zinc homeostasis, then the use of zinc-dependent proteins for gene regulation in these organisms may have been disadvantageous.

Comparison of the two available eukaryotic genomes reveals some striking differences. Although several families, such as the Cys2His2 zinc finger, RING finger, and nucleocapsid domains, are of comparable size, particularly when normalized for genome size, other families show extremely skewed distributions. As noted above, the hormone receptor superfamily is the largest single family of zinc-binding domains found in C. elegans, yet these proteins are not found in yeast. Another family, the zinc cluster proteins, typified by GAL4, is the largest family in yeast, yet only one putative family member (not authenticated) is encoded by the C. elegans genome.

Because some of the zinc-binding domains function by sequence-specific interactions with DNA, the completed genome has facilitated preliminary attempts to identify potential gene regulatory pathways in silico. Similar methods could be applied to other DNA-binding proteins of known binding specificity. Further development of such analysis procedures may provide important insights into the myriad gene regulatory pathways that are necessary for the development and growth of multicellular organisms.

REFERENCES AND NOTES