Research Article

Arabidopsis Transcription Factors: Genome-Wide Comparative Analysis Among Eukaryotes


Science  15 Dec 2000:
Vol. 290, Issue 5499, pp. 2105-2110
DOI: 10.1126/science.290.5499.2105


The completion of the Arabidopsis thaliana genome sequence allows a comparative analysis of transcriptional regulators across the three eukaryotic kingdoms. Arabidopsis dedicates over 5% of its genome to code for more than 1500 transcription factors, about 45% of which are from families specific to plants. Arabidopsis transcription factors that belong to families common to all eukaryotes do not share significant similarity with those of the other kingdoms beyond the conserved DNA binding domains, many of which have been arranged in combinations specific to each lineage. The genome-wide comparison reveals the evolutionary generation of diversity in the regulation of transcription.

Regulation of gene expression at the level of transcription influences or controls many of the biological processes in a cell or organism, such as progression through the cell cycle, metabolic and physiological balance, and responses to the environment. Development is based on the cellular capacity for differential gene expression and is often controlled by transcription factors acting as switches of regulatory cascades (1). In addition, alterations in the expression of genes coding for transcriptional regulators are emerging as a major source of the diversity and change that underlie evolution (2).

With the completion of the Arabidopsis thaliana genome sequence, the entire complement of genes coding for transcription factors from a plant can be identified and described. Together with the three other eukaryotic genomes that have already been sequenced, it also allows investigation of the similarities and differences in transcriptional regulators among the three eukaryotic kingdoms: plants, animals (Caenorhabditis elegans and Drosophila melanogaster) (3, 4), and fungi (Saccharomyces cerevisiae) (5). We present such a description and analysis here.

Gene Content and Organization

To characterize the entire complement of transcription factors encoded by the genomes of Arabidopsis, Drosophila, C. elegans, and S. cerevisiae, we used a comprehensive list of proteins, domains, and motifs to query the corresponding sequence databases. Transcription factors are usually defined as proteins that show sequence-specific DNA binding and are capable of activating and/or repressing transcription. Although most of the proteins and protein families that were considered in our study fit these criteria, we have also included some other types of transcriptional regulators. Most known transcription factors can be grouped into families according to their DNA binding domain (6). Protein domains that are sometimes present in transcription factors, but not necessarily associated with them, have not been included in this genome survey, for example, some zinc coordinating motifs that either are involved in protein-protein interactions or have not yet been functionally characterized.

We searched the Drosophila, C. elegans, and yeast encoded protein complements (proteomes) using BLAST and motif-finding programs (7). Because the complete predicted proteome of Arabidopsis was not available at the time of the analysis, we used the entire set of genomic sequences (7).

The Arabidopsis genome codes for at least 1533 transcriptional regulators, which account for ∼5.9% of its estimated total number of genes (Table 1). We identified 635, 669, and 209 transcriptional regulators in the proteomes of Drosophila, C. elegans, and yeast, respectively (4.5, 3.5, and 3.5%). Thus, the Arabidopsis content of transcription factors is 1.3 times that of Drosophila and 1.7 times that of C. elegans and yeast. These results represent an underestimate of the total number of transcription factors in these organisms. Approximately 40 to 50% of the proteins encoded by each of those genomes cannot be assigned to functional categories on the basis of sequence similarity to proteins of known function (3, 8–11). Some of those uncharacterized proteins are expected to be transcriptional regulators (12, 13). The large number and diversity of transcription factors in Drosophila were proposed to be related to its substantial regulatory complexity (4). Applying the same logic to Arabidopsis suggests that the regulation of transcription in plants is as complex as that in Drosophila. In contrast to Drosophila and C. elegans, for which a sizable (>25%) fraction of their known transcription factors have been characterized genetically (14), only ∼5% of those from Arabidopsis have been defined by mutation analysis (15).
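The fold differences quoted above follow directly from the reported genome fractions. A minimal Python sketch of that arithmetic is given here for illustration only; the function names are hypothetical, and the gene totals it back-calculates from the counts and percentages are rough approximations, not values taken from Table 1.

```python
# Counts of transcriptional regulators and their share of all genes (%),
# exactly as quoted in the text above.
reported = {
    "Arabidopsis": (1533, 5.9),
    "Drosophila": (635, 4.5),
    "C. elegans": (669, 3.5),
    "S. cerevisiae": (209, 3.5),
}


def fold_difference(reference, other):
    """Ratio of the genome fractions devoted to transcriptional regulators."""
    return reported[reference][1] / reported[other][1]


def implied_gene_total(organism):
    """Approximate gene total implied by count / fraction (an illustration,
    not a figure from the paper)."""
    count, percent = reported[organism]
    return round(count / (percent / 100))


if __name__ == "__main__":
    for other in ("Drosophila", "C. elegans", "S. cerevisiae"):
        ratio = fold_difference("Arabidopsis", other)
        print(f"Arabidopsis vs {other}: {ratio:.1f}x")
    print("Implied Arabidopsis gene total:", implied_gene_total("Arabidopsis"))
```

The comparison uses genome fractions rather than raw counts, which is why Arabidopsis comes out at 1.3 times Drosophila even though its raw count of regulators is more than twice as large.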

Table 1

Content of transcriptional regulator genes in eukaryotic genomes. The number of genes in each of the eukaryotic genomes is given as an approximate number. This is because the number of genes predicted at the time that a genome is sequenced is always an estimate that is refined over time (7).


Arabidopsis contains many tandem gene duplications and large-scale duplications on different chromosomes, which might account for >60% of the genome (9, 10, 16). Whereas some of these duplications have been followed by rearrangements and divergent evolution, up to 40% of the Arabidopsis genes might comprise pairs of highly related sequences (16). In that respect, Arabidopsis is similar to the three other eukaryotic organisms. The S. cerevisiae genome is the result of a complete ancient genome duplication that was followed by extensive gene rearrangements and deletions (17). In yeast, ∼30% of the genes form duplicate gene pairs. Similarly, duplicated genes account for ∼48 and ∼40% of the total gene content of C. elegans and Drosophila, respectively (11).

All of the Arabidopsis transcription factor gene families are scattered throughout the genome. On average, closely related genes account for ∼45% of the total number in the major families (Table 2) (18). Gene duplications on different chromosomes are most common (∼65%), but duplicated genes are also frequently found at large distances in the same chromosome (∼22%) as well as organized in tandem repeats (∼13%) (19). Clusters of three or more highly related genes are very rare (Table 2).

Table 2

Gene duplications in Arabidopsis transcription factor families. The major families of Arabidopsis transcription factors were analyzed for the presence of pairs or groups of highly related genes (18). The families analyzed together comprise over 1000 genes. Tandem duplications are arbitrarily defined as those that occur within a sequence distance of 50 kb. If two genes are duplicated in the same chromosome but reside >50 kb apart from each other, they are counted in the “Duplications in the same chromosome” column. (Zn) indicates a zinc coordinating DNA binding motif.
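The three duplication categories used in Table 2 amount to a simple decision rule. The Python sketch below illustrates that rule under stated assumptions: each gene is reduced to a chromosome identifier and a single coordinate in base pairs, and the function names and example pairs are hypothetical rather than taken from the analysis itself.

```python
from collections import Counter

# 50 kb threshold for tandem duplications, as defined in the Table 2 caption.
TANDEM_MAX_DISTANCE_BP = 50_000


def classify_duplication(chrom_a, pos_a, chrom_b, pos_b):
    """Assign a pair of highly related genes to one of the three categories.

    Positions are single coordinates in base pairs (an assumption made for
    this sketch; the paper does not say how distance was measured).
    """
    if chrom_a != chrom_b:
        return "different chromosomes"
    if abs(pos_a - pos_b) <= TANDEM_MAX_DISTANCE_BP:
        return "tandem"
    return "same chromosome"


def tally(pairs):
    """Count how many gene pairs fall into each category."""
    return Counter(classify_duplication(*pair) for pair in pairs)


# Made-up gene pairs: (chromosome, position, chromosome, position).
example_pairs = [
    (1, 1_000_000, 1, 1_030_000),  # 30 kb apart on chromosome 1
    (2, 500_000, 2, 4_500_000),    # 4 Mb apart on chromosome 2
    (3, 200_000, 5, 900_000),      # chromosomes 3 and 5
]

if __name__ == "__main__":
    for category, count in tally(example_pairs).items():
        print(category, count)
```

Because the 50 kb cutoff is described as arbitrary, keeping it in a single named constant makes it easy to see how the tandem and same-chromosome counts would shift under a different threshold.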


Transcription Factors Across the Eukaryotic Kingdoms

Two features stand out when comparing the Arabidopsis complement of transcriptional regulators with that of the other organisms (Table 3). First, <22% of the Arabidopsis transcription factors are zinc-coordinating proteins [belonging to several different families that are thought to have evolved independently (20)]. In contrast, zinc-coordinating proteins constitute most of the transcription factors in the three other eukaryotes: ∼51% in Drosophila, ∼64% in C. elegans, and 56% in yeast. Second, in Arabidopsis, there is no single family of transcription factors that has been so disproportionately amplified as the nuclear hormone receptors in C. elegans (∼38% of its transcription factors), the C2H2 zinc finger proteins in Drosophila (∼46%), or the C6 and C2H2 families in yeast (∼25% each). The three largest families of transcription factors in Arabidopsis, AP2/EREBP (APETALA2/ethylene responsive element binding protein), MYB-(R1)R2R3, and bHLH (basic helix-loop-helix), each represent only ∼9% of the total, and there are several other families with comparable numbers of genes.

Table 3

Eukaryotic transcriptional regulators. Number of transcriptional regulators in Arabidopsis (A.t.), Drosophila (D.m.), C. elegans (C.e.), and S. cerevisiae (S.c.), classified by families on the basis of sequence similarity. The table is nonredundant: proteins are counted only once, regardless of whether they have more than one signature motif. The way in which proteins that combine different DNA binding motifs were organized into families is reflected in Fig. 1. Families that are specific to one lineage are indicated in color. Families are listed under “Transcription factors” or “Other transcriptional regulators,” as described in the text. However, this distinction is not without problems (for example, the ARID and HMG-box families). Information about the signature motif(s) or sequences that define each family is provided as an InterPro (IPR) or GenBank accession number (56). (Zn) indicates a zinc coordinating DNA binding motif. In the bHLH class, only proteins with a discernible basic region were included. “Other” includes some single-copy genes and small families that are not individually mentioned in the text. The results of the database searches (P, motif searches; B, BLAST) and sequence comparisons were inspected by eye. The numbers reported here might therefore differ from other large-scale classifications that are performed automatically (11).


Each eukaryotic lineage has its own set of particular transcription factor families and genes [comparing such a small number of genomes represents a limitation for this type of analysis (21)] (Table 3). The lineage-specific families are of interest from an evolutionary point of view. According to molecular phylogenetic analyses, plants, animals, and fungi all diverged from a common ancestor during a short period of time, ∼1.5 billion years ago (15). Thus, it would be expected that most of the transcription factor families would either be shared by the three lineages, if they were present in the common ancestor, or specific to each lineage, if they arose independently following divergence. This is indeed the case (Table 3). Members of lineage-specific families represent 45% of the Arabidopsis transcription factors, 47% in C. elegans, and 32% in yeast (but only 14% in Drosophila, because of its extensive use of the C2H2 zinc finger proteins). Families that are present in all four organisms account for most of the remaining transcription factors in each case.

There are, however, a few exceptions to this expected pattern: some genes and gene families are present in two of the three lineages. Transcription factors and transcription factor families that are present in Drosophila, C. elegans, and yeast (but are absent from Arabidopsis) include the SOX/TCF (SRY-related HMG box/T cell factor) group, the fork head–type/winged-helix proteins, and homologs of the human transcription factor RFX1 (Table 3). The SOX/TCF group, which includes developmental regulators like human SRY (sex-determining region Y) and TCF and the yeast hypoxic-gene regulator ROX1, forms part of the HMG-box (high-mobility group) superfamily of proteins (22). In contrast to other HMG-box proteins that act as architectural components of chromatin and have no sequence specificity on their own, the SOX/TCF factors show sequence-specific DNA binding and transactivation activities. There are 14 genes in the Arabidopsis genome encoding HMG box–containing proteins, but phylogenetic analyses indicate that none of these proteins belong to the SOX/TCF group (15).

In contrast to the examples described above, there does not appear to be any case of transcriptional regulators that are present in both yeast and Arabidopsis but absent from animals. This distribution of genes and gene families in the three eukaryotic lineages is in agreement with the notion that animals and fungi are more closely related to each other than to plants (23). There are at least three classes of transcription factors that are present in plants and animals but absent from yeast: TUBBY-like (TUB), CPP-like (cysteine-rich polycomb-like protein), and E2F/DP proteins (13, 24, 25) (Table 3). It remains to be determined whether these classes of genes were specifically lost from the S. cerevisiae genome or if they are really absent from the fungal lineage.

There are many transcription factor families that are found only in plants, some of which have been greatly amplified. These include the AP2/EREBP (26), NAC (27), and WRKY families (28); the trihelix DNA binding proteins (29); the auxin response factors (ARFs); the Aux/IAA proteins [which do not bind to DNA by themselves, but interact with the ARF proteins (30)]; and other smaller families (Table 3). Similarly, animals and yeast have many families of transcription factors that are not found in plants (Table 3).
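The distribution patterns described in this section (shared by all lineages, specific to one, or present in only two) can be expressed as a presence/absence classification. The Python sketch below is an illustration only: the function name and the output labels are hypothetical, the presence sets cover just a few families named in the text, and "absent" means absent from the four sampled genomes, which, as noted above for yeast, does not prove absence from a whole lineage.

```python
LINEAGES = frozenset({"plants", "animals", "fungi"})

# Presence of selected families, restricted to what the text states.
presence = {
    "AP2/EREBP": {"plants"},
    "NAC": {"plants"},
    "WRKY": {"plants"},
    "SOX/TCF": {"animals", "fungi"},
    "E2F/DP": {"plants", "animals"},
    "TUBBY-like": {"plants", "animals"},
    # The MYB domain is described elsewhere in the text as found in all eukaryotes.
    "MYB": {"plants", "animals", "fungi"},
}


def classify_family(lineages_present):
    """Label a family by the set of lineages in which it is found."""
    present = frozenset(lineages_present)
    if not present or not present <= LINEAGES:
        raise ValueError(f"unexpected lineage set: {sorted(present)}")
    if present == LINEAGES:
        return "shared by all three lineages"
    if len(present) == 1:
        return f"specific to {next(iter(present))}"
    return "present in " + " and ".join(sorted(present)) + " only"


if __name__ == "__main__":
    for family, lineages in presence.items():
        print(f"{family}: {classify_family(lineages)}")
```

With the families listed here, no entry falls into a "plants and fungi only" class, which mirrors the observation in the text that no regulators are shared by yeast and Arabidopsis but absent from animals.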

A lingering question when considering protein families that appear to be exclusive to one lineage is whether their signature domains are true evolutionary innovations or whether their relationships with other proteins have been blurred because their amino acid sequences (but not their three-dimensional structures) have diverged substantially over time. Some of the plant-specific families of transcriptional regulators are characterized by domains that appear to be genuine novelties. For example, the AP2 domain exhibits a new mode of DNA recognition by a β-sheet structure (31). Other transcription factors classified as specific to plants, however, might be related to proteins found in other organisms. The plant-specific GRAS proteins might be distant relatives of the animal-specific STATs, based on a similar arrangement of related functional domains (32). The trihelix DNA-binding domain, present only in plants, might have evolved from the MYB domain, found in all eukaryotes (29).

The two transcription factor families that have been more substantially amplified in Arabidopsis, as compared to animals and yeast, are the MYB and the MADS families. The MYB motif consists of a helix-turn-helix structure with three regularly spaced Trp residues. In Arabidopsis, almost all of the MYB proteins belong to the MYB-R2R3 class (131 members): they contain two imperfect repeats of the MYB motif (33). MYB-R1R2R3 proteins, which are the norm in animals, are rare in Arabidopsis (five proteins). The plant-specific R2R3 organization is thought to have evolved from an R1R2R3-type ancestral gene from which the first repeat was lost (34). Because the plant MYB-R1R2R3 proteins are more closely related to the animal MYB proteins than to the plant proteins of the R2R3 type, it has been suggested that they might have functions related to those of the MYB proteins in animals, such as the control of cell proliferation (34, 35). Conversely, MYB-R2R3 proteins might have evolved to regulate processes specific to plants, including secondary metabolism, responses to plant hormones, and the identity of specific cell types.

In addition to the MYB-(R1)R2R3 proteins, Arabidopsis contains additional transcription factors characterized by a more divergent MYB domain, which is present either as a single copy or as a repeat. These proteins form a heterogeneous group and are often referred to as “MYB related.” For the purpose of clarity, we have divided the Arabidopsis MYB-related proteins into several subclasses in Fig. 1 (15).

Figure 1

Relationships and domain shuffling among the different Arabidopsis transcription factor families. Gene families are represented by circles, whose size is proportional to the number of members in the family. Domains that have been shuffled and that therefore “connect” different groups of transcription factors are indicated with rectangles, whose size is proportional to the length of the domain. DNA binding domains are colored; other domains (usually protein-protein interaction domains) are shown with hatched patterns. Dashed lines indicate that a given domain is a characteristic of the family or subfamily to which it is connected. Gene names are written in italics. Whereas many of the indicated domain-shuffling events are specific to plants, others likely predate the appearance of the three distinct eukaryotic lineages. For an expanded version of this figure and the information that was used to construct it, see supplemental material (15).

More distant but also related to the MYB superfamily is a previously unidentified group of proteins that we propose to name “GARP,” after maize GOLDEN2, the ARR B-class proteins from Arabidopsis, and Chlamydomonas Psr1 (36–39) (Fig. 1). These proteins appear to be involved in plant-specific processes: GOLDEN2 controls the differentiation of a photosynthetic cell type of the maize leaf, whereas Psr1 is a regulator of phosphorus metabolism.

Arabidopsis also contains many more heat shock transcription factors (HSFs) than does Drosophila, C. elegans, or yeast. Plant HSFs exhibit structural and functional characteristics specific to that lineage (40, 41).

For those transcription factor families that are common to all eukaryotes, how similar are the Arabidopsis proteins to those from the other organisms? Each Arabidopsis transcription factor was compared to the proteomes of Drosophila, C. elegans, and yeast by using the BLASTX and BLASTP programs. The analysis revealed that Arabidopsis transcription factors do not share significant similarity with those from the other lineages, except in the conserved DNA binding domains that define the respective families. The only Arabidopsis proteins that showed similarity beyond the threshold of significance established in the comparison (42) were some homologs of the HAP3 subunit of the CCAAT-box binding factor and a MYB-related protein known to be homologous to the S. cerevisiae CEF1 and S. pombe Cdc5 proteins (43, 44).

Domain Shuffling

The modular nature of transcription factors and the importance of domain shuffling in protein evolution are both well established. The characterization of the entire complement of Arabidopsis transcription factors allows consideration of the extent of domain accretion, shuffling, and divergence in these proteins and reveals the relationships among the different families at a genome-wide scale (Fig. 1).

Shuffling of some of the DNA binding domains that are present in all eukaryotes has generated novel transcription factors with plant-specific combinations of modules. This is well illustrated by the homeodomain proteins. In ∼50% of the members of the Arabidopsis homeobox family, the homeodomain is followed by a leucine zipper (Fig. 1). This combination of motifs is not observed in the yeast or animal homeodomain proteins. The only Arabidopsis homeodomain proteins that have an additional motif also found in animal homeodomain proteins are those of the KNOX class, which contain a MEINOX domain (Fig. 1) (45). On the other hand, homeodomains in animals are associated with a large variety of motifs, such as the paired and POU-specific domains, the LIM motif, or C2H2 zinc fingers, in combinations that are not present in Arabidopsis. Some of these domains (paired and POU) are specific to animals.

Other examples of plant-specific arrangements of common domains include the MADS, YABBY, and ARID families. The ARID (for AT-rich interaction domain) motif is found in animals in a variety of developmental and cell-cycle regulators, like the Drosophila Dead ringer and Osa proteins (46). In animal ARID proteins, that domain is combined with other motifs, like PHD fingers or the jumonji domain (47). In the Arabidopsis ARID proteins, the ARID domain is associated with an HMG box, whereas PHD fingers and the jumonji domain form other combinations (Fig. 1). Some animal ARID proteins, like Bright, exhibit sequence-specific DNA binding, whereas others, like Osa, do not. Osa, however, modulates the activity of the SWI/SNF Brahma complex to promote the activation of specific target genes (46).

MADS domain proteins in plants were first identified as regulators of floral organ identity and have since been found to control additional developmental processes, such as meristem identity, root development, fruit dehiscence, and flowering time (48, 49). A characteristic of the plant MADS domain proteins that sets them apart from their animal and fungal counterparts is a modular organization containing a distinct coiled-coil domain (K box). The Arabidopsis genome sequence, however, has revealed that there is an additional class of plant MADS domain proteins in which the K box is absent (50). Phylogenetic analyses indicate that a gene duplication event, ancestral to the divergence of plants and animals, generated two MADS-box gene lineages that are now present in all eukaryotes. In plants, one lineage resulted in MADS proteins with a K box, whereas the other resulted in proteins that lack it (50). This conclusion, which was based on sequence phylogeny, is also supported by the structure of the genes. K box–containing MADS-box genes have multiple exons, the MADS box being completely encompassed in one of them. However, analysis of the Arabidopsis genomic sequence indicates that MADS-box genes lacking a K box have a simpler structure, with fewer or no introns. Drosophila and C. elegans each have two MADS-box genes, one per lineage. In Arabidopsis, in which at least 82 MADS-box genes can be identified, both classes have been substantially amplified (Fig. 1).

It has been proposed that the complexity in protein domain organization increases with the complexity of the organism (11). The above examples of domain shuffling and accretion suggest that, at least among transcription factors, plants are as complex as animals in this respect.

Together with the lineage-specific generation of novel classes of transcription factors or the specific amplification and divergence in one lineage of a common type of regulator, development of novel functions might also result from the organization of transcription factors in novel networks of protein-protein interactions, perhaps as a consequence of domain-shuffling events. For example, the animal-specific Smad proteins depend on interactions with other transcription factors to compensate for their relatively low DNA binding sequence specificity (51). These factors include the vertebrate winged-helix protein Fast-1 (winged-helix proteins are found in animals and in fungi) and the Xenopus homeodomain proteins Mixer and Milk. The Smad–Mixer/Milk interaction has been proposed to mediate mesoendodermal induction (52). All of these Smad-interacting proteins of different classes (Fast-1, and Mixer and Milk) share a short Smad-interaction motif (52) that appears to be specific to vertebrates: it is not found in Drosophila, C. elegans, Arabidopsis, or yeast proteins. More examples of this kind will be uncovered as the networks of protein-protein interactions among transcription factors are deciphered.

Functional Diversity

The differences in transcription factor content, sequence, and structure among the three eukaryotic lineages are also accompanied by functional diversity. Equivalent or similar biological functions can be controlled by different families of transcription factors in each lineage. Conversely, DNA binding domains that are found in all three eukaryotic kingdoms often control different functions in each one. Developmental regulators illustrate this point. There are also cases, however, in which the involvement of a gene or family in a particular biological function has been maintained across the three lineages (for example, the HSF family).

Pattern formation is an obligate requirement in the development of complex multicellular organisms. In animals, determination of regional identity and specification of the body plan are achieved through the localized activities of homeodomain proteins. Similar functions in plants, meristem patterning and floral organ identity determination, rely on the domain-specific expression of a subset of MADS-box genes (48, 49). Therefore, two different transcription factor families have been used for similar developmental functions in the two lineages.

Patterning depends on a system of axes. The dorsoventral polarity of Drosophila has been likened to the dorsoventral asymmetry of zygomorphic flowers and could also be conceptualized as being similar to the adaxial-abaxial polarity of the plant lateral organs. In all of these cases, polarity is established through the regionally localized expression or accumulation of transcription factors, but those belong to different classes. Floral asymmetry in Antirrhinum is dependent on the activities of CYC and DICH, two members of the plant-specific family of transcription factors TCP (53, 54). Transcription factors of another plant-specific family, YABBY, are involved in establishing the adaxial-abaxial polarity of the plant lateral organs, together with other genes like PHAN, a MYB-related protein (55). In Drosophila, embryonic dorsoventral polarity is established through a gradient of Dorsal, a transcription factor of the NF-κB/Rel/Dorsal group (NF-κB, nuclear factor κB). NF-κB/Rel/Dorsal proteins are found in Drosophila and mammals but not in C. elegans, yeast, or plants.


Each eukaryotic lineage has invented a sizable fraction of its own transcriptional regulators. DNA binding domains that are conserved in sequence and structure have been rearranged in different ways to create novel proteins. The degree of domain shuffling among transcription factors is large. In many instances, families that are common to the three kingdoms have been used for different or novel processes in each of the lineages. The picture that emerges from the comparison of the entire complement of transcription factors of Arabidopsis, Drosophila, C. elegans, and S. cerevisiae is one of diversity.

  • * To whom correspondence should be addressed. E-mail: jriechmann{at}

