Genome-Wide Insertional Mutagenesis of Arabidopsis thaliana

Science  01 Aug 2003:
Vol. 301, Issue 5633, pp. 653-657
DOI: 10.1126/science.1086391

This article has a correction.

Over 225,000 independent Agrobacterium transferred DNA (T-DNA) insertion events in the genome of the reference plant Arabidopsis thaliana have been created that represent near saturation of the gene space. The precise locations were determined for more than 88,000 T-DNA insertions, which resulted in the identification of mutations in more than 21,700 of the ∼29,454 predicted Arabidopsis genes. Genome-wide analysis of the distribution of integration events revealed the existence of a large integration site bias at both the chromosome and gene levels. Insertion mutations were identified in genes that are regulated in response to the plant hormone ethylene.

One of the most significant findings revealed through analysis of genomes of multicellular organisms is the large number of genes for which no function is known or can be predicted (1). An essential tool for the functional analysis of these completely sequenced genomes is the ability to create loss-of-function mutations for all of the genes. Thus far, the creation of gene-indexed loss-of-function mutations on a whole-genome scale has been reported only for the unicellular budding yeast Saccharomyces cerevisiae (2–4). Although targeted gene replacement via homologous recombination is extremely facile in yeast, its efficiency in most multicellular eukaryotes does not yet allow for the creation of a set of genome-wide gene disruptions (5, 6). Gene silencing has recently been used to study the role of ∼86% of the predicted genes of the Caenorhabditis elegans genome in several developmental processes (7, 8). The RNA interference (RNAi) method has, however, several drawbacks, including the lack of stable heritability of a phenotype, variable levels of residual gene activity (9–11), and the inability to simultaneously silence several unrelated genes (12).

By comparison, the creation of genome-wide collections of sequence-indexed insertion mutants has several advantages (5). We selected Agrobacterium T-DNA to generate a large collection of sequence-indexed Arabidopsis insertion mutants. About 150,000 transformed plants (T1 plants) expressing a T-DNA–located kanamycin-resistance gene (NPTII) were selected and individually propagated (13). To estimate the number of unlinked T-DNA insertions per plant line, the segregation of antibiotic resistance was scored in the progeny of 100 T1 plants. The average number of T-DNA insertions per line was found to be ∼1.5 [a number similar to other T-DNA collections (14)], and therefore, the entire collection was estimated to contain 225,000 independent T-DNA integration events. Given its genome size of ∼125,000 kb, average gene length (x) of ∼2 kb, and a random distribution of insertion events, and disregarding that insertions in essential female gametophyte genes cannot be recovered, there is a 96.6% probability (P) of obtaining an insertion in an average Arabidopsis gene [where P = 1 – (1 – x/125,000)^n, with x in kb, and n = the total number of insertions in the genome (15)].
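The saturation estimate in the preceding paragraph can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' calculation: the function names `estimated_insertions` and `hit_probability` are illustrative, and the inputs are the rounded figures quoted in the text (150,000 lines, ∼1.5 insertions per line, a 2-kb gene, a 125,000-kb genome).

```python
def estimated_insertions(lines: int, insertions_per_line: float) -> float:
    """Total independent insertions = number of lines x mean insertions per line."""
    return lines * insertions_per_line


def hit_probability(gene_kb: float, genome_kb: float, n_insertions: float) -> float:
    """P = 1 - (1 - x/G)^n: chance that at least one of n randomly placed
    insertions lands in a gene of length x within a genome of length G."""
    return 1.0 - (1.0 - gene_kb / genome_kb) ** n_insertions


# Rounded values taken from the text above.
n = estimated_insertions(150_000, 1.5)
p = hit_probability(2.0, 125_000.0, n)

print(f"estimated insertions: {n:,.0f}")
print(f"P(hit in an average 2-kb gene): {p:.3f}")
```

With these rounded inputs the formula evaluates to roughly 0.97, close to the 96.6% quoted in the text. The small difference may come from unrounded inputs in the original calculation; that is a guess, not something the text states. Because P depends on x, genes shorter than the average are less likely to carry an insertion.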

To determine the precise genomic location of each T-DNA integration event, we developed a high-throughput insertion-site recovery system (13). In total, 127,706 T1 plants were processed, resulting in 99,230 T-DNA/genome junction sequences (GenBank accession numbers). The integration site of each T-DNA was located by alignment of each junction sequence with the five Arabidopsis pseudochromosomes [GenBank release date 20 August 2002; (13)]. Insertions in low-complexity regions and tandem repetitive DNAs, including 180–base pair centromeric elements and ribosomal RNA gene repeats [∼12 million base pairs (Mbp)], were excluded from this analysis. Also excluded were apparent polymerase chain reaction (PCR) plate cross-contaminants and T-DNA insertions in large, perfectly duplicated (100% sequence identity) chromosomal regions, which appeared to be artifacts of chromosome pseudomolecule assembly from individual bacterial artificial chromosome clones. In total, a conservative set of 88,122 high-quality T-DNA integration-site sequences were mapped onto the genome sequence, and a single genomic location was unambiguously determined. These sequences were used for all of the analyses presented below (Table 1, table S1).

Table 1.

Distribution of T-DNA insertions in genes and intergenic regions.

Region               Chr. 1   Chr. 2   Chr. 3   Chr. 4   Chr. 5    Total
Promoter              5,488    3,376    4,452    3,076    4,900   21,292
5′UTR                 1,243      737      951      680    1,099    4,710
Coding exon           5,089    2,960    3,988    2,871    4,440   19,348
Intron                2,663    1,507    1,840    1,681    2,284    9,975
3′UTR                 1,621      914    1,263      966    1,535    6,299
Intergenic regions    6,861    4,323    5,180    3,813    6,321   26,498
Total                22,965   13,817   17,674   13,087   20,579   88,122
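The counts in Table 1 can be checked for internal consistency and turned into per-category shares with a short script. This is only an illustrative sketch: the variable names are invented here, and the numbers are transcribed directly from the table.

```python
# Insertion counts per genetic element for chromosomes 1 to 5 (from Table 1).
counts = {
    "Promoter":           [5488, 3376, 4452, 3076, 4900],
    "5'UTR":              [1243,  737,  951,  680, 1099],
    "Coding exon":        [5089, 2960, 3988, 2871, 4440],
    "Intron":             [2663, 1507, 1840, 1681, 2284],
    "3'UTR":              [1621,  914, 1263,  966, 1535],
    "Intergenic regions": [6861, 4323, 5180, 3813, 6321],
}

# Totals as printed in the table, used only for cross-checking.
reported_row_totals = {
    "Promoter": 21292, "5'UTR": 4710, "Coding exon": 19348,
    "Intron": 9975, "3'UTR": 6299, "Intergenic regions": 26498,
}
reported_col_totals = [22965, 13817, 17674, 13087, 20579]

row_totals = {name: sum(values) for name, values in counts.items()}
col_totals = [sum(column) for column in zip(*counts.values())]
grand_total = sum(col_totals)

# Share of all mapped insertions that falls in each category.
shares = {name: total / grand_total for name, total in row_totals.items()}

for name, share in shares.items():
    print(f"{name:<20s}{row_totals[name]:>8,d}{share:>8.1%}")
print(f"{'Total':<20s}{grand_total:>8,d}")
```

These shares are raw counts and are not normalized by how much of the genome each category occupies, so they do not by themselves indicate an integration bias.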

Our analysis of these 88,122 T-DNA insertion site sequences revealed that mutations had been identified in 21,799 of the 29,454 annotated genes, or ∼74.0% of the Arabidopsis genes. In addition, two or more alleles have been identified for 15,265 genes. Analysis of multiple alleles is often crucial for gene function studies, because the fraction of mutations not linked to a T-DNA could be as high as ∼60% (14). With the exception of transposons, no significant bias was detected for T-DNA insertions for any of the gene functional categories (16). A highly nonuniform chromosomal distribution of integration events was observed (Fig. 1A). We found preferred sites of T-DNA integration or “hot spots,” as well as “cold spots” [Fig. 1A, fig. S1, and (13)]. As for whole chromosomes, fewer T-DNA integration events were consistently observed in regions surrounding each of the five centromeres (Fig. 1, B to F). The density of T-DNA insertion events was closely correlated with gene density along each chromosome: The number of T-DNA integration events decreased dramatically as our examination moved toward the centromeres from the gene-rich chromosome arms. These pericentromeric regions in Arabidopsis chromosomes are known to have lower gene density and a higher concentration of transcriptionally silent transposons and pseudogenes (1, 17).

Fig. 1.

Nonuniform distribution of T-DNAs in the Arabidopsis genome. (A) Comparison between the observed and random distribution of T-DNAs. The genome was divided into windows of 50 kb, and the windows were binned according to number of insertions. The expected number was calculated by independently permuting the insertions across windows on each chromosome. Twenty such permutations were used to estimate the expected values and variance-covariance matrix of the counts that was used in significance testing. The observed distribution is shown in yellow, and the expected random distribution is in red. There is an excess of fragments with either high or low amounts of actual T-DNA insertions (hot spots, right side; cold spots, left side). (B to F) T-DNA and gene distribution along the five chromosomes. The chromosomes were divided into windows of 50 kb. The number of T-DNA insertions and the number of predicted genes in each 50-kb window were plotted in black and red, respectively. The black and red lines represent the best fitting function for the T-DNA and gene distribution. The area between the discontinuous vertical lines corresponds to the pericentromeric regions deduced from references (1, 17).
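The window-based comparison described in the Fig. 1A legend can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes that "permuting the insertions across windows" can be approximated by re-placing each insertion uniformly at random along one chromosome, it omits the variance-covariance estimate used for significance testing, and every function name and the toy data are hypothetical.

```python
import random
from collections import Counter

WINDOW_BP = 50_000  # 50-kb windows, as in the figure legend


def window_counts(positions, chrom_length_bp):
    """Count insertions per window; positions are 0-based bp coordinates."""
    n_windows = -(-chrom_length_bp // WINDOW_BP)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // WINDOW_BP] += 1
    return counts


def occupancy_histogram(counts):
    """Map k -> number of windows that contain exactly k insertions."""
    return Counter(counts)


def expected_histogram(n_insertions, chrom_length_bp, n_permutations=20, seed=0):
    """Average occupancy histogram over random re-placements of the insertions."""
    rng = random.Random(seed)
    total = Counter()
    for _ in range(n_permutations):
        shuffled = [rng.randrange(chrom_length_bp) for _ in range(n_insertions)]
        total.update(occupancy_histogram(window_counts(shuffled, chrom_length_bp)))
    return {k: v / n_permutations for k, v in sorted(total.items())}


# Toy data (hypothetical): a 1-Mbp chromosome with two clusters of insertions.
toy_length = 1_000_000
toy_positions = [10_000 + 100 * i for i in range(50)] + \
                [900_000 + 1_000 * i for i in range(10)]

observed = occupancy_histogram(window_counts(toy_positions, toy_length))
expected = expected_histogram(len(toy_positions), toy_length)

print("observed:", dict(sorted(observed.items())))
print("expected:", expected)
```

In the toy data nearly all windows are empty and two are heavily loaded, so the observed histogram has excess mass at both extremes relative to the random expectation. This is the same qualitative pattern the legend describes as cold spots and hot spots.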

Next, we examined the preference for T-DNA insertion events within particular genetic elements, including 5′ and 3′ untranslated regions (UTRs), coding exons, introns, and predicted promoter regions (13). The coordinates for each of these elements were deduced either from full-length cDNA sequences, which are available for 11,930 genes (table S2), or from gene predictions from the latest release of the Arabidopsis genome annotation (table S1). No significant differences were observed between the frequencies of insertion events in 5′UTRs versus 3′UTRs, nor were there differences between coding exons versus introns [Table 1, fig. S2, and (13)]. However, a significant bias was seen against integration events in introns and coding exons in favor of 5′UTRs, 3′UTRs, and promoters. Moreover, when all intergenic regions were compared with all genes, we detected a small bias toward T-DNA insertions in the intergenic regions. Although there were no effects of G + C content on T-DNA integration sites observed at the genome scale (13), we found a positive correlation between the G + C content and the number of insertions in promoters, 5′UTRs, exons, and intergenic regions. In contrast, we detected a negative correlation between G + C content and insertion frequency in introns and no correlation in 3′UTRs (fig. S2).

Although the precise mechanism of T-DNA integration in the host genome is not fully understood, a variety of host proteins appear to play important roles not only in T-DNA transport but also in integration processes (18). For example, the plant VIP2 protein, which is thought to interact with the transcriptional machinery, directly interacts with Agrobacterium VirE2 (a bacterial protein associated with the T-strand) (19). It is conceivable that the bias toward promoters and UTRs is the result of preferential interaction of the Vir proteins with host proteins involved in initiation or termination of transcription.

As recently reported for HIV integration into the human genome, the process of DNA integration can be significantly affected by gene activity (20). Thus, another plausible model for integration site preference is that uncoiling of the DNA helix during transcription initiation and termination at 5′ and 3′UTRs may allow greater accessibility to the T-DNA integration machinery (20). To test this possibility, we assessed genome-wide gene expression levels using unique expressed sequence tags (ESTs) present in GenBank for each Arabidopsis gene, as well as microarray analysis to examine the expression levels for ∼22,000 genes in plants grown under a variety of different conditions (table S3). We observed no significant correlation between the level of gene expression and the frequency of T-DNA integration. Caveats of this conclusion are that the exact cell-type infected by Agrobacterium is not known and that mixed-stage flowers may not be adequately representative of expression in the highly specialized female gametophyte cells that are the most likely target for transformation (21, 22).

To test the utility of the sequence-indexed Arabidopsis insertion mutant collection for genome-wide functional analysis, we targeted genes in the response pathway of the plant hormone ethylene (23). This simple hydrocarbon is an essential regulator of plant disease resistance, fruit ripening, and a variety of other important developmental processes in plants. The transcriptional activation of genes in response to ethylene depends on the plant-specific EIN3 and EIN3-like (EIL) family of DNA binding proteins, and these are among the most downstream signaling components in the ethylene pathway (23). Thus far, the only described direct target of this family of transcription factors is ERF1 (ETHYLENE RESPONSE FACTOR1) (24), a member of a large family of AP2-like DNA binding transcription factors known as EREBPs. To identify new genes involved in responses to this important plant-growth regulator, we used Affymetrix gene expression arrays to examine the RNA levels of more than 22,000 genes in response to ethylene (13). We identified 628 genes whose levels of expression were significantly altered by treatment with exogenous ethylene; 244 genes were induced and 384 genes were repressed by hormone treatment (table S4). The distribution according to ontology of these genes indicated that ethylene affected genes involved in many types of biological processes, from metabolism to signal transduction (16). In total, by searching our sequence-indexed T-DNA insertion-site database (25), T-DNA insertion mutations for 179 inducible and 287 repressible genes were identified (i.e., for 74.2% of all ethylene-regulated genes) (26). This percentage is in agreement with the total proportion of genes disrupted in this collection (74%). Of particular interest, in addition to ERF1, we found that the expression levels of 14 of 141 AP2 domain–containing genes were affected by ethylene treatment of etiolated seedlings (Fig. 2). 
In particular, four of six genes that encode proteins with two plant-specific DNA binding domains, AP2 and B3, were found to be ethylene-inducible [(16), Fig. 3B]; these genes were named ETHYLENE RESPONSE DNA BINDING FACTORS1 to 4.
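The coverage figures quoted for the ethylene-regulated genes can be recomputed directly from the counts in the text. The sketch below is illustrative only and its variable names are invented here. It compares the fraction of ethylene-regulated genes with an insertion against the genome-wide fraction, without any formal statistical test.

```python
# Counts quoted in the text.
induced_total, repressed_total = 244, 384      # ethylene-regulated genes
induced_hit, repressed_hit = 179, 287          # of those, genes with an insertion
genes_hit, genes_annotated = 21_799, 29_454    # genome-wide figures

ethylene_total = induced_total + repressed_total
ethylene_hit = induced_hit + repressed_hit

ethylene_coverage = ethylene_hit / ethylene_total
genome_coverage = genes_hit / genes_annotated

print(f"ethylene-regulated genes with insertions: "
      f"{ethylene_hit}/{ethylene_total} = {ethylene_coverage:.1%}")
print(f"all annotated genes with insertions:      "
      f"{genes_hit}/{genes_annotated} = {genome_coverage:.1%}")
print(f"induced only:   {induced_hit / induced_total:.1%}")
print(f"repressed only: {repressed_hit / repressed_total:.1%}")
```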

Fig. 2.

Functional analysis of the AP2/EREBP multigenic family. Neighbor-joining tree of AP2 domain–containing proteins in Arabidopsis was constructed with ClustalWPPC and PAUP3.1.1 software (see supplementary data). The non-EREBP AP2-containing proteins are on the same branch of the tree and are highlighted in yellow. Six EREBP-like genes closely related to the AINTEGUMENTA family (EDFs) are highlighted in green. AP2 domain–encoding genes that were induced or repressed by ethylene by a factor of at least two are highlighted in red and blue, respectively. Insertions in promoters or transcribed regions were found for 69 of the 141 AP2 domain–encoding genes and are marked with asterisks.

Fig. 3.

EDF knockouts. (A) Schematic representation of the four EDF family members with the respective positions of the T-DNA insertions. AP2 and B3 domains are highlighted. The coordinates of the T-DNA insertions in the promoter regions are indicated with respect to the translation start site. Insertions marked in black and red were identified by PCR screening and database search, respectively. (B) Expression levels of the EDF genes in wild-type plants and the T-DNA mutants with or without 10 ppm ethylene. Total RNA was loaded at 30 μg per lane. (C) Ethylene-insensitive phenotypes of two different quadruple-mutant combinations. Wild-type and quadruple mutants were cold-treated and germinated as described in (27) for 5 days. (D) ANOVA tables for the logarithm of root and hypocotyl lengths. The hormone:genotype term indicates that both quadruple mutants respond to the hormone treatment significantly differently from the wild type. Error bars indicate 95% confidence intervals.

By searching the sequence-indexed insertion mutant database and using gene-specific PCR primers with a multidimensional DNA pooling approach, we were able to identify insertion mutant plants for each of the EDF family members (13). Although no detectable alterations in morphology were observed in the ethylene responses in any of the single mutants (Fig. 3, A and B), we found significant ethylene insensitivity in multiple-mutant plants (Fig. 3, C and D). These findings reveal an important role for the EDF1 to 4 genes in the response to ethylene. Moreover, the lack of observed phenotypes in the individual edf mutants implies a significant degree of functional overlap among the EDF gene family members. The residual ethylene sensitivity observed in the quadruple-mutant plants is consistent with the fact that the EDF genes represent only one branch of the ethylene response.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 and S2

Tables S1 to S4 and 12 data tables

