The Effects of Artificial Selection on the Maize Genome


Science  27 May 2005:
Vol. 308, Issue 5726, pp. 1310-1314
DOI: 10.1126/science.1107891

This article has a correction.


Domestication promotes rapid phenotypic evolution through artificial selection. We investigated the genetic history by which the wild grass teosinte (Zea mays ssp. parviglumis) was domesticated into modern maize (Z. mays ssp. mays). Analysis of single-nucleotide polymorphisms in 774 genes indicates that 2 to 4% of these genes experienced artificial selection. The remaining genes retain evidence of a population bottleneck associated with domestication. Candidate selected genes with putative function in plant growth are clustered near quantitative trait loci that contribute to phenotypic differences between maize and teosinte. If we assume that our sample of genes is representative, ∼1200 genes throughout the maize genome have been affected by artificial selection.

Maize domestication has resulted in highly modified inflorescence and plant architecture (1). Improvement after domestication has also resulted in striking changes in yield, plant habit, biochemical composition, and other traits. At the genetic level, these phenotypic shifts are the result of strong directional (artificial) selection on target genes. With rare exceptions (2), targeted genes have not been identified.

Most domesticated plants and animals have experienced a “domestication bottleneck” that reduced genetic diversity relative to their wild ancestor (3). This bottleneck affects all genes in the genome and modifies the distribution of genetic variation among loci. The magnitude and variance of the reduction in genetic diversity across loci provide insights into the demographic history of domestication. However, in genes targeted by artificial selection, genetic diversity is reduced above and beyond that caused by the domestication bottleneck (4). Selection is similar to a more severe bottleneck (5) that removes most (or all) of the genetic variation from a target locus.

Here, we report single-nucleotide polymorphism (SNP) diversity in 774 gene fragments (100 to 900 base pairs) in a sample of 14 maize inbred lines (representing modern maize) and 16 inbred teosintes (tables S1 and S2). In this gene set, we have identified 3463 SNPs in maize and 6136 SNPs in teosinte. The polymorphism data are generally consistent with a population bottleneck during the domestication of maize.

Diversity, as measured by Watterson's estimator of the population mutation parameter θ (6), is reduced in maize relative to teosinte (Fig. 1). Our maize sample has about 57% of the variability found in its progenitor. This is somewhat lower than previous estimates that were based on a smaller number of genes (7, 8). The difference in part reflects differences in sampling and the presence of several loci with no polymorphism in maize; 65 maize genes in our data set contain no segregating sites.
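As a minimal sketch of the diversity measure used above, Watterson's estimator divides the number of segregating sites S by the harmonic-type constant a_n = 1 + 1/2 + ... + 1/(n - 1) for a sample of n sequences. The function name, the optional per-site scaling, and the example counts below are illustrative assumptions and are not taken from the study's data.

```python
def watterson_theta(segregating_sites, sample_size, length_bp=None):
    """Watterson's estimator of theta from the number of segregating sites.

    segregating_sites: count of polymorphic sites (S) in the alignment
    sample_size:       number of sequences sampled (n), at least 2
    length_bp:         optional alignment length; if given, theta is per site
    """
    if sample_size < 2:
        raise ValueError("need at least two sequences")
    # a_n = sum_{i=1}^{n-1} 1/i
    a_n = sum(1.0 / i for i in range(1, sample_size))
    theta = segregating_sites / a_n
    if length_bp is not None:
        theta /= length_bp
    return theta


if __name__ == "__main__":
    # Hypothetical locus: S = 8 in 16 teosinte lines, S = 4 in 14 maize lines.
    theta_teosinte = watterson_theta(8, 16)
    theta_maize = watterson_theta(4, 14)
    print(theta_maize / theta_teosinte)  # maize diversity relative to teosinte
```

A locus with no segregating sites in maize, as for 65 genes in this data set, gives an estimate of exactly zero under this formula.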

Fig. 1.

Patterns of diversity in maize and teosinte at 774 gene fragments. The first column plots the observed data; the second column graphs a simulated data set with k = 2.45 (14). The first row illustrates the relationship between mean values of θ in teosinte (x axis) versus maize (y axis). Dashed diagonal lines have a slope of 1.0, representing equal diversity between taxa; solid lines are regression lines. Each square represents a single gene, with genes inferred to have been under selection in gray. The second row plots the relationship between estimates of the population recombination rate ρ in teosinte (x axis) versus maize (y axis) (14). Estimates of ρ were calculated using Hudson's (9) composite likelihood estimator. The third row shows a histogram of the frequency distribution of Tajima's D in maize (black) and teosinte (gray). The cumulative distribution of Tajima's D is also given. ΔD is the difference in average D values between maize and teosinte.

Linkage disequilibrium (LD) is increased in maize relative to teosinte. We have estimated the population recombination parameter ρ (9), which is inversely proportional to LD. The average estimate of ρ in maize is 17% that of teosinte (Fig. 1). Thus, estimates of ρ in maize have been reduced more drastically than estimates of θ, as expected under a recent population bottleneck (10). The ratio ρ/θ, the relative rate of recombination to mutation under the neutral equilibrium model, has declined sharply in maize relative to teosinte, from 4.5 in teosinte to 1.5 in maize. These results suggest that patterns of LD in maize are strongly influenced by population history, perhaps more so than by extant recombination rates.

Finally, the frequency distribution of polymorphisms, as measured by Tajima's D, has shifted between maize and teosinte (Fig. 1). In teosinte, polymorphisms are skewed toward rare variants, and the average D (D̄) across loci is -0.50. In contrast, D̄ is slightly positive (0.04) in maize, indicating a shift toward higher frequency alleles. This frequency shift is also expected after a recent bottleneck (11), because increased rates of genetic drift during a bottleneck tend to remove rare variants preferentially.
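To make the sign of Tajima's D concrete, the following sketch computes D from a small alignment using the standard formula (the difference between mean pairwise diversity and S/a1, scaled by its estimated standard deviation). The function name and the toy alignments are assumptions for illustration; the sketch ignores gaps and missing data, which a real analysis would need to handle.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for equal-length aligned sequences; None if no site varies."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences for a nonzero variance term")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")

    # S: number of alignment columns with more than one state
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima's variance approximation
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


if __name__ == "__main__":
    rare_variants = ["AAAA", "TAAA", "ATAA", "AATA"]   # singletons only
    common_variants = ["AA", "AA", "TT", "TT"]         # intermediate frequency
    print(tajimas_d(rare_variants), tajimas_d(common_variants))
```

An excess of rare variants, as in teosinte, pushes D below zero, whereas the loss of rare variants after a bottleneck moves D toward or above zero.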

Our genome-wide estimates of SNP diversity from maize and teosinte provide a basis to estimate the demographic history of maize and to test for selection. We used coalescent simulations to infer the severity of the domestication bottleneck. Our coalescent model incorporates information about maize, such as the domestication time ∼7500 years ago (7, 12) and the inference of a single domestication event (13). The model also uses diversity data from teosinte to control for variation in stochastic effects, mutation rates, and recombination rates among loci. Our simulation method differs from previously published methods in the use of a rejection-sampling scheme, which fits simulated data to multiple summaries of the teosinte data. Our reasoning for the rejection-sampling approach is that the demographic history of teosinte is unknown but is reflected in sequence data. By conditioning simulated data on observed data, our simulations capture much of the historical features of each locus. In the rejection-sampling process, simulations that mimic teosinte data are retained, compared to maize data, and then interpreted in a likelihood framework for parameter estimation and model testing (14).
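The idea of rejection sampling can be illustrated with a deliberately simplified toy, which is not the authors' simulation (their procedure is described in the supporting material (14) and conditions on multiple summaries of the teosinte data). Here a single neutral, constant-size, non-recombining locus is simulated, θ is drawn from a uniform prior, and a draw is retained only if the simulated number of segregating sites equals the observed count. All function names, the prior range, and the example values are assumptions.

```python
import random


def poisson_draw(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(theta, sample_size, rng):
    """Simulate S for one locus under the standard neutral coalescent."""
    total_branch_length = 0.0
    for lineages in range(2, sample_size + 1):
        # Waiting time with `lineages` lineages, in units of 2N generations
        rate = lineages * (lineages - 1) / 2.0
        total_branch_length += lineages * rng.expovariate(rate)
    # Mutations fall on the tree as a Poisson process with rate theta / 2
    return poisson_draw(theta * total_branch_length / 2.0, rng)


def rejection_sample_theta(observed_s, sample_size, n_accept, rng, theta_max=20.0):
    """Keep prior draws of theta whose simulated S matches the observed S."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(0.0, theta_max)
        if simulate_segregating_sites(theta, sample_size, rng) == observed_s:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(2005)
    kept = rejection_sample_theta(observed_s=8, sample_size=16, n_accept=200, rng=rng)
    print(sum(kept) / len(kept))  # mean of the retained theta values
```

The retained draws approximate the distribution of θ given the observed summary; in the study's setting, the retained teosinte-like simulations would then be passed through a bottleneck and compared with the maize data.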

The primary parameter of interest is maize bottleneck severity (k), which is the ratio of the size of the bottlenecked population (Nb) to the duration of the bottleneck (d) in generations. To estimate k, we simulated teosinte and maize data for each of 774 loci, varying bottleneck severity for different sets of simulations to find the best fitting model. The multilocus data are most consistent with a domestication bottleneck of moderate size, with k̂, the maximum likelihood (ML) estimate of k, equal to 2.45 (Fig. 2A). This estimate is slightly smaller than a previous estimate based on 12 loci (7), but the inference is robust to variation in model parameters such as domestication time, size of the predomestication population, and size of the current maize population (fig. S2). With independent information about d, k̂ provides insight into the founding population of maize. For example, the archaeological record suggests that the maximum estimate of d for the domestication of maize is ∼2800 years (12), assuming one generation per year for an annual plant. Under this time scale, Nb is 6860 chromosomes, which implies that fewer than 3500 individuals, or <10% (15, 16) of the teosinte population, contributed to the genetic diversity captured in our maize sample.
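The arithmetic behind the founding-population figure can be written out as follows. The conversion from chromosomes to individuals assumes diploidy (two chromosome copies per plant), which is consistent with the "fewer than 3500 individuals" stated above.

```latex
\begin{align*}
k = \frac{N_b}{d}
  \quad\Longrightarrow\quad
  N_b &= \hat{k}\, d = 2.45 \times 2800 = 6860 \ \text{chromosomes}, \\
\text{individuals} &= \frac{N_b}{2} = 3430 < 3500 .
\end{align*}
```

Because d = 2800 generations is the maximum duration suggested by the archaeological record, a shorter bottleneck with the same k̂ would imply a proportionally smaller Nb.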

Fig. 2.

Likelihood results fitting the population bottleneck. (A) Likelihood surface for the strength of the population bottleneck, k, based on all 774 loci. (B) The black points show the likelihood surface fitting f, the proportion of genes in the severe bottleneck (selected) class, using the most likely two-bottleneck model (k1 = 2.45, k2 = 0.15) and all 774 loci. The gray hatched curve to the right shows the likelihood surface fitting f using the most likely two-bottleneck model (k1 = 2.45, k2 = 0.001) and loci with 10 or more segregating sites in teosinte. The two curves on the left, both with an apex at f = 0.00, represent the likelihood surface for f based on analysis of data sets simulated with a single bottleneck using rejection sampling (14). The analysis correctly estimates f to be zero, as expected under neutrality. The light gray curve with apex at f = 0.00 is based on a simulated data set from one bottleneck that is also conditioned on the observed frequency spectrum (Tajima's D) in teosinte.

How well do the observed data fit this simple bottleneck model? We generated a simulated data set under the ML estimate k̂ = 2.45 and compared simulated data to our observed data. The observed shifts in diversity (θ), frequency spectrum (D), and recombination (ρ) between teosinte and maize fit closely with simulated data (Fig. 1). However, the absolute values of D in simulated data do not fully agree with observed values, perhaps because of some feature of teosinte demographic history that is not fully captured by our model. Nonetheless, the most likely bottleneck model is generally consistent with observed patterns of sequence diversity.

Although there is a generally good fit of the bottleneck model, selection at some loci can skew the distribution of polymorphism across loci. In particular, ∼10% of our loci have zero diversity in maize; it is unclear whether a domestication bottleneck alone can explain this observation. To examine this, we developed a likelihood ratio (LR) test to determine whether the entire data set is consistent with a single domestication bottleneck or whether the multilocus data are better explained by two classes of genes—one consisting of nonselected genes that have experienced the domestication bottleneck (k1), the other consisting of selected genes that have experienced a more severe bottleneck (k2) that mimics selection. Our method simultaneously estimates the severity of both bottlenecks (k1 and k2) and also estimates the proportion of genes (f) that fit the severe bottleneck.

The LR test provides statistically significant support for the presence of two gene classes (LR = 5.8, df = 1, P < 0.05). The first class has experienced a domestication bottleneck of severity k̂1 = 2.45, and the second class has undergone a bottleneck of more than 10 times the intensity (k̂2 = 0.15). Under this model, the ML estimate of f is 0.02, indicating that 2% of our 774 maize genes are in the selected class. However, some of our 774 genes have low diversity in teosinte (Fig. 1). These genes provide little information to discriminate between the two gene classes and therefore affect the estimation of f. When we conduct the LR test on 275 genes with relatively high polymorphism in teosinte (10 or more segregating sites), the proportion of genes under selection increases to f̂ = 3.6% (LR = 4.6, P < 0.05, k̂1 = 2.45, k̂2 = 0.001). Thus, our likelihood analysis estimates that 2 to 4% of maize genes have been selected during maize domestication and improvement. Note that nonzero estimates of f are not expected under a single, moderate population bottleneck, even when we account fully for the skewed frequency distribution in teosinte (Fig. 2B).
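The reported significance levels can be checked with a short calculation, under the assumption that LR denotes the usual likelihood-ratio statistic referred to a chi-square distribution with 1 degree of freedom (the text gives only LR, df = 1, and P < 0.05). The helper name is hypothetical; for 1 degree of freedom the upper-tail probability reduces to the complementary error function.

```python
from math import erfc, sqrt


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    If Z is standard normal, P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return erfc(sqrt(statistic / 2.0))


if __name__ == "__main__":
    for label, lr in (("all 774 loci", 5.8), ("275 high-polymorphism loci", 4.6)):
        print(label, chi2_sf_df1(lr))
```

Both statistics exceed the 5% critical value of about 3.84, in agreement with the P < 0.05 reported for each test.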

Given these results, we used our likelihood framework to identify candidate selected genes by calculating the posterior probability (PP) that a gene is in the selected class (14). Table 1 shows the ranked PP for the top 4% of the candidate genes in our data set, as well as the PP for the tb1 locus. tb1 is included as a positive control because there is strong morphological and genetic evidence for selection on tb1 during domestication (2, 17, 18). The statistical power to detect selection may be higher for tb1 than for most of our genes because the sequences are longer and the maize sample is larger. Nonetheless, our method correctly identifies tb1 as a member of the selected class, with tb1 assigned the highest PP (87%) among all genes.
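One common way to obtain such a posterior probability from a two-class mixture is Bayes' rule with the mixing proportion f as the prior weight of the selected class. The sketch below assumes that form; the authors' exact calculation is described in the supporting material (14), and the function names and per-locus likelihood values here are hypothetical.

```python
from math import log


def posterior_selected(lik_selected, lik_neutral, f):
    """Posterior probability that a locus belongs to the selected class.

    lik_selected: likelihood of the locus data under the severe bottleneck (k2)
    lik_neutral:  likelihood of the locus data under the domestication bottleneck (k1)
    f:            proportion of loci in the selected class
    """
    numerator = f * lik_selected
    denominator = numerator + (1.0 - f) * lik_neutral
    if denominator == 0.0:
        raise ValueError("both likelihoods are zero")
    return numerator / denominator


def mixture_log_likelihood(per_locus_likelihoods, f):
    """Log-likelihood of f given (lik_selected, lik_neutral) pairs for each locus."""
    return sum(
        log(f * lik_sel + (1.0 - f) * lik_neu)
        for lik_sel, lik_neu in per_locus_likelihoods
    )


if __name__ == "__main__":
    # Hypothetical locus whose data are 49 times more likely under the severe bottleneck.
    print(posterior_selected(49.0, 1.0, f=0.02))
```

With f = 0.02, a locus needs to be many times more likely under the severe bottleneck before its posterior probability becomes appreciable, which is why only loci with strongly reduced maize diversity rank highly.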

Table 1.

Candidate selected genes, sorted by posterior probability (PP).

| Rank | Gene or locus | PP | Gene name or description | Putative gene function |
|------|---------------|------|--------------------------|------------------------|
| | tb1 | 0.87 | Teosinte branched 1 (tb1) | Transcription factor |
| 1 | AY112154 | 0.77 | Ribosomal protein L28 family | Structural constituent of ribosome |
| 2 | AY105809 | 0.65 | Acetyl transferase | Transferase activity |
| 3* | AY107228 | 0.64 | Dihydrodipicolinate synthase (DHPS) | Lysine biosynthesis |
| 4* | AY106600 | 0.61 | Adenylosuccinate synthetase | Purine biosynthesis |
| 5 | AY104983 | 0.59 | Heat shock protein (hsp70) | Chaperone activity; protein folding |
| 6† | AY105958 | 0.58 | Auxin-induced protein | Transcription factor; response to sucrose |
| 7 | AY111546 | 0.57 | Unknown | Unknown |
| 8† | AY108246 | 0.55 | Growth factor | |
| 9 | AY104090 | 0.54 | Transmembrane protein | |
| 10 | AY111438 | 0.54 | Glycosyltransferase | Transferase activity; carbohydrate biosynthesis |
| 11 | AY110082 | 0.53 | Heat shock protein | Chaperone activity; protein folding |
| 12† | AY104948 | 0.49 | Auxin response factor (ARF 1) | Transcription factor |
| 13 | AY106970 | 0.48 | Unknown | |
| 14 | AY112083 | 0.47 | Minichromosome maintenance factor 5 | DNA-dependent ATPase activity; DNA replication initiation |
| 15 | AY106111 | 0.46 | Hexokinase 1 (HXK 1) | ATP binding |
| 16 | AY108481 | 0.45 | Unknown | |
| 17 | AY104147 | 0.44 | F-box family protein | |
| 18* | AY107907 | 0.43 | Chorismate mutase (CM2) | Aromatic amino acid biosynthesis |
| 19* | AY105062 | 0.43 | Microsomal signal peptidase (SPC25) | Peptidase activity |
| 20* | AY107903 | 0.43 | Ubiquitin C-terminal hydrolase family protein | Ubiquitin-dependent protein catabolism |
| 21 | AY104037 | 0.41 | Aconitate hydratase | Aconitate hydratase activity |
| 22 | AY107173 | 0.40 | Unknown | |
| 23 | AY103840 | 0.39 | Ubiquitin/transferase family protein | Trichome branching; DNA endoreduplication |
| 24 | AY108543 | 0.38 | Early responsive to dehydration protein | |
| 25† | AY104065 | 0.36 | Cell elongation protein DWARF1 | Steroid biosynthesis; cell elongation; response to light |
| 26* | AY104439 | 0.33 | Indole synthase | Indole biosynthesis |
| 27 | AY108187 | 0.31 | Unknown | |
| 28 | AY106496 | 0.30 | Unknown | |
| 29 | AY104530 | 0.28 | CBL-interacting protein kinase 3 (CIPK3) | Kinase activity; response to abiotic stimulus |
| 30 | AY107475 | 0.27 | Basic endochitinase | Chitinase activity; antifungal peptide activity |

Notes: (*) genes with putative function in amino acid biosynthesis; (†) genes with putative function in plant growth.

The top candidate genes from our data set include genes known to be involved in plant growth and auxin response, which may contribute to the morphological differences between maize and teosinte (Table 1). In addition, our candidates identified a novel class of selected genes that function in amino acid biosynthesis and protein catabolism, suggesting selection for amino acid composition. Our inbred lines represent maize genetic diversity after domestication and breeding. We therefore cannot determine whether our high-PP genes were selected during the initial domestication event, during subsequent breeding and improvement, or both. However, amino acid composition is known to differ between maize and teosinte (19), and it is an important current target of selection for nutrition.

Previous studies (20, 21) have identified quantitative trait loci (QTLs) for phenotypic differences between maize and teosinte. We plotted the estimated map positions of these QTLs with the PP for each of the 638 mapped loci in our study (Fig. 3) (table S2). A number of high-PP genes map near QTLs, particularly on chromosomes 1 and 5, which suggests that we may have identified selected genes associated with these morphological differences. The average distance from estimated map positions of QTLs is significantly lower for the top 4% of the candidate genes than for the rest of our 638 loci (permutation test, P < 0.05), and thus our selected genes cluster near QTLs. However, these QTLs contribute to morphological differences between species, and some of our candidates appear to be associated with other types of selection, such as biochemical composition, that we do not expect to be associated with the QTLs. Accordingly, growth-associated candidate genes and QTL locations remain clustered (P < 0.01), but amino acid biosynthesis genes do not cluster with QTLs (P = 0.66). The distribution of candidate genes also suggests that a number of selected genes fall under a single QTL (Fig. 3); this implies that morphological differentiation at a given QTL may be caused by a cluster of loci in the same pathway, consistent with a longstanding hypothesis for maize evolution (1).
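A permutation test of this kind can be sketched as follows, assuming a single linear map, distance measured to the nearest QTL, and a one-sided comparison of mean distances. The details of the authors' test are not given in the text, so the function names, the test statistic, and the example positions are all hypothetical.

```python
import random


def nearest_qtl_distance(position, qtl_positions):
    """Distance from a locus to the closest QTL on the same linear map."""
    return min(abs(position - q) for q in qtl_positions)


def permutation_p_value(candidate_positions, other_positions, qtl_positions,
                        n_perm=10000, seed=0):
    """One-sided P value: are candidates closer to QTLs than random loci?"""
    rng = random.Random(seed)
    distances = [
        nearest_qtl_distance(p, qtl_positions)
        for p in list(candidate_positions) + list(other_positions)
    ]
    k = len(candidate_positions)
    observed = sum(distances[:k]) / k

    # Count random subsets of k loci that are at least as close as the candidates
    as_close = 0
    for _ in range(n_perm):
        subset = rng.sample(distances, k)
        if sum(subset) / k <= observed:
            as_close += 1
    # Add-one correction keeps the estimate strictly above zero
    return (as_close + 1) / (n_perm + 1)


if __name__ == "__main__":
    qtls = [50.0]
    candidates = [49.0, 50.0, 51.0]               # hypothetical high-PP loci
    others = [float(x) for x in range(100, 200, 5)]
    print(permutation_p_value(candidates, others, qtls, n_perm=2000))
```

Applying the same function to subsets of candidates (for example, growth-associated versus amino acid biosynthesis genes) mirrors the separate comparisons reported above.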

Fig. 3.

Graphs of the posterior probabilities (PP) and map positions of 638 mapped genes (squares) against the estimated chromosomal locations of QTLs (red arrows) associated with a suite of morphological differences between teosinte and maize. Light blue, green, and magenta squares denote the top 30 high-PP genes listed in Table 1. Genes in green have a putative function in plant growth (genes 6, 8, 12, and 25 in Table 1). Genes in magenta putatively function in amino acid biosynthesis (genes 3, 4, 18, 19, 20, and 26 in Table 1). The centromeric position for each chromosome is identified by a purple line under the x axis. Chromosomal positions are roughly in units of 0.25 cM.

We estimate that 2 to 4% of maize genes were selected during domestication and subsequent improvement. If we assume that our sample of genes is representative of the genome as a whole and that maize contains 59,000 genes (22), our results suggest that a minimum of 59,000 × 2% ≈ 1200 genes throughout the genome have been targets of selection during maize domestication and improvement. However, some of our candidate genes could be false positives, for three reasons. First, loci with very skewed frequency distributions in teosinte could lose more polymorphism during the domestication bottleneck than expected under our model, causing false inference of selection. Second, although our method confirms that the multilocus distribution of genetic diversity is inconsistent with a single bottleneck, any single gene could fall into the selected class by stochastic (nonselective) effects. Finally, rather than being the direct targets of selection, some loci could be hitchhiking with a site under selection. Indeed, candidate genes are significantly closer to the centromere than are noncandidates (P < 0.05), which suggests a greater chance of detecting selection in regions of reduced recombination, where hitchhiking should be more pronounced (23). Nonetheless, previous studies have shown that hitchhiked regions in maize tend to be relatively short (2, 24), and recombination in maize coding regions is sufficient to severely limit the physical extent of hitchhiking to one, or at most a few (25), genes. Additionally, the subset of candidates for growth and amino acid biosynthesis are not significantly closer than noncandidate genes to the centromeres (P = 0.49), which suggests that these candidates are direct targets of selection.
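The genome-wide extrapolation is a simple product. The first line reproduces the minimum stated above; the second line applies the same assumptions (a representative gene sample and 59,000 genes) to the upper end of the 2 to 4% range and is an illustrative extension rather than a figure given in the text. The symbol N_sel is introduced here only for this calculation.

```latex
\begin{align*}
N_{\mathrm{sel}}^{\min} &= 59{,}000 \times 0.02 = 1180 \approx 1200, \\
N_{\mathrm{sel}}^{\max} &= 59{,}000 \times 0.04 = 2360 .
\end{align*}
```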

Despite these caveats, our estimate of 2 to 4% of genes having been under selection is likely conservative, for three reasons. First, if selection acted on moderate-frequency variants in teosinte, selection may have had little effect on diversity levels (4). In such cases, there is little power to detect selection. Second, our method assumes that selected genes are represented by a single severe bottleneck. This approach may not detect genes subjected to subtle selection regimes. Third, high recombination rates within maize genes could reduce the hitchhiking effect to the point that only short fragments within genes retain the footprint of selection. In tb1, for example, selection in the promoter region did not affect diversity in the coding region (2). It is thus possible that we sequenced a region within a bona fide selected gene, but our region did not retain evidence of selection.

Maize domestication prompted phenotypic change that is more extensive than in most domesticated plant species. It is thus possible that maize has a higher proportion of selected genes than most domesticates, but similar studies of additional domesticated species are required to address this issue. We have shown that our candidate genes are associated with QTL regions underlying phenotypic differences between maize and teosinte, which suggests that they contribute to traits selected during domestication. Note that these genes are depauperate for polymorphism in maize, and hence it is unlikely that they could have been identified by methods that require segregating variation within maize, such as QTL or association analysis. The statistical methodology designed for this study will prove helpful both for identifying new candidates in maize and for application to other species, such as humans, where there is interest in selection on genes during the migration of modern humans out of Africa [e.g., (26)]. Finally, additional studies of our high-PP genes may provide important insight into the pathways and mutations responsible for maize evolution.

Supporting Online Material

Materials and Methods

Figs. S1 and S2

Tables S1 and S2


