Research Article

Shifting the limits in wheat research and breeding using a fully annotated reference genome

See allHide authors and affiliations

Science  17 Aug 2018:
Vol. 361, Issue 6403, eaar7191
DOI: 10.1126/science.aar7191

Insights from the annotated wheat genome

Wheat is one of the major sources of food for much of the world. However, because bread wheat's genome is a large hybrid mix of three separate subgenomes, it has been difficult to produce a high-quality reference sequence. Using recent advances in sequencing, the International Wheat Genome Sequencing Consortium presents an annotated reference genome with a detailed analysis of gene content among subgenomes and the structural organization for all the chromosomes. Examples of quantitative trait mapping and CRISPR-based genome modification show the potential for using this genome in agricultural research and breeding. Ramírez-González et al. exploited the fruits of this endeavor to identify tissue-specific biased gene expression and coexpression networks during development and exposure to stress. These resources will accelerate our understanding of the genetic basis of bread wheat.

Science, this issue p. eaar7191; see also p. eaar6089

Structured Abstract

INTRODUCTION

Wheat (Triticum aestivum L.) is the most widely cultivated crop on Earth, contributing about a fifth of the total calories consumed by humans. Consequently, wheat yields and production affect the global economy, and failed harvests can lead to social unrest. Breeders continuously strive to develop improved varieties by fine-tuning genetically complex yield and end-use quality parameters while maintaining stable yields and adapting the crop to regionally specific biotic and abiotic stresses.

RATIONALE

Breeding efforts are limited by insufficient knowledge and understanding of wheat biology and the molecular basis of central agronomic traits. To meet the demands of human population growth, there is an urgent need for wheat research and breeding to accelerate genetic gain as well as to increase and protect wheat yield and quality traits. In other plant and animal species, access to a fully annotated and ordered genome sequence, including regulatory sequences and genome-diversity information, has promoted the development of systematic and more time-efficient approaches for the selection and understanding of important traits. Wheat has lagged behind, primarily owing to the challenges of assembling a genome that is more than five times as large as the human genome, polyploid, and complex, containing more than 85% repetitive DNA. To provide a foundation for improvement through molecular breeding, in 2005, the International Wheat Genome Sequencing Consortium set out to deliver a high-quality annotated reference genome sequence of bread wheat.

RESULTS

An annotated reference sequence representing the hexaploid bread wheat genome in the form of 21 chromosome-like sequence assemblies has now been delivered, giving access to 107,891 high-confidence genes, including their genomic context of regulatory sequences. This assembly enabled the discovery of tissue- and developmental stage–related gene coexpression networks using a transcriptome atlas representing all stages of wheat development. The dynamics of change in complex gene families involved in environmental adaptation and end-use quality were revealed at subgenome resolution and contextualized to known agronomic single-gene or quantitative trait loci. Aspects of the future value of the annotated assembly for molecular breeding and research were exemplarily illustrated by resolving the genetic basis of a quantitative trait locus conferring resistance to abiotic stress and insect damage as well as by serving as the basis for genome editing of the flowering-time trait.

CONCLUSION

This annotated reference sequence of wheat is a resource that can now drive disruptive innovation in wheat improvement, as this community resource establishes the foundation for accelerating wheat research and application through improved understanding of wheat biology and genomics-assisted breeding. Importantly, the bioinformatics capacity developed for model-organism genomes will facilitate a better understanding of the wheat genome as a result of the high-quality chromosome-based genome assembly. By necessity, breeders work with the genome at the whole chromosome level, as each new cross involves the modification of genome-wide gene networks that control the expression of complex traits such as yield. With the annotated and ordered reference genome sequence in place, researchers and breeders can now easily access sequence-level information to precisely define the necessary changes in the genomes for breeding programs. This will be realized through the implementation of new DNA marker platforms and targeted breeding technologies, including genome editing.

Wheat genome deciphered, assembled, and ordered.

Seeds, or grains, are what counts with respect to wheat yields (left panel), but all parts of the plant contribute to crop performance. With complete access to the ordered sequence of all 21 wheat chromosomes, the context of regulatory sequences, and the interaction network of expressed genes—all shown here as a circular plot (right panel) with concentric tracks for diverse aspects of wheat genome composition—breeders and researchers now have the ability to rewrite the story of wheat crop improvement. Details on value ranges underlying the concentric heatmaps of the right panel are provided in the full article online.

Abstract

An annotated reference sequence representing the hexaploid bread wheat genome in 21 pseudomolecules has been analyzed to identify the distribution and genomic context of coding and noncoding elements across the A, B, and D subgenomes. With an estimated coverage of 94% of the genome and containing 107,891 high-confidence gene models, this assembly enabled the discovery of tissue- and developmental stage–related coexpression networks by providing a transcriptome atlas representing major stages of wheat development. Dynamics of complex gene families involved in environmental adaptation and end-use quality were revealed at subgenome resolution and contextualized to known agronomic single-gene or quantitative trait loci. This community resource establishes the foundation for accelerating wheat research and application through improved understanding of wheat biology and genomics-assisted breeding.

Wheat (Triticum aestivum L.), the most widely cultivated crop on Earth, contributes about a fifth of the total calories consumed by humans and provides more protein than any other food source (1, 2). Breeders strive to develop improved varieties by fine-tuning genetically complex yield and end-use quality parameters while maintaining yield stability and regional adaptation to specific biotic and abiotic stresses (3). These efforts are limited, however, by insufficient knowledge and understanding of the molecular basis of key agronomic traits. To meet the demands of human population growth, there is an urgent need for wheat research and breeding to accelerate genetic gain while increasing wheat yield and protecting quality traits. In other plant and animal species, access to a fully annotated and ordered genome sequence, including regulatory sequences and genome-diversity information, has promoted the development of systematic and more time-efficient approaches for the selection and understanding of important traits (4). Wheat has lagged behind other species, primarily owing to the challenges of assembling a large (haploid genome, 1C = 16 Gb) (5), hexaploid, and complex genome that contains more than 85% repetitive DNA.

To provide a foundation for improvement through molecular breeding, the International Wheat Genome Sequencing Consortium (IWGSC) established a road map to deliver a high-quality reference genome sequence of the bread wheat cultivar Chinese Spring (CS). A chromosome survey sequence (CSS) intermediate product assigned 124,201 gene loci across the 21 chromosomes and revealed the evolutionary dynamics of the wheat genome through gene loss, gain, and duplication (6). The lack of global sequence contiguity and incomplete coverage (only 10 Gb were assembled), however, did not provide the wider regulatory genomic context of genes. Subsequent whole-genome assemblies improved contiguity (79) but lacked full annotation and did not resolve the intergenic space or present the genome in the correct physical order.

Here we report an ordered and annotated assembly (IWGSC RefSeq v1.0) of the 21 chromosomes of the allohexaploid wheat cultivar CS, an achievement that is built on a rich history of chromosome studies in wheat (1012), which allowed the integration of genetic and genomic resources. The completeness and accuracy of IWGSC RefSeq v1.0 provide insights into global genome composition and enable the construction of complex gene coexpression networks to identify central regulators in critical pathways, such as flowering-time control. The ability to resolve the inherent complexity of gene families related to important agronomic traits demonstrates the impact of IWGSC RefSeq v1.0 on dissecting quantitative traits genetically and implementing modern breeding strategies for future wheat improvement.

Chromosome-scale assembly of the wheat genome

Pseudomolecule sequences representing the 21 chromosomes of the bread wheat genome were assembled by integrating a draft de novo whole-genome assembly (WGA), built from Illumina short-read sequences using NRGene deNovoMagic2 (Fig. 1A, Table 1, and tables S1 and S2), with additional layers of genetic, physical, and sequence data (tables S3 to S8 and figs. S1 and S2). In the resulting 14.5-Gb genome assembly, contigs and scaffolds with N50s of 52 kb and 7 Mb, respectively, were linked into superscaffolds (N50 = 22.8 Mb), with 97% (14.1 Gb) of the sequences assigned and ordered along the 21 chromosomes and almost all of the assigned sequence scaffolds oriented relative to each other (13.8 Gb, 98%). Unanchored scaffolds comprising 481 Mb (2.8% of the assembly length) formed the “unassigned chromosome” (ChrUn) bin. The quality and contiguity of the IWGSC RefSeq v1.0 genome assembly were assessed through alignments with radiation hybrid maps for the A, B, and D subgenomes [average Spearman’s correlation coefficient (r) of 0.98], the genetic positions of 7832 and 4745 genotyping-by-sequencing derived genetic markers in 88 double haploid and 993 recombinant inbred lines (Spearman’s r of 0.986 and 0.987, respectively), and 1.24 million pairs of neighbor insertion site–based polymorphism (ISBP) markers (13), of which 97% were collinear and mapped in a similar size range (difference of <2 kb) between the de novo WGA and the available bacterial artificial chromosome (BAC)–based sequence assemblies. Finally, IWGSC RefSeq v1.0 was assessed with independent data derived from coding and noncoding sequences, revealing that 99 and 98% of the previously known coding exons (6) and transposable element (TE)–derived (ISBP) markers (table S9), respectively, were present in the assembly. The approximate 1-Gb size difference between IWGSC RefSeq v1.0 and the new genome size estimates of 15.4 to 15.8 Gb (14) can be accounted for by collapsed or unassembled sequences of highly repeated clusters, such as ribosomal RNA coding regions and telomeric sequences.

Fig. 1 Structural, functional, and conserved synteny landscape of the 21 wheat chromosomes.

(A) Circular diagram showing genomic features of wheat. The tracks toward the center of the circle display (a) chromosome name and size (100-Mb tick size; light gray bar indicates the short arm and dark gray indicates the long arm of the chromosome); (b) dimension of chromosomal segments R1, R2a, C, R2b, and R3 [(18) and table S29]; (c) K-mer 20-frequencies distribution; (d) LTR-retrotransposons density; (e) pseudogenes density (0 to 130 genes per Mb); (f) density of HC gene models (0 to 32 genes per Mb); (g) density of recombination rate; and (h) SNP density. Connecting lines in the center of the diagram highlight homeologous relationships of chromosomes (blue lines) and translocated regions (green lines). (B) Distribution of Pfam domain PF08284 “retroviral aspartyl protease” signatures across the different wheat chromosomes. (C) Positioning of the centromere in the 2D pseudomolecule. Top panel shows density of CENH3 ChIP-seq data along the wheat chromosome. Bottom panel shows distribution and proportion of the total pseudomolecule sequence composed of TEs of the Cereba and Quinta families. The bar below the bottom panel indicates pseudomolecule scaffolds assigned to the short (black) or long (blue) arm on the basis of CSS data (6) mapping. (D) Dot-plot visualization of collinearity between homeologous chromosomes 3A and 3B in relation to distribution of gene density and recombination frequency (left and bottom panel boxes: blue and purple lines, respectively). Chromosomal zones R1, R2a, C, R2b, and R3 are colored as in (A). cM, centimorgan.

Table 1 Assembly statistics of IWGSC RefSeq v1.0.

View this table:

A key feature distinguishing the IWGSC RefSeq v1.0 from previous draft wheat assemblies (69) is the long-range organization, with 90% of the genome represented in superscaffolds larger than 4.1 Mb and with each chromosome represented, on average, by only 76 superscaffolds (Table 1). The largest superscaffold spans 166 Mb, which is half the size of the rice (Oryza sativa L.) genome and is larger than the Arabidopsis thaliana L. genome (15, 16). Moreover, the 21 pseudomolecules position molecular markers for wheat research and breeding [504 single-stranded repeats (SSRs), 3025 diversity array technologies (DArTs), 6689 expressed sequence tags (ESTs), 205,807 single-nucleotide polymorphisms (SNPs), and 4,512,979 ISBPs] (table S9), thus providing a direct link between the genome sequence and genetic loci and genes underlying traits of agronomic importance.

The composition of the wheat genome

Analyses of the components of the genome sequence revealed the distribution of key elements and enabled detailed comparisons of the homeologous A, B, and D subgenomes. Accounting for 85% of the genome, with a relatively equal distribution across the three subgenomes (Table 2), 3,968,974 copies of TEs belonging to 505 families were annotated. Many (112,744) full-length long terminal repeat (LTR)–retrotransposons were identified that have been difficult to define from short-read sequence assemblies (fig. S3). Although the TE content has been extensively rearranged through rounds of deletions and amplifications since the divergence of the A, B, and D subgenomes about 5 million years ago, the TE families that shaped the Triticeae genomes have been maintained in similar proportions: 76% of the 165 TE families present in a cumulative length greater than 1 Mb contributed similar proportions (less than a twofold difference between subgenomes), and only 11 families, accounting for 2% of total TEs, showed a higher than threefold difference between two subgenomes (17). TE abundance accounts, in part, for the size differences between subgenomes—for example, 64% of the 1.2-Gb size difference between the B and D subgenomes can be attributed to lower gypsy retrotransposon content. Low-copy DNA content (primarily unclassified sequences) also varied between subgenomes, accounting, for example, for 97 Mb of the 245-Mb size difference between A and B subgenomes (fig. S4). As reported (18), no evidence was found for a major burst of transposition after polyploidization. The independent evolution in the diploid lineages was reflected in differences in the specific composition of the A, B, and D subgenomes at the subfamily (variants) level, as evidenced by subgenome-specific overrepresentation of individual transposon domain signatures (Fig. 1B). See (17) for a more detailed analysis of the TE content and its impact on the evolution of the wheat genome.

Table 2 Relative proportions of the major elements of the wheat genome.

Proportions of TEs are given as the percentage of sequences assigned to each superfamily relative to genome size. Abbreviations in parentheses under the headings “Class 1” and “Class 2” indicate transposon types.

View this table:

In addition to TEs, annotation of the intergenic space included noncoding RNAs. We identified eight new microRNA families (fig. S5 and table S10) and the entire complement of tRNAs (which showed an excess of lysine tRNAs, fig. S6). Around 8000 nuclear-inserted plastid DNA segments and 11,000 nuclear-inserted mitochondrial DNA segments representing, respectively, 5 and 17 Mb were also revealed by comparing the genome assembly with complete plastid and mitochondrial genomes assembled from the IWGSC RefSeq v1.0 raw read data (14).

Precise positions for the centromeres were defined by integrating Hi-C, CSS (6), and published chromatin immunoprecipitation sequencing (ChIP-seq) data for CENH3, a centromere-specific histone H3 variant (19). Clear ChIP-seq peaks were evident in all chromosomes and coincided with the centromere-specific repeat families (Fig. 1C, fig. S7, and table S11). CENH3 targets were also found in unassigned sequence scaffolds (ChrUn), indicating that centromeres of several chromosomes are not yet completely resolved. On the basis of these data, a conservative estimate for the minimal average size of a wheat centromere is 4.9 Mb (6.7 Mb, if including ChrUn; table S11), compared with an average centromere size of ~1.8 Mb in maize (20, 21) and 0.4 to 0.8 Mb in rice (22).

Gene models were predicted with two independent pipelines previously utilized for wheat genome annotation and then consolidated to produce the RefSeq Annotation v1.0 (fig. S8). Subsequently, a set of manually curated gene models was integrated to build RefSeq Annotation v1.1 (fig. S9 and tables S12 to S17). In total, 107,891 high-confidence (HC) protein-coding loci were identified, with relatively equal distribution across the A, B, and D subgenomes (35,345, 35,643, and 34,212, respectively; Figs. 1D and 2A, fig. S10, and table S18). In addition, 161,537 other protein-coding loci were classified as low-confidence (LC) genes, representing partially supported gene models, gene fragments, and orphans (table S18). A predicted function was assigned to 82.1% (90,919) of HC genes in RefSeq Annotation v1.0 (tables S19 and S20), and evidence for transcription was found for 85% (94,114) of the HC genes versus 49% of the LC genes (23). Within the pseudogene category, 25,419 (8%) of 303,818 candidates matched LC gene models. The D subgenome contained significantly fewer pseudogenes than the A and B subgenomes (81,905 versus 99,754 and 109,097, respectively; χ2 test, P < 2.2 × 10−16) (tables S21 and S22 and fig. S10). In ChrUn, 2691 HC and 675 LC gene models were identified.

Fig. 2 Evaluation of automated gene annotation.

(A) Selected gene prediction statistics of IWGSC RefSeq Annotation v1.1, including number and subgenome distribution of HC and LC genes as well as pseudogenes. (B) BUSCO v3 gene model evaluation comparing IWGSC RefSeq Annotation v1.1 to earlier published bread wheat whole-genome annotations, as well as to annotations of related grass reference-genome sequences. BUSCO provides a measure for the recall of highly conserved gene models.

The quality of the RefSeq Annotation v1.1 gene set was benchmarked against BUSCO v3 (24), representing 1440 Embryophyta near-universal single-copy orthologs and published annotated wheat gene sets (Fig. 2B and fig. S11). Of the BUSCO v3 genes, 99% (1436) were represented in at least one complete copy in RefSeq Annotation v1.1 and 90% (1292) in three complete copies, an improvement over the 25% (353) and 70% (1014) of BUSCO v3 genes that were identified in the IWGSC (6) and TGACv1 (8) gene sets, respectively (Fig. 2B). Improved contiguity of sequences in the immediate vicinity of genes was also found: 61% of the HC and LC genes were flanked by at least 10 kb of sequence without ambigous bases (Ns), in contrast to 37% and only 5% of the HC and LC genes in the TGACv1 and IWGSC CSS gene models, respectively (fig. S12).

To further characterize the gene space, a phylogenomic approach was applied to identify gene homeologs and paralogs between and within the wheat subgenomes and orthologs in other plant genomes (table S23 and figs. S13 to S15). Analysis of a subset of 181,036 genes [“filtered gene set,” (14) and Table 3] comprising 103,757 HC and 77,279 LC genes identified 39,238 homeologous groups—that is, clades of A, B, and D subgenome orthologs deduced from gene trees—containing a total of 113,653 genes (63% of the filtered set). Gene losses or retention and gene gains (gene duplications) were determined for all homeologous loci of IWGSC RefSeq v1.0 (Table 3), assuming the presence of a single gene copy at every homeologous locus (referred to as a “triad”). The percentage of genes in homeologous groups for all configurations (ratios) is highly similar, hence balanced, across the three subgenomes: 63% (A), 61% (B), and 66% (D). The slightly higher percentage of homeologs in the D subgenome, together with the lower number of pseudogenes (table S22), is consistent with its more recent hybridization with the AABB tetraploid genome progenitor. Although most of the genes are present in homeologous groups, only 18,595 (47%) of the groups contained triads with a single gene copy per subgenome (an A:B:D configuration of 1:1:1). Of the groups of homeologous genes, 5673 (15%) exhibited at least one subgenome inparalog, that is, a gene copy resulting from a tandem or a segmental trans duplication (1:1:N A:B:D configuration; N indicates a minimum of one additional paralog per respective subgenome). The three genomes exhibited similar levels of loss of individual homeologs, affecting 10.7% (0:1:1), 10.3% (1:0:1), and 9.5% (1:1:0) of the homeologous groups in the A, B, and D subgenomes, respectively (Table 3 and tables S24 and S25).

Table 3 Groups of homeologous genes in wheat.

Homeologous genes are “subgenome orthologs” and were inferred by species tree reconciliation in the respective gene family. Numbers include both HC and LC genes filtered for TEs (filtered gene set). Conserved subgenome-specific (orphan) genes are found only in one subgenome but have homologs in other plant genomes used in this study. This includes orphan outparalogs resulting from ancestral duplication events and conserved only in one of the subgenomes. Nonconserved orphans are either singletons or duplicated in the respective subgenome, but neither have obvious homologs in the other subgenomes or the other plant genomes studied. Microsynteny is defined as the conservation and collinearity of local gene ordering between orthologous chromosomal regions. Macrosynteny is defined as the conservation of chromosomal location and identity of genetic markers like homeologs but may include the occurrence of local inversions, insertions, or deletions. Additional data are presented in table S24.

View this table:

Of the 67,383 (37%) genes of the filtered set not present in homeologous groups, 31,140 genes also had no orthologs in species included in the comparisons outside of bread wheat and mainly comprised gene fragments, non–protein-coding loci with open reading frames, or other gene-calling artifacts. The remaining 36,243 genes had homologs outside of bread wheat and appeared to be subgenome specific (Table 3). Two of the genes in this category were granule bound starch synthase (GBSS) on chromosome 4A (1:0:0, a gene that is a key determinant of udon noodle quality) and ZIP4 within the pairing homeologous 1 (Ph1) locus on chromosome 5B [0:1:0, a locus critical for the diploid meiotic behavior of the wheat homeologous chromosomes (25)]. The phylogenomic analysis indicated that the GBSS on 4A is a divergent translocated homeolog originally located on chromosome 7B (fig. S16), whereas ZIP4 is a transduplication of a chromosome 3B locus (table S26). Both genes confer important properties on wheat and illustrate the diversity in origin and function of gene models that are not in a 1:1:1 configuration. No evidence was found for biased partitioning. Rather, our analyses support gradual gene loss and gene movement among the subgenomes that may have occurred in either the diploid progenitor species or the tetraploid ancestor or following the final hexaploidization event in modern bread wheat (Table 3 and figs. S24 and S25). Together with the equal contribution of the three homeologous genomes to the overall gene expression (23), this demonstrates the absence of subgenome dominance (26).

Of the bread wheat HC genes, 29,737 (27%) are present as tandem duplicates, which is up to 10% higher than that found for other monocotyledonous species (table S27). Tandemly repeated genes are most prevalent in the B subgenome (29%), contributing to its higher gene content and larger number of 1:N:1 homeologous groups (Table 3). The postulated hybrid origin of the D subgenome, as a result of interspecific crossing with AABB tetraploid genome progenitors 1 to 2 million years after they diverged (27), is consistent with the synonymous substitution rates of homeologous gene pairs (fig. S17). Homeologous groups with gene duplicates in at least one subgenome (1:1:N, 1:N:1, or N:1:1) showed elevated evolutionary rates (for the subgenome carrying the duplicate) as compared with strict 1:1:1 or 1:1 groups (figs. S18 to S22). Homeologs with recent duplicates also showed higher levels of expression divergence (fig. S23), consistent with gene and genome duplications acting as a driver of functional innovation (28, 29).

Analysis of synteny between the seven triplets of homeologous chromosomes showed high levels of conservation. There was no evidence that any major rearrangements occurred since the A, B, and D subgenomes diverged ~5 million years ago (Fig. 1D), although collinearity between homeologs was disturbed by inversions occurring, on average, every 74.8 Mb, involving blocks of 10 genes or more (mean gene number of 48.2 with a mean size of 10.5 Mb) (Fig. 1D and table S28). Macrosynteny was conserved across centromere (C) regions, but collinearity (microsynteny) broke down specifically in these recombination-free, gene-poor regions for all seven sets of homeologous chromosomes (Fig. 1D, figs. S24 to S26, and table S29). Of the 113,653 homeologous genes, 80% (90,232) were found organized in macrosynteny, that is, still present at their ancestral position (table S24). At the microsynteny scale, 72% (82,308) of the homeologs were organized in collinear blocks, that is, intervals with a highly conserved gene order (Fig. 1D). A higher proportion of syntenic genes was found in the interstitial regions [short arm, R2a (18), 46% and long arm, R2b (18), 61%] than in the distal telomeric [short arm, R1 (18), 39% and long arm, R3 (18), 51%] and centromere regions [C (18), 29%], and the interstitial compartments harbored larger syntenic blocks (figs. S27 and S28). The higher proportions of duplicated genes in distal-terminal regions (34 and 27% versus 13 to 15% in the other regions; fig. S29) exerted a strong influence on the decay of syntenic block size and contributed to the higher sequence variability in these regions. Overall, distal chromosomal regions are the preferential targets of meiotic recombination and the fastest evolving compartments. As such, they represent the genomic environment for creating sequence, hence allelic, diversity, providing the basis for adaptability to changing environments.

Atlas of transcription reveals trait-associated gene co-regulation networks

The gene annotation, coupled with identification of homeologs and paralogs in IWGSC RefSeq v1.0, provides a resource to study gene expression in genome-wide and subgenome contexts. A total of 850 RNA-seq samples derived from 32 tissues at different growth stages and/or challenged by different stress treatments were mapped to RefSeq Annotation v1.0 (Fig. 3A, database S1, and tables S30 to S32). Expression was observed for 94,114 (84.9%) HC genes (fig. S30) and for 77,920 (49.1%) LC genes, the latter showing lower expression breadth and level [median six tissues; average 2.9 transcripts per million (tpm)] than the HC genes (median 20 tissues, average 8.2 tpm) (fig. S31). This correlated with the higher average methylation status of LC genes (figs. S32 and S33). A principal component analysis identified tissue (Fig. 3B), rather than growth stage or stress (fig. S34), as the main factor driving differential expression between samples, consistent with studies in other organisms (3033). Of the total number of genes, 31.0% are expressed in more than 90% of tissues (average 16.9 tpm, ≥30 tissues), and 21.5% are expressed in 10% or fewer tissues (average 0.22 tpm, ≤3 tissues; fig. S31).

Fig. 3 Wheat atlas of transcription.

(A) Schematic illustration of a mature wheat plant and high-level tissue definitions for “roots,” “leaves,” “spike,” and “grain” used in the further analysis. (B) Principal component (PC) analysis plots for similarity of overall transcription, with samples colored according to their high-level tissue of origin [as introduced in (A)]. The color key for tissue is shown at the bottom of the figure under (C). (C) Chromosomal distribution of the average expression breadth [number of tissues in which genes are expressed (total number of tissues, n = 32)]. The average (dark orange line) is calculated on the basis of a scaled position of each gene within the corresponding genomic compartment (blue, aqua, and light yellow background) across the 21 chromosomes (orange lines). (D) Heatmap illustrating the expression of a representative gene (eigengene) for the 38 coexpression modules defined by WGCNA. Modules are represented as columns, with the dendrogram illustrating eigengene relatedness. Each row represents one sample. Colored bars to the left indicate the high-level tissue of origin; the color key is shown at the bottom of the figure under (C). DESeq2-normalized expression levels are shown. Modules 1 and 5 (light green boxes) were most correlated with high-level leaf tissue, whereas modules 8 and 11 (dark green boxes) were most correlated with spike. (E) Bar plot of module assignment (same, near, or distant) of homeologous triads and duplets in the WGCNA network. (F) Simplified flowering pathway in polyploid wheat. Genes are colored according to their assignment to leaf (light green)– or spike (dark green)–correlated modules. (G) Excerpt from phylogenetic tree for MADS transcription factors, including known Arabidopsis flowering regulators SEP1, SEP2, and SEP4 (black) (for the full phylogenetic tree, see fig. S38). Green branches represent wheat orthologs of modules 8 and 11, whereas purple branches are wheat orthologs assigned to other modules (0 and 2). Gray branches indicate non-wheat genes.

Of the HC genes, 8231 showed tissue-exclusive expression (fig. S35). About half of these were associated with reproductive tissues (microspores, anther, and stigma or ovary), consistent with observations in rice (34). The tissue-exclusive genes were enriched for response to extracellular stimuli and reproductive processes (database S2). By contrast, 23,146 HC genes expressed across all 32 tissues were enriched for biological processes associated with housekeeping functions such as protein translation and protein metabolic processes. Tissue-specific genes were shorter [1147 ± 8 base pairs (bp)], had fewer exons (2.76 ± 0.3), and were expressed at lower levels (3.4 ± 0.1 tpm) compared with ubiquitous genes (1429 ± 7 bp, 7.87 ± 0.4 exons, and 17.9 ± 0.4 tpm) (fig. S35).

Genes located in distal regions R1 and R3 (fig. S25 and table S29) showed lower expression breadth than those in the proximal regions (15.7 and 20.7 tissues, respectively) (Fig. 3C and fig. S36). This correlated with enrichment of Gene Ontology (GO) slim terms such as “cell cycle,” “translation,” and “photosynthesis” for genes in the proximal regions, whereas genes enriched for “response to stress” and “external stimuli” were found in the highly recombinant distal R1 and R3 regions (database S3, fig. S36, and table S33). The expression breadth pattern was also correlated with the distribution of the repressive H3K27me3 (trimethylated histone H3 lysine 27) (Pearson r = −0.76, P < 2.2 × 10−16) and with the active H3K36me3 and H3K9ac (acetylated H3K9) (Pearson r = 0.9 and 0.83, respectively; P < 2.2 × 10−16) histone marks (fig. S37).

Global patterns of coexpression (35) were determined with a weighted gene coexpression network analysis (WGCNA) on 94,114 expressed HC genes. Of these genes, 58% (54,401) could be assigned to 38 modules (Fig. 3D and database S4), and, consistent with the principal component analysis, tissues were the major driver of module identity (Fig. 3D and figs. S38 to S40). The analysis focused initially on the 9009 triads (syntenic and nonsyntenic) with a 1:1:1 A:B:D relationship and for which all homeologs were assigned to a module. Of the triads, 16.4% had at least one homeolog in a divergent module, with the B homeolog most likely to be divergent (37.4% B-divergent versus 31.7% A-divergent and 30.9% D-divergent triads, χ2 test P = 0.007). However, the expression profiles of most (83.6%) of the triads were relatively consistent with all homeologs in the same (57.6%) or a closely related (26.0%) module. The proportion of homeologs found within the same module was higher than expected, pointing to a highly conserved expression pattern of homeologs across the 850 RNA-seq samples (Fig. 3E and table S34). Triads with at least one gene in a nonsyntenic position had a higher amount of divergent expression patterns compared to syntenic triads (21.2 versus 16.2%, χ2 test P < 0.001) and fewer such triads shared all homeologs in the same module (48.7%) compared to syntenic triads (58.0%, chi-square test P = 0.009). Similar patterns were observed in the 1933 duplets that have a 1:1 relationship between only two homeologs (table S34). These results are consistent with syntenic homeologs showing similar expression patterns, whereas more dramatic changes in chromosome context associate with divergent expression and possible sub- or neofunctionalization. These trends were also found across diverse tissue-specific networks (23).

To explore the potential of the WGCNA network for identifying previously uncharacterized pathways in wheat, a search was undertaken for modules containing known regulators of wheat flowering time [e.g., PPD1 (36) and FT (37); Fig. 3F]. Genes belonging to this pathway were grouped into specific modules. The upstream genes (PHYB, PHYC, PPD1, ELF3, and VRN2) were present mainly in modules 1 and 5 and were most highly correlated with expression in leaf and shoot tissues (0.68 and 0.67, respectively; adjusted P < 1 × 10−108). By contrast, the integrating gene FT and downstream genes VRN1, FUL2, and FUL3 were found in modules 8 and 11, most highly correlated with expression in spikes (0.69 and 0.65, respectively; adjusted P < 1 × 10−101; table S35). The MADS_II transcription factor family that is generally associated with the above pathways was examined more closely, with a focus on the gene tree OG0000041, which contains 54 of the 118 MADS_II genes in wheat. Twenty-four MADS_II genes from modules 8 and 11 were identified within this gene tree, clustering into two main clades along with Arabidopsis and rice orthologs associated with floral patterning (fig. S41 and database S5). Within these clades, other MADS_II genes were found that were not in modules 8 or 11 (Fig. 3G), indicating a different pattern of coexpression. None of the 24 MADS_II genes had a simple 1:1 ortholog in Arabidopsis, suggesting that some wheat orthologs function in flowering (those within modules 8 and 11), whereas others could have developed different functions, despite being phylogenetically closely related. Thus, these data provide a framework to identify and prioritize the most likely functional orthologs of known model system genes within polyploid wheat, to characterize them functionally (38), and to dissect genetic factors controlling important agronomic traits (39, 40). A more detailed analysis of tissue-specific and stress-related networks (23) provides a framework for defining quantitative variation and interactions between homeologs for many agronomic traits (41).

Gene-family expansion and contraction with relevance to wheat traits

Gene duplication and gene-family expansion are important mechanisms of evolution and environmental adaptation, as well as major contributors to phenotypic diversity (42, 43). In a phylogenomic comparative analysis, wheat gene-family size and wheat-specific gene-family expansion and contraction were benchmarked against nine other grass genomes, including five closely related diploid Triticeae species (table S23 and figs. S13 to S15 and S42). A total of 30,597 gene families (groups of orthologous genes traced to a last common ancestor in the evolutionary hierarchy of the compared taxa) were defined, with 26,080 families containing gene members from at least one of the three wheat subgenomes (tables S36 to S39). Among the 8592 expanded wheat gene families (33% of all families), 6216 were expanded in all three A, B, and D subgenomes (24%; either shared with the wild ancestor or specific to bread wheat, Fig. 4A). Another 1109 were expanded in only one of the wheat subgenomes, and 2102 gene families were expanded in either the A or the D genome lineages (Fig. 4A, fig. S43, and table S36). Overall, only 78 gene families were contracted in wheat. The number of gene families that are only expanded in wheat may be overestimated owing to limited completeness of the draft progenitor wheat genome assemblies used in this study (14) (table S39). Gene Ontology (GO; ontology of biomedical terms for the areas “cellular component,” “biological process,” and “molecular function”), Plant Ontology (PO; ontology terms describing anatomical structures and growth and developmental stages across Viridiplantae), and Plant Trait Ontology [TO; ontology of controlled vocabulary to describe phenotypic traits and quantitative trait loci (QTLs) that were physically mapped to a gene in flowering plant species] analyses identified 1169 distinct GO, PO, and TO terms (15% of all assigned terms) enriched in genes belonging to expanded wheat gene families (Fig. 4B and figs. S44 and S45). “A-subgenome” or “A-lineage” expanded gene families showed a bias for terms associated with seed formation [overrepresentation of the TO term “plant embryo morphology” (TO:0000064) and several seed, endosperm, and embryo-developmental GO terms] (fig. S46). Similarly, “B-subgenome” expanded gene families were enriched for TO terms related to plant vegetative growth and development (database S6 and fig. S47). Gene families that were expanded in all wheat subgenomes were enriched for 14 TO terms associated with yield-affecting morphological traits and five terms associated with fertility and abiotic-stress tolerance (Fig. 4B), which was also mirrored by enrichment for GO and PO terms associated with adaptation to abiotic stress (“salt stress” and “cold stress”) and grain yield and quality (“seed maturation,” “dormancy,” and “germination”). The relationship between the patterns of enriched TO, PO, and GO terms for expanded wheat gene families and key characteristics of wheat performance (figs. S45 to S51) provides a resource (database S6) to explore future QTL mapping and candidate gene identification for breeding.

Fig. 4 Gene families of wheat.

(A) Heatmap of expanded and contracted gene families. Columns correspond to the individual gene families. Rows in the top panel illustrate the sets of gene-family expansions (++, red) and contractions (––, blue) found for the wheat A lineage (Triticum urartu and A subgenome); the D lineage (Aegilops tauschii and D subgenome); the A, B, or D subgenomes; or bread wheat (expanded and contracted in all subgenomes). In the latter four categories, expansions and contractions do not imply bread wheat–specific gene copy number variations. Similar dynamics might have remained unobserved in T. urartu or A. tauschii owing to the inherent limitations of the used draft genome assemblies (53, 54). Rows in the bottom panel heatmap (color scheme on z-score scale) indicate the fold expansion and contraction of gene families for the taxa and species included in the analysis [Oryza sativa (Osat), Sorghum bicolor (Sbic), Zea mays (Zmay), Brachypodium distachyon (Bdis), Hordeum vulgare (Hvul1/2), Secale cereale (Scer), A. tauschii (Aetau), T. urartu (Tura), and wheat A (TraesA), B (TraesB), and D (TraesD) subgenomes]. (B) All enriched TO terms for the gene families depicted in (A). Overrepresented TO terms were found for expanded families in bread wheat (all subgenomes, red), the B subgenome (green), and the A lineage (T. urartu and A subgenome, blue) only, respectively. The x axis represents the percentage of genes annotated with the respective TO term that were contained in the gene set in question. The size of the bubbles corresponds to the P (−log10) significance of expansion. (C) Genomic distribution of gene families associated with adaptation to biotic (light and dark blue) or abiotic stress (light and dark pink), RNA metabolism in organelles and male fertility (orange), or end-use quality (light, medium, and dark green). Known positions of agronomically important genes and loci are indicated by red arrows and arrowheads to the left of the chromosome bars. Recombination rates are displayed as heatmaps in the chromosome bars [7.2 cM/Mb (light green) to 0 cM/Mb (black)].

Many gene families with high relevance to wheat breeding and improvement were among the expanded group, and their genomic distribution was analyzed in greater detail (Fig. 4C and figs. S52 to S54). Disease resistance–related NLR (nucleotide-binding site leucine-rich repeat)–like loci and WAK (wall-associated receptor)–like genes were clustered in high numbers at the distal (R1 and R3) regions of all chromosome arms, with NLRs often co-localizing with known disease resistance loci (Fig. 4C). The restorer-of-fertility–like (RFL) subclade of P class pentatricopeptide repeat (PPR) proteins, potentially of interest for hybrid wheat production, comprised 207 genes, nearly threefold more per haploid subgenome than have been identified in any other plant genome analyzed to date (44, 45). They localized mainly as clusters of genes in regions on the group 1, 2, and 6 chromosomes, which carry fertility-restoration QTLs in wheat (Fig. 4C and fig. S54). Within the dehydrin gene family, implicated with drought tolerance in plants, 25 genes that formed well-defined clusters on chromosomes 6A, 6B, and 6D (figs. S53 and S55) showed early increased expression under severe drought stress. As the structural variation in the CBF genes of wheat is known to be associated with winter survival (46), the array of CBF paralogs at the Fr-2 locus (fig. S56) revealed by IWGSC RefSeq v1.0 provides a basis for targeted allele mining for previously uncharacterized CBF haplotypes from highly frost-tolerant wheat genetic resources. Lastly, high levels of expansion and variation in members of grain prolamin gene families [fig. S52 and (47)] either related to the response to heat stress or whose protein epitopes are associated with levels of celiac disease and food allergies (47) provide candidates for future selection in breeding programs. From these few examples, it is evident that flexibility in gene copy numbers within the wheat genome has contributed to the adaptability of wheat to produce high-quality grain in diverse climates and environments (48). Knowledge of the complex picture of the genome-wide distribution of gene families (Fig. 4C), which needs to be considered for selection in breeding programs in the context of distribution of recombination and allelic diversity, can now be applied to wheat improvement strategies. This is especially true if “must-have traits” that are allocated in chromosomal compartments with highly contrasting characteristics are fixed in repulsion or are found only in incompatible gene pools of the respective breeding germplasm.

Rapid trait improvement using physically resolved markers and genome editing

The selection and modification of genetic variation underlying agronomic traits in breeding programs is often complicated if phenotypic selection depends on the expression of multiple loci with quantitative effects that can be strongly influenced by the environment. This dilemma can be overcome if DNA markers in strong linkage disequilibrium with the phenotype are identified through forward genetic approaches or if the underlying genes can be targeted through genome editing. The potential for IWGSC RefSeq v1.0, together with the detailed genome annotation, to accelerate the identification of potential candidate genes underlying important agronomic traits was exemplified for two targets. A forward genetics approach was used to fully resolve a QTL for stem solidness (SSt1) conferring resistance to drought stress and insect damage (49) that was disrupted in previous wheat assemblies by a lack of scaffold ordering and annotation, partial assembly, and/or incomplete gene models (fig. S57 and tables S40 and S41). In IWGSC RefSeq v1.0, SSt1 contains 160 HC genes (table S42), of which 26 were differentially expressed (DESeq2, Benjamini-Hochberg adjusted P < 0.01) between wheat lines with contrasting phenotypes. One of the differentially expressed genes, TraesCS3B01G608800, was present as a single copy in IWGSC RefSeq v1.0 but showed copy number variation associated with stem solidness in a diverse panel of hexaploid cultivars (Fig. 5A, fig. S58, and table S43). Using IWGSC RefSeq v1.0, we developed a diagnostic SNP marker physically linked to the copy number variation that has been deployed to select for stem solidness in wheat breeding programs (Fig. 5B).

Fig. 5 IWGSC RefSeq v1.0–guided dissection of SSt1 and TaAGL33.

(A) The Lillian-Vesper population genetic map was anchored to IWGSC RefSeq v1.0 (left), and differentially expressed genes were identified between solid- and hollow-stemmed lines of hexaploid (bread) and tetraploid (durum) wheat (right). (B) Cross-sectioned stems of Lillian (solid) and Vesper (hollow) are shown as a phenotypic reference (top). Increased copy number of TraesCS3B01G608800 [annotated as a DOF (DNA-binding one-zinc finger) transcription factor] is associated with stem phenotypic variation (bottom). (C) A high-throughput SNP marker tightly linked to TraesCS3B01G608800 reliably discriminates solid- from hollow-stemmed wheat lines. Relative intensity of the fluorophores (FAM and HEX) used in KASPar analysis are shown. Vertical axis shows FAM signal; horizontal axis shows HEX signal. (D) Schematic of the three TaAGL33 proteins, showing the typical MADS, I, K, and C domains. Triangles indicate the position of the five introns that occur in all three homeologs. Bars indicate the position of single-guide RNAs designed for exons 2 and 3. Three T-DNA vectors—each containing the bar selectable marker gene, CRISPR nuclease, and one of three single-guide RNA sequences—were used for Agrobacterium-mediated wheat transformation, essentially as described earlier (55). Transgenic plants were obtained with edits at the targeted positions in all TaAGL33 homeologs. The putatively resulting protein sequence is displayed starting close to the edits, with wild-type amino acids (aa) in black font and amino acids resulting from the induced frame shifts in red font. * indicates premature termination codons. (E) Mean days to flowering (after 8 weeks of vernalization) for progeny of four homozygous edited plants (light gray bars) and the respective homozygous wild-type segregants (dark gray bars). Numbers in parentheses refer to the number of edited and wild-type plants examined, respectively. Error bars display SEM. Growth conditions were as described in (50).

Knowledge from model species can also be used to annotate genes and provide a route to trait enhancement through reverse genetics. The approach here targeted flowering time, which is important for crop adaptation to diverse environments and is well studied in model plants. Six wheat homologs of the FLOWERING LOCUS C (FLC) gene have been identified as having a role in the vernalization response, a critical process regulating flowering time (50). IWGSC RefSeq v1.0 was used to refine the annotation of these six sequences to identify four HC genes and then to design guide RNAs to specifically target, with CRISPR-Cas9–based gene editing, one of these genes, TaAGL33, on all subgenomes [TraesCS3A01G435000 (A), TraesCS3B01G470000 (B), and TraesCS3D01G428000 (D)] [Fig. 5C and (14)]. Editing was obtained at the targeted gene and led to truncated proteins after the MADS box through small deletions and insertions (Fig. 5D). Expression of all homeologs was high before vernalization, dropped during vernalization, and remained low post-vernalization, implying a role for this gene in flowering control. This expression pattern was not strongly affected by the genome edits (fig. S59). Plants with the editing events in the D subgenome flowered 2 to 3 days earlier than controls (Fig. 5E). Further refinement should help to fully elucidate the importance of the TaAGL33 gene for vernalization in monocots. These results exemplify how the IWGSC RefSeq v1.0 could accelerate the development of diagnostic markers and the design of targets for genome editing for traits relevant to breeding.

Conclusions

IWGSC RefSeq v1.0 is a resource that has the potential for disruptive innovation in wheat improvement. By necessity, breeders work with the genome at the whole-chromosome level, as each new cross involves the modification of genome-wide gene networks that control the expression of complex traits such as yield. With the annotated and ordered reference genome sequence in place, researchers and breeders can now easily access sequence-level information to define changes in the genomes of lines in their programs. Although several hundred wheat QTLs have been published, only a small number of genes have been cloned and functionally characterized. IWGSC RefSeq v1.0 underpins immediate application by providing access to regulatory regions, and it will serve as the backbone to anchor all known QTLs to one common annotated reference. Combining this knowledge with the distribution of meiotic recombination frequency and genomic diversity will enable breeders to more efficiently tackle the challenges imposed by the need to balance the parallel selection processes for adaptation to biotic and abiotic stress, end-use quality, and yield improvement. Strategies can now be defined more precisely to bring desirable alleles into coupling phase, especially in less-recombinant regions of the wheat genome. Here the full potential of the newly available genome information may be realized through the implementation of DNA-marker platforms and targeted breeding technologies, including genome editing (51).

Methods summary

Whole-genome sequencing of cultivar Chinese Spring by short-read sequencing-by-synthesis provided the data for de novo genome assembly and scaffolding with the software package DenovoMAGIC2. The assembly was superscaffolded and anchored into 21 pseudomolecules with high-density genetic (POPSEQ) and physical (Hi-C and 21 chromosome-specific physical maps) mapping information and by integrating additional genomic resources. Validation of the assembly used independent genetic (de novo genotyping-by-sequencing maps) and physical mapping evidence (radiation hybrid maps, BioNano “optical maps” for group 7 homeologous chromosomes). The genome assembly was annotated for genes, repetitive DNA, and other genomic features, and in-depth comparative analyses were carried out to analyze the distribution of genes, recombination, position, and size of centromeres and the expansion and contraction of wheat gene families. An atlas of wheat gene transcription was built from an extensive panel of 850 independent transcriptome datasets and was then used to study gene coexpression networks. Furthermore, the assembly was used for the dissection of an important stem-solidness QTL and to design targets for genome editing of genes implicated in flowering-time control in wheat. Detailed methodological procedures are described in the supplementary materials.

Supplementary Materials

www.sciencemag.org/content/361/6403/eaar7191/suppl/DC1

Materials and Methods

Figs. S1 to S59

Tables S1 to S43

References (56186)

Databases S1 to S5

  • Authorship of this paper should be cited as “International Wheat Genome Sequencing Consortium” (IWGSC, 2018).

References and Notes

  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.See supplementary materials.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.
  37. 37.
  38. 38.
  39. 39.
  40. 40.
  41. 41.
  42. 42.
  43. 43.
  44. 44.
  45. 45.
  46. 46.
  47. 47.
  48. 48.
  49. 49.
  50. 50.
  51. 51.
  52. 52.
  53. 53.
  54. 54.
  55. 55.
  56. 56.
  57. 57.
  58. 58.
  59. 59.
  60. 60.
  61. 61.
  62. 62.
  63. 63.
  64. 64.
  65. 65.
  66. 66.
  67. 67.
  68. 68.
  69. 69.
  70. 70.
  71. 71.
  72. 72.
  73. 73.
  74. 74.
  75. 75.
  76. 76.
  77. 77.
  78. 78.
  79. 79.
  80. 80.
  81. 81.
  82. 82.
  83. 83.
  84. 84.
  85. 85.
  86. 86.
  87. 87.
  88. 88.
  89. 89.
  90. 90.
  91. 91.
  92. 92.
  93. 93.
  94. 94.
  95. 95.
  96. 96.
  97. 97.
  98. 98.
  99. 99.
  100. 100.
  101. 101.
  102. 102.
  103. 103.
  104. 104.
  105. 105.
  106. 106.
  107. 107.
  108. 108.
  109. 109.
  110. 110.
  111. 111.
  112. 112.
  113. 113.
  114. 114.
  115. 115.
  116. 116.
  117. 117.
  118. 118.
  119. 119.
  120. 120.
  121. 121.
  122. 122.
  123. 123.
  124. 124.
  125. 125.
  126. 126.
  127. 127.
  128. 128.
  129. 129.
  130. 130.
  131. 131.
  132. 132.
  133. 133.
  134. 134.
  135. 135.
  136. 136.
  137. 137.
  138. 138.
  139. 139.
  140. 140.
  141. 141.
  142. 142.