Research Article

Shifting the limits in wheat research and breeding using a fully annotated reference genome

Science  17 Aug 2018:
Vol. 361, Issue 6403, eaar7191
DOI: 10.1126/science.aar7191
  • Wheat genome deciphered, assembled, and ordered.

    Seeds, or grains, are what counts with respect to wheat yields (left panel), but all parts of the plant contribute to crop performance. With complete access to the ordered sequence of all 21 wheat chromosomes, the context of regulatory sequences, and the interaction network of expressed genes—all shown here as a circular plot (right panel) with concentric tracks for diverse aspects of wheat genome composition—breeders and researchers now have the ability to rewrite the story of wheat crop improvement. Details on value ranges underlying the concentric heatmaps of the right panel are provided in the full article online.

  • Fig. 1 Structural, functional, and conserved synteny landscape of the 21 wheat chromosomes.

    (A) Circular diagram showing genomic features of wheat. The tracks toward the center of the circle display (a) chromosome name and size (100-Mb tick size; light gray bar indicates the short arm and dark gray indicates the long arm of the chromosome); (b) dimension of chromosomal segments R1, R2a, C, R2b, and R3 [(18) and table S29]; (c) K-mer 20-frequencies distribution; (d) LTR-retrotransposons density; (e) pseudogenes density (0 to 130 genes per Mb); (f) density of HC gene models (0 to 32 genes per Mb); (g) density of recombination rate; and (h) SNP density. Connecting lines in the center of the diagram highlight homeologous relationships of chromosomes (blue lines) and translocated regions (green lines). (B) Distribution of Pfam domain PF08284 “retroviral aspartyl protease” signatures across the different wheat chromosomes. (C) Positioning of the centromere in the 2D pseudomolecule. Top panel shows density of CENH3 ChIP-seq data along the wheat chromosome. Bottom panel shows distribution and proportion of the total pseudomolecule sequence composed of TEs of the Cereba and Quinta families. The bar below the bottom panel indicates pseudomolecule scaffolds assigned to the short (black) or long (blue) arm on the basis of CSS data (6) mapping. (D) Dot-plot visualization of collinearity between homeologous chromosomes 3A and 3B in relation to distribution of gene density and recombination frequency (left and bottom panel boxes: blue and purple lines, respectively). Chromosomal zones R1, R2a, C, R2b, and R3 are colored as in (A). cM, centimorgan.

  • Fig. 2 Evaluation of automated gene annotation.

    (A) Selected gene prediction statistics of IWGSC RefSeq Annotation v1.1, including number and subgenome distribution of HC and LC genes as well as pseudogenes. (B) BUSCO v3 gene model evaluation comparing IWGSC RefSeq Annotation v1.1 to earlier published bread wheat whole-genome annotations, as well as to annotations of related grass reference-genome sequences. BUSCO provides a measure for the recall of highly conserved gene models.

  • Fig. 3 Wheat atlas of transcription.

    (A) Schematic illustration of a mature wheat plant and high-level tissue definitions for “roots,” “leaves,” “spike,” and “grain” used in the further analysis. (B) Principal component (PC) analysis plots for similarity of overall transcription, with samples colored according to their high-level tissue of origin [as introduced in (A)]. The color key for tissue is shown at the bottom of the figure under (C). (C) Chromosomal distribution of the average expression breadth [number of tissues in which genes are expressed (total number of tissues, n = 32)]. The average (dark orange line) is calculated on the basis of a scaled position of each gene within the corresponding genomic compartment (blue, aqua, and light yellow background) across the 21 chromosomes (orange lines). (D) Heatmap illustrating the expression of a representative gene (eigengene) for the 38 coexpression modules defined by WGCNA. Modules are represented as columns, with the dendrogram illustrating eigengene relatedness. Each row represents one sample. Colored bars to the left indicate the high-level tissue of origin; the color key is shown at the bottom of the figure under (C). DESeq2-normalized expression levels are shown. Modules 1 and 5 (light green boxes) were most correlated with high-level leaf tissue, whereas modules 8 and 11 (dark green boxes) were most correlated with spike. (E) Bar plot of module assignment (same, near, or distant) of homeologous triads and duplets in the WGCNA network. (F) Simplified flowering pathway in polyploid wheat. Genes are colored according to their assignment to leaf (light green)– or spike (dark green)–correlated modules. (G) Excerpt from phylogenetic tree for MADS transcription factors, including known Arabidopsis flowering regulators SEP1, SEP2, and SEP4 (black) (for the full phylogenetic tree, see fig. S38). Green branches represent wheat orthologs of modules 8 and 11, whereas purple branches are wheat orthologs assigned to other modules (0 and 2). 
    Gray branches indicate non-wheat genes.

  • Fig. 4 Gene families of wheat.

    (A) Heatmap of expanded and contracted gene families. Columns correspond to the individual gene families. Rows in the top panel illustrate the sets of gene-family expansions (++, red) and contractions (––, blue) found for the wheat A lineage (Triticum urartu and A subgenome); the D lineage (Aegilops tauschii and D subgenome); the A, B, or D subgenomes; or bread wheat (expanded and contracted in all subgenomes). In the latter four categories, expansions and contractions do not imply bread wheat–specific gene copy number variations. Similar dynamics might have remained unobserved in T. urartu or A. tauschii owing to the inherent limitations of the used draft genome assemblies (53, 54). Rows in the bottom panel heatmap (color scheme on z-score scale) indicate the fold expansion and contraction of gene families for the taxa and species included in the analysis [Oryza sativa (Osat), Sorghum bicolor (Sbic), Zea mays (Zmay), Brachypodium distachyon (Bdis), Hordeum vulgare (Hvul1/2), Secale cereale (Scer), A. tauschii (Aetau), T. urartu (Tura), and wheat A (TraesA), B (TraesB), and D (TraesD) subgenomes]. (B) All enriched TO terms for the gene families depicted in (A). Overrepresented TO terms were found for expanded families in bread wheat (all subgenomes, red), the B subgenome (green), and the A lineage (T. urartu and A subgenome, blue) only, respectively. The x axis represents the percentage of genes annotated with the respective TO term that were contained in the gene set in question. The size of the bubbles corresponds to the P (−log10) significance of expansion. (C) Genomic distribution of gene families associated with adaptation to biotic (light and dark blue) or abiotic stress (light and dark pink), RNA metabolism in organelles and male fertility (orange), or end-use quality (light, medium, and dark green). Known positions of agronomically important genes and loci are indicated by red arrows and arrowheads to the left of the chromosome bars. 
    Recombination rates are displayed as heatmaps in the chromosome bars [7.2 cM/Mb (light green) to 0 cM/Mb (black)].

  • Fig. 5 IWGSC RefSeq v1.0–guided dissection of SSt1 and TaAGL33.

    (A) The Lillian-Vesper population genetic map was anchored to IWGSC RefSeq v1.0 (left), and differentially expressed genes were identified between solid- and hollow-stemmed lines of hexaploid (bread) and tetraploid (durum) wheat (right). (B) Cross-sectioned stems of Lillian (solid) and Vesper (hollow) are shown as a phenotypic reference (top). Increased copy number of TraesCS3B01G608800 [annotated as a DOF (DNA-binding one-zinc finger) transcription factor] is associated with stem phenotypic variation (bottom). (C) A high-throughput SNP marker tightly linked to TraesCS3B01G608800 reliably discriminates solid- from hollow-stemmed wheat lines. Relative intensity of the fluorophores (FAM and HEX) used in KASPar analysis are shown. Vertical axis shows FAM signal; horizontal axis shows HEX signal. (D) Schematic of the three TaAGL33 proteins, showing the typical MADS, I, K, and C domains. Triangles indicate the position of the five introns that occur in all three homeologs. Bars indicate the position of single-guide RNAs designed for exons 2 and 3. Three T-DNA vectors—each containing the bar selectable marker gene, CRISPR nuclease, and one of three single-guide RNA sequences—were used for Agrobacterium-mediated wheat transformation, essentially as described earlier (55). Transgenic plants were obtained with edits at the targeted positions in all TaAGL33 homeologs. The putatively resulting protein sequence is displayed starting close to the edits, with wild-type amino acids (aa) in black font and amino acids resulting from the induced frame shifts in red font. * indicates premature termination codons. (E) Mean days to flowering (after 8 weeks of vernalization) for progeny of four homozygous edited plants (light gray bars) and the respective homozygous wild-type segregants (dark gray bars). Numbers in parentheses refer to the number of edited and wild-type plants examined, respectively. Error bars display SEM. Growth conditions were as described in (50).

  • Table 1 Assembly statistics of IWGSC RefSeq v1.0.

    Assembly characteristics                                    Values
    Assembly size                                               14.5 Gb
    Number of scaffolds                                         138,665
    Size of assembly in scaffolds ≥ 100 kb                      14.2 Gb
    Number of scaffolds ≥ 100 kb                                4,443
    N50 contig length                                           51.8 kb
    Contig L50 number                                           81,427
    N90 contig length                                           11.7 kb
    Contig L90 number                                           294,934
    Largest contig                                              580.5 kb
    Ns in contigs                                               0
    N50 scaffold length                                         7.0 Mb
    Scaffold L50 number                                         571
    N90 scaffold length                                         1.2 Mb
    Scaffold L90 number                                         2,390
    Largest scaffold                                            45.8 Mb
    Ns in scaffolds                                             261.9 Mb
    Gaps filled with BAC sequences                              183 (1.7 Mb)
    Average size of inserted BAC sequence                       9.5 kb
    N50 superscaffold length                                    22.8 Mb
    Superscaffold L50 number                                    166
    N90 superscaffold length                                    4.1 Mb
    Superscaffold L90 number                                    718
    Largest superscaffold                                       165.9 Mb
    Sequence assigned to chromosomes                            14.1 Gb (96.8%)
    Sequence ≥ 100 kb assigned to chromosomes                   14.1 Gb (99.1%)
    Number of superscaffolds on chromosomes                     1,601
    Number of oriented superscaffolds                           1,243
    Length of oriented sequence                                 13.8 Gb (95%)
    Length of oriented sequence ≥ 100 kb                        13.8 Gb (97.3%)
    Smallest number of superscaffolds per subgenome chromosome  35 (7A), 68 (2B), 36 (1D)
    Largest number of superscaffolds per subgenome chromosome   111 (4A), 176 (3B), 90 (3D)
    Average number of superscaffolds per chromosome             76

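
The N50/L50 and N90/L90 values in Table 1 are standard contiguity statistics. As an illustration only, the following minimal Python sketch shows one common way to compute them from a list of sequence lengths. The function name `nx_lx` and the toy lengths are hypothetical and are not taken from the IWGSC pipeline.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths in any consistent unit.

    Nx is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least `percent` of the total length.
    Lx is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("percent must be between 0 and 100")


# Hypothetical toy assembly of seven scaffolds (lengths in kb), not IWGSC data.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths, 50)
n90, l90 = nx_lx(toy_lengths, 90)
print(f"N50 = {n50} kb, L50 = {l50}")
print(f"N90 = {n90} kb, L90 = {l90}")
```

In this toy case, two scaffolds are enough to cover half of the 300 kb total, so L50 is 2 and N50 is the length of the second-longest scaffold.
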
  • Table 2 Relative proportions of the major elements of the wheat genome.

    Proportions of TEs are given as the percentage of sequences assigned to each superfamily relative to genome size. Abbreviations in parentheses under the headings “Class 1” and “Class 2” indicate transposon types.

    Major elements                                    Wheat subgenome
                                                      AA        BB        DD        Total
    Assembled sequence assigned to chromosomes (Gb)   4.935     5.180     3.951     14.066
    Size of TE-related sequences (Gb)                 4.240     4.388     3.285     11.913
    TEs (%)                                           85.9      84.7      83.1      84.7
    Class 1
      LTR-retrotransposons
        Gypsy (RLG)                                   50.8      46.8      41.4      46.7
        Copia (RLC)                                   17.4      16.2      16.3      16.7
        Unclassified LTR-retrotransposons (RLX)       2.6       3.5       3.7       3.2
      Non-LTR-retrotransposons
        Long interspersed nuclear elements (RIX)      0.81      0.96      0.93      0.90
        Short interspersed nuclear elements (SIX)     0.01      0.01      0.01      0.01
    Class 2
      DNA transposons
        CACTA (DTC)                                   12.8      15.5      19.0      15.5
        Mutator (DTM)                                 0.30      0.38      0.48      0.38
        Unclassified with terminal inverted repeats   0.21      0.20      0.22      0.21
        Harbinger (DTH)                               0.15      0.16      0.18      0.16
        Mariner (DTT)                                 0.14      0.16      0.17      0.16
        Unclassified class 2                          0.05      0.08      0.05      0.06
        hAT (DTA)                                     0.01      0.01      0.01      0.01
      Helitrons (DHH)                                 0.0046    0.0044    0.0036    0.0042
    Unclassified repeats                              0.55      0.85      0.63      0.68
    Coding DNA                                        0.89      0.89      1.11      0.95
    Unannotated DNA                                   13.2      14.4      15.7      14.4
    (Pre)-microRNAs                                   0.039     0.057     0.046     0.047
    tRNAs                                             0.0056    0.0050    0.0068    0.0057

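
The "TEs (%)" row of Table 2 can be reproduced from the two rows above it, because proportions are expressed relative to the assembled sequence assigned to chromosomes. The short Python sketch below is a worked check using only values from Table 2. The helper name `te_percent` is hypothetical.

```python
# Values copied from Table 2 (Gb), keyed by subgenome.
assembled_gb = {"A": 4.935, "B": 5.180, "D": 3.951, "Total": 14.066}
te_related_gb = {"A": 4.240, "B": 4.388, "D": 3.285, "Total": 11.913}


def te_percent(te_size_gb, assembled_size_gb):
    """Percentage of the assembled sequence made up of TE-related sequence."""
    return 100.0 * te_size_gb / assembled_size_gb


# Round to one decimal place, as reported in the table.
percentages = {
    key: round(te_percent(te_related_gb[key], assembled_gb[key]), 1)
    for key in assembled_gb
}

for key, value in percentages.items():
    print(f"{key}: {value}% TEs")

# The subgenome sizes should also add up to the reported total.
subgenome_sum = sum(assembled_gb[k] for k in ("A", "B", "D"))
print(f"A + B + D = {subgenome_sum:.3f} Gb")
```

The rounded percentages agree with the values reported in the table for the A, B, and D subgenomes and for the total.
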
  • Table 3 Groups of homeologous genes in wheat.

    Homeologous genes are “subgenome orthologs” and were inferred by species tree reconciliation in the respective gene family. Numbers include both HC and LC genes filtered for TEs (filtered gene set). Conserved subgenome-specific (orphan) genes are found only in one subgenome but have homologs in other plant genomes used in this study. This includes orphan outparalogs resulting from ancestral duplication events and conserved only in one of the subgenomes. Nonconserved orphans are either singletons or duplicated in the respective subgenome, but neither have obvious homologs in the other subgenomes or the other plant genomes studied. Microsynteny is defined as the conservation and collinearity of local gene ordering between orthologous chromosomal regions. Macrosynteny is defined as the conservation of chromosomal location and identity of genetic markers like homeologs but may include the occurrence of local inversions, insertions, or deletions. Additional data are presented in table S24.

    Homeologous group (A:B:D)                  Number in      Composition    Number of      Number of      Number of      Total number
                                               wheat genome   of groups (%)  genes in A     genes in B     genes in D     of genes
    1:1:1                                      21,603         55.1           21,603         21,603         21,603         64,809
    1:1:N                                      644            1.6            644            644            1,482          2,770
    1:N:1                                      998            2.5            998            2,396          998            4,392
    N:1:1                                      761            1.9            1,752          761            761            3,274
    1:1:0                                      3,708          9.5            3,708          3,708          0              7,416
    1:0:1                                      4,057          10.3           4,057          0              4,057          8,114
    0:1:1                                      4,197          10.7           0              4,197          4,197          8,394
    Other ratios                               3,270          8.3            4,999          5,371          4,114          14,484
    1:1:1 in microsynteny                      18,595         47.4           18,595         18,595         18,595         55,785
    Total in microsynteny                      30,339         77.3           27,240         27,063         28,005         82,308
    1:1:1 in macrosynteny                      19,701         50.2           19,701         19,701         19,701         59,103
    Total in macrosynteny                      32,591         83.1           29,064         30,615         30,553         90,232
    Total in homeologous groups                39,238         100.0          37,761         38,680         37,212         113,653
    Conserved subgenome orphans                -              -              12,412         12,987         10,844         36,243
    Nonconserved subgenome singletons          -              -              10,084         12,185         8,679          30,948
    Nonconserved subgenome duplicated orphans  -              -              71             83             38             192
    Total (filtered)                           -              -              60,328         63,935         56,773         181,036
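
The "Composition of groups (%)" column in Table 3 follows from the group counts: each value is the number of groups with a given A:B:D ratio divided by the total number of homeologous groups. The following Python sketch is a worked check on the table's own numbers. Variable names are illustrative only.

```python
# Number of homeologous groups per A:B:D ratio, copied from Table 3.
group_counts = {
    "1:1:1": 21603,
    "1:1:N": 644,
    "1:N:1": 998,
    "N:1:1": 761,
    "1:1:0": 3708,
    "1:0:1": 4057,
    "0:1:1": 4197,
    "Other ratios": 3270,
}

# Total number of homeologous groups (reported as 39,238 in the table).
total_groups = sum(group_counts.values())

# Share of each ratio class, rounded to one decimal place as in the table.
composition = {
    ratio: round(100 * count / total_groups, 1)
    for ratio, count in group_counts.items()
}

# A 1:1:1 group (triad) contributes one gene per subgenome, so three genes.
triad_genes = 3 * group_counts["1:1:1"]

print(f"Total groups: {total_groups}")
for ratio, share in composition.items():
    print(f"{ratio}: {share}%")
print(f"Genes in 1:1:1 triads: {triad_genes}")
```

The eight ratio classes add up to the reported total of 39,238 groups, and the triads account for 64,809 genes, matching the first row of the table.
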

Supplementary Materials

  • Shifting the limits in wheat research and breeding using a fully annotated reference genome

    International Wheat Genome Sequencing Consortium (IWGSC)

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Materials and Methods 
    • Figs. S1 to S59 
    • Tables S1 to S43 
    • Captions for Databases S1 to S5 
    • References 
    • Data S1: Metadata of 850 RNAseq samples used in the study.
    • Data S2: SlimGO ubiquitous and Tissue-exclusive genes.
    • Data S3: GO terms of the WGCNA850 analysis.
    • Data S4: WGCNA Module Assignment.
    • Data S5: Module 8 and 11 TF Arabidopsis and rice orthologs.
