Research Article

The transcriptional landscape of polyploid wheat

See allHide authors and affiliations

Science  17 Aug 2018:
Vol. 361, Issue 6403, eaar6089
DOI: 10.1126/science.aar6089

Insights from the annotated wheat genome

Wheat is one of the major sources of food for much of the world. However, because bread wheat's genome is a large hybrid mix of three separate subgenomes, it has been difficult to produce a high-quality reference sequence. Using recent advances in sequencing, the International Wheat Genome Sequencing Consortium presents an annotated reference genome with a detailed analysis of gene content among subgenomes and the structural organization for all the chromosomes. Examples of quantitative trait mapping and CRISPR-based genome modification show the potential for using this genome in agricultural research and breeding. Ramírez-González et al. exploited the fruits of this endeavor to identify tissue-specific biased gene expression and coexpression networks during development and exposure to stress. These resources will accelerate our understanding of the genetic basis of bread wheat.

Science, this issue p. eaar7191; see also p. eaar6089

Structured Abstract

INTRODUCTION

Polyploidy, arising from whole-genome duplication or interspecific hybridization, is ubiquitous across the plant and fungal kingdoms. The presence of highly related genes in polyploids, referred to as homoeologs, has been proposed to confer adaptive plasticity—for example, through neofunctionalization of duplicated genes or tissue-specific expression. This plasticity has facilitated the domestication and adaptation of major polyploid crops (e.g., wheat, cotton, and coffee). However, despite its likely importance, we have a limited understanding of the effect of polyploidy on gene expression and the extent to which homoeologs are similar or different in their expression patterns across development and tissues.

RATIONALE

Bread wheat is a polyploid derived from the hybridizations between three distinct diploid species and is an informative system for analyzing the effects of recent polyploidy on gene expression. Understanding the coordination of homoeologs and identifying the mechanisms associated with these processes should help define strategies to improve trait biology in a crop that provides more than 20% of the protein and caloric intake of humans.

RESULTS

Here we leverage 850 wheat RNA-sequencing samples, alongside the annotated genome, to determine the similarities and differences between homoeolog expression across a range of tissues, developmental stages, and cultivars. On average, ~30% of wheat homoeolog triads (composed of A, B, and D genome copies) showed nonbalanced expression patterns, with higher or lower expression from a single homoeolog with respect to the other two. These differences between homoeologs were associated with epigenetic changes affecting DNA methylation and histone modifications. Although nonbalanced homoeolog expression could be partially predicted by expression in diploid ancestors, large changes in relative homoeolog expression were observed owing to polyploidization.

Our results suggest that the transposable elements in promoters relate more closely to the variation in the relative expression of homoeologs across tissues than to a ubiquitous effect across all tissues. We found that homoeologs with the highest inter-tissue variation had promoters with more frequent transposable element insertions and more varied cis-regulatory elements than homoeologs that were stable across tissues. We also identified expression asymmetry along wheat chromosomes. Homoeologs with the largest inter-tissue, inter-cultivar, and coding sequence variation were most often located in the highly recombinogenic distal ends of chromosomes. These transcriptionally dynamic homoeologs are under more relaxed selection pressure, potentially representing the first steps toward functional innovation through neo- or subfunctionalization.

We generated tissue- and stress-specific coexpression networks that reveal extensive coordination of homoeolog expression throughout development. These networks, alongside detailed gene expression atlases (www.wheat-expression.com and http://bar.utoronto.ca), lay the groundwork to identify candidate genes influencing agronomic traits in wheat.

CONCLUSION

This study provides detailed insights into the transcriptional landscape of bread wheat, an evolutionarily young polyploid. Our work shows that homoeolog expression patterns in bread wheat have been shaped by polyploidy and are associated with both epigenetic modifications and variation in transposable elements within promoters of homoeologous genes. The extensive datasets and analyses presented here provide a framework that can help researchers and breeders develop strategies to improve crops by manipulating individual or multiple homoeologs to modulate trait responses.

Homoeolog expression patterns in polyploid wheat.

Seventy percent of triads (A, B, and D homoeologs) show balanced expression among homoeologs and are ubiquitously expressed (left), whereas ~30% show nonbalanced expression and are more tissue-specific (right; symbolized by three exemplar tissues). Variation in promoter elements and nonsynonymous substitution rates distinguish between individual triads with stable relative expression across tissues and triads with more inter-tissue variation (tissue-dynamic triads).

Abstract

The coordinated expression of highly related homoeologous genes in polyploid species underlies the phenotypes of many of the world’s major crops. Here we combine extensive gene expression datasets to produce a comprehensive, genome-wide analysis of homoeolog expression patterns in hexaploid bread wheat. Bias in homoeolog expression varies between tissues, with ~30% of wheat homoeologs showing nonbalanced expression. We found expression asymmetries along wheat chromosomes, with homoeologs showing the largest inter-tissue, inter-cultivar, and coding sequence variation, most often located in high-recombination distal ends of chromosomes. These transcriptionally dynamic genes potentially represent the first steps toward neo- or subfunctionalization of wheat homoeologs. Coexpression networks reveal extensive coordination of homoeologs throughout development and, alongside a detailed expression atlas, provide a framework to target candidate genes underpinning agronomic traits in wheat.

Polyploidy arises from whole-genome duplication or interspecific hybridization and is ubiquitous in eukaryotic plant and fungal lineages. Polyploidy has been proposed to confer adaptive plasticity, thereby shaping the evolution of plants, fungi, and, to a lesser degree, animals (1, 2). This plasticity has facilitated the domestication and adaptation of several major crop species (3), including hexaploid bread wheat (Triticum aestivum; AABBDD subgenome), which is derived from relatively recent interspecific hybridizations between three different diploid species. In such polyploids, gene duplication alters the transcriptional landscape (4) by providing additional flexibility to adapt and evolve new patterns of gene expression for homoeologous gene copies (5). This flexibility has been suggested to be an important mechanism for controlling adaptive traits (6, 7)—for example, through neofunctionalization of duplicated genes (8) or tissue-specific expression (9). However, despite the likely importance of polyploidy in affecting gene expression, we have a limited understanding of the extent to which homoeologs resemble or differ from each other in their expression patterns, the spatiotemporal dynamics of these relationships, and how epistatic interactions between individual homoeologs affect biological traits. The new genomic resources available for wheat (10), along with its meiotic stability (11) and syntenic gene order (12), make it a particularly informative system for gaining insight into the effects of recent polyploidy on gene expression.

In this study, we leveraged available RNA sequencing (RNA-seq) data (529 samples from 28 studies) and added 321 samples to explore global gene expression in hexaploid wheat across a diverse range of tissues, developmental stages, cultivars, and environmental conditions (13). We organized these sets of RNA-seq samples into partially overlapping datasets from (i) a single developmental time course experiment (n = 209 samples), (ii) the reference accession Chinese Spring (CS) under nonstress conditions (n = 123 samples), (iii) four main tissue types under nonstress conditions (n = 537 samples), and (iv) seedling samples from abiotic (n = 50) and biotic (n = 163) stress experiments including controls (table S1). These datasets, alongside a complete and annotated genome and transcriptome (10), provide an opportunity to conduct homoeolog-specific transcriptome profiling and to generate gene regulatory networks to better understand the spatiotemporal coordination of individual homoeologs underlying trait biology on a genome-wide scale.

A developmental gene expression atlas in polyploid wheat

We first assessed expression patterns through a developmental time course of the commercial wheat cultivar Azhurnaya, including 209 RNA-seq samples representing 22 tissue types from grain, root, leaf, and spike samples across multiple time points (Fig. 1). We quantified expression using pseudoalignment of RNA-seq reads to the RefSeqv1.0 transcriptome, as implemented in kallisto (14), which accurately quantifies reads in a homoeolog-specific manner in polyploid wheat (13, 15) (figs. S1 and S2). We found evidence of expression for 83,741 (75.6% of 110,790) high-confidence genes, on the basis of expression of >0.5 transcripts per million (TPM) in at least one of the 22 tissue types, and we conducted complexity (table S2) and differential expression analyses (fig. S3). Tissue type distinguished samples across development (fig. S4) (13), consistent with observations in other plant and animal species (16, 17). Within similar tissue types, subgenome of origin also influenced expression patterns, consistent with previous results in wheat grain samples (18). This gene expression atlas provides a valuable resource for breeders and researchers to query for and analyze their genes of interest through www.wheat-expression.com (15) and the Wheat eFP Browser at http://bar.utoronto.ca/efp_wheat/cgi-bin/efpWeb.cgi (fig. S5) (19).

Fig. 1 Developmental time course of bread wheat.

Shown is a schematic overview of tissues sampled for the RNA-seq expression atlas across multiple growth stages (labeled in blue). Details of all samples are provided in table S1.

Homoeolog expression patterns

In polyploid wheat, quantitative variation for many agronomic traits is modulated by genetic interactions between multiple sets of homoeologs in the A, B, and D subgenomes (20). These interactions range from buffering effects observed when gene homoeologs are functionally redundant (21) to dominance effects where variation in a single homoeolog can lead to dominant phenotypes (22). Understanding how these interactions influence gene expression will help inform strategies to improve crops by targeting and manipulating individual or multiple homoeologs to quantitatively modulate trait responses (20).

To determine patterns of homoeolog expression, we analyzed 123 RNA-seq samples representing 15 tissues under nonstress conditions (table S1) from CS. This was the same accession used to generate the reference genome (10); thus, cultivar-specific polymorphisms were excluded from our analysis. We found evidence of expression for 82,567 (74.5%) high-confidence genes, consistent with the developmental time course of the cultivar Azhurnaya. We focused on 53,259 genes that had a 1:1:1 correspondence across the three homoeologous subgenomes, referred to as triads, and a summed expression of >0.5 TPM across the triad (64.5% of expressed genes, 96.1% of all triads; table S3). The majority of these expressed triads (94.3%) were in ancestral (i.e., syntenic) physical positions in at least two of the three subgenomes [50,238 genes corresponding to 16,746 syntenic triads (10)], whereas 5.7% (1007 triads) had all genes in nonsyntenic positions and are thus referred to as nonsyntenic triads. For each of the 17,753 expressed triads, we standardized the relative expression of the A, B, and D subgenome homoeologs (fig. S6) so that the sum was 1.0 in each individual tissue. In this way, the relative abundance of homoeolog expression is comparable within triads, as well as across tissues, allowing the study of homoeolog expression bias (23).

We performed a global analysis combining data across all 15 tissues and focused on the 16,746 syntenic triads, but we discuss patterns in nonsyntenic triads where relevant. We found that the D subgenome had a subtly yet significantly higher relative abundance (33.65%) than the B (33.29%) and A (33.06%) subgenomes (Kruskal-Wallis P < 0.001; tables S4 and S5). The homoeolog expression bias of the D subgenome is unlikely to reflect technical issues (fig. S2) and was found in 11 of the 15 tissues, was consistent across multiple expression abundance cutoffs, and was also significant in the developmental time course (in all 22 tissues) of the cultivar Azhurnaya [figs. S7 to S9 and table S5 (13)]. This effect, however, is subtle when compared with the homoeolog expression bias observed in evolutionarily older polyploid crops such as cotton, in which genome doubling occurred at an earlier time point (24).

The relative expression of each homoeolog determined a triad’s position in the ternary plot for the global analysis (Fig. 2A), as well as for analyses of individual tissues (figs. S6 and S10). From these plots, we defined seven homoeolog expression bias categories (13): a balanced category, with similar relative abundance of transcripts from the three homoeologs, and six homoeolog-dominant or homoeolog-suppressed categories, classified on the basis of the higher or lower abundance of transcripts from a single homoeolog with respect to those from the other two (Fig. 2A). Most syntenic triads (72.5%) were assigned to the balanced category within each tissue, with balanced triads ranging from 62.6% in the stigma and ovary to 78.9% in roots (Fig. 2B and table S6). Triads with single-homoeolog dominance were infrequent (7.1%; range among tissues, 4.7 to 11.3%), whereas syntenic triads classified as single-homoeolog–suppressed were more common (20.5%; range, 16.3 to 27.1%; Fig. 2B). These patterns shifted significantly in the 1007 nonsyntenic triads, which had fewer balanced triads (58.9%) and a higher proportion of dominant (14.5%) and suppressed (26.6%) triads across tissues (χ2 P < 0.001; tables S7 and S8). Across tissues, no differences were observed in the frequency of single-homoeolog dominance between subgenomes (tables S6 and S7). However, across all 15 tissues, D-homoeolog suppression was significantly less frequent (5.7%) than either A- or B-homoeolog suppression (7.5 and 7.2%, respectively; Kruskal-Wallis P < 0.05), and this pattern was also observed in nonsyntenic triads and the developmental time course (tables S6 to S8). This pattern explains in part the subtle homoeolog expression bias observed for the D subgenome relative to those observed for the A- and B-subgenome homoeologs. This observation is consistent with the lower distribution of repressive H3K27me3 (histone H3 lysine 27 trimethylation) histone marks across the gene body of D-subgenome homoeologs compared with those of the A- and B-subgenome homoeologs (fig. S11).

Fig. 2 Homoeolog expression bias in syntenic homoeolog triads.

(A) Ternary plot showing relative expression abundance of 16,746 syntenic triads (50,238 genes) in hexaploid wheat in the combined analysis of 15 tissues from Chinese Spring. Each circle represents a gene triad with an A, B, and D coordinate consisting of the relative contribution of each homoeolog to the overall triad expression (an example is shown on the top left). Triads in vertices correspond to single-subgenome–dominant categories, whereas triads close to edges and between vertices correspond to suppressed categories. Balanced triads are shown in gray. Box plots indicate the relative contribution of each subgenome based on triad assignment to the seven categories. (B) Proportion of triads in each category of homoeolog expression bias across the 15 tissues (excl, excluding). (C) Box plot of absolute TPM expression abundance for each subgenome from the seven categories. (D) Number of tissues in which homoeolog-suppressed (brown), homoeolog-dominant (teal), and balanced (gray) triads are expressed. (E) Metagene profile for histone H3K27me3 marks from –2 kb upstream of the ATG to +2 kb downstream of the stop codon (normalized for gene length) for balanced triads (gray), dominant triads separated into dominant (teal) and nondominant (pale blue) homoeologs, and suppressed triads separated into suppressed (tan) and nonsuppressed (brown) homoeologs.

Genes from syntenic triads in the balanced category were expressed across a wider range of tissues and had higher absolute transcript abundance, on a per-subgenome basis (12.2 tissues; median, 4.03 TPM), than genes in the suppressed (9.1 tissues; median, 1.51 TPM) or dominant (6.9 tissues; median, 0.57 TPM) categories (two-sample Kolmogorov–Smirnov test, P < 0.001; Fig. 2, C and D, and tables S9 and S10). The absolute transcript abundance data show that dominant triads are not the result of an overall increase in expression of a single homoeolog, but rather result from the relatively lower expression of the two other homoeologs.

To determine if the differences among homoeologs are a consequence of polyploidization, we analyzed RNA-seq data from diploid and tetraploid progenitor species and newly created synthetic hexaploid wheat (SHW) lines (25). We found that 67.5% of nonbalanced triads in modern-day wheat have a different homoeolog expression bias category than that observed in SHW, with all three subgenomes being equally affected (table S11 and figs. S12 to S14). Likewise, 47.1% of nonbalanced triads in SHW are in a different category than would be expected on the basis of the progenitor species, with D-subgenome homoeologs most strongly influenced (13) (table S12 and fig. S15). These results suggest that the polyploid context and the polyploidization process itself affect the relative expression of homoeologs compared with the baseline expression in the progenitor species (13), which has also been observed during the evolution of polyploid cotton (26) and monkeyflower (27).

We hypothesized that epigenetic mechanisms might be associated with differences in homoeolog expression patterns. To test this, we examined the associations of transposable elements (TEs), DNA methylation, and histone modifications with the relative expression of triads in leaves of CS. We found no clear relationship between the presence of TEs in promoter regions and altered expression patterns between homoeologs in dominant and suppressed triads (Tukey’s Honestly Significant Difference P > 0.6; fig. S16 and table S13) (13). However, we identified significant differences in gene-body DNA methylation and histone modifications among homoeologs (13).

Gene-body CG methylation is widely conserved in angiosperms, although its functional significance is currently under debate (28, 29), given that two angiosperm species lack this epigenetic mark altogether (30). We found higher gene-body CG methylation in constitutively expressed triads than in more tissue-specific triads (balanced > suppressed > dominant; fig. S17). Within the nonbalanced triads, homoeologs with higher expression had higher CG methylation than their corresponding nondominant and suppressed homoeologs (Mann-Whitney P < 0.001; fig S17). These results are consistent with gene-body CG methylation associated with housekeeping genes and its suggested role in homoeostatic gene expression (29). Similarly, the more highly expressed homoeologs within nonbalanced triads had higher active (H3K36me3 and H3K9ac; ac, acetylation) histone marks and lower repressive (H3K27me3) histone marks in the gene body (Mann-Whitney P < 0.001; fig. S11). For H3K27me3, these differences were not limited strictly to the gene body but extended into the upstream and downstream regions for both dominant and suppressed triads (Fig. 2E), consistent with the tight association of H3K27me3 with inactive gene promoters (31). These results suggest that epigenetic status in gene bodies, as well as upstream and downstream regions, is associated with homoeolog expression bias in polyploid wheat, consistent with results in monkeyflower showing changes in DNA methylation upon polyploidization (27).

Breeders rely on recombination to generate new combinations of haplotypes for improving cultivars. In wheat, chromosome position strongly influences recombination rates, with relatively low recombination rates in the interstitial and proximal regions (R2a, C, and R2b genomic compartments) but markedly higher rates toward the distal ends of the chromosomes (R1 and R3 genomic compartments) (32). In our analyses, syntenic triads in the balanced category were overrepresented in the low-recombination regions (R2 and C), which have higher levels of active histone marks (H3K36me3 and H3K9ac) (10), consistent with the higher expression of balanced triads. Homoeolog-dominant and homoeolog-suppressed triads were overrepresented toward the high-recombination distal ends of chromosomes (R1 and R3; χ2 P < 0.001; table S14), which have higher levels of repressive (H3K27me3) histone marks (10), consistent with the lower expression of dominant and suppressed triads. This pattern was also observed in the developmental time course of the cultivar Azhurnaya. However, when comparing the CS and Azhurnaya cultivars (nine tissues in common), we found that 84.5% of genes in the R2 and C regions had the same expression category between cultivars, whereas only 72.2% of genes in the R1 and R3 regions did so (χ2 P < 0.001; table S15). These differences in homoeolog expression bias across cultivars have important implications for breeding because they suggest that through genetic crosses, breeders not only generate new combinations of haplotypes with differential expression of alleles, but also rearrange and select for homoeolog expression bias between cultivars.

Variation of triad expression patterns

Polyploidy may confer phenotypic plasticity by allowing homoeologs to be expressed differently across tissues and/or environmental conditions (8). Our analyses above provide a static overview of the relative homoeolog expression bias in individual tissues. Therefore, we explored whether syntenic triads retain their homoeolog expression bias category across the 15 tissues (table S16). We found that 83.6% of balanced triads remained balanced in each of the 15 individual tissues, whereas dominant and suppressed triads tended to be more variable across tissues, with only 73.4 and 62.2%, respectively, staying within their global dominance group across all 15 tissues (Fig. 3A). Dominant and suppressed triads shifted most often to adjacent categories (16 to 20%) in the ternary plots and in few cases (<3.0%) changed to opposite categories (fig. S18). These patterns were also observed in the developmental time course (table S16). These data show that across tissues, triads most often remained consistent in their homoeolog expression bias classification, a phenomenon also seen across seven tissues in allotetraploid Tragopogon mirus (33).

Fig. 3 Variation of triad expression patterns.

(A) Variation of balanced, dominant, and suppressed triads (assigned on the basis of global analysis) across 15 tissues. (B) Distribution of mean distance of triad variation across 15 tissues for 14,258 triads expressed in at least six tissues. The 10% most stable (blue) and 10% most dynamic (red) triads were defined. (C) Ternary plots of representative stable and dynamic triads and (D) bar graphs of the distance between the triad position in the 15 individual tissues and the triad global average position (horizontal line). Color-coding is as in (A). (E) Number of tissues in which stable (blue) and dynamic (red) genes are expressed (table S36). (F) Homoeolog expression bias classification of stable and dynamic triads in global analysis. (G) Schematic representation of a wheat chromosome based on genomic compartments and features associated with distal (R1 and R3) and interstitial and proximal (R2 and C) regions. (H) Box plots of percent coding and protein sequence identity (left) and Ka/Ks ratio (right) for stable 10% (blue), middle 80% (gray), and dynamic 10% (red) triads.

To complement this analysis, we determined the variation in behavior of each triad within the ternary plot across the 15 tissues by calculating the mean distance between the triad’s position in each tissue and its global average position (13) (fig. S19). This generated a distribution of mean distances (Fig. 3B); we focused on the 10% most stable triads (defined as those having the shortest mean distances across tissues) and the 10% most dynamic triads (largest mean distances) (Fig. 3, B to D). Stable triads were expressed more highly than dynamic triads (median, 8.2 versus 3.2 TPM; P < 0.001) and had a higher expression breadth, being expressed across almost all samples, whereas dynamic triads were more tissue-specific (P < 0.001) (Fig. 3E and table S17). Stable triads were enriched for high-level gene ontology (GO-slim) terms associated with housekeeping processes (e.g. translation and cell cycle), whereas dynamic triads were enriched for defense and external stimuli responses and secondary metabolic processes, functions that more frequently determine differences in individual fitness (table S18) (6). In the global analysis, stable triads were significantly enriched for the balanced category (94.2%), whereas dynamic triads were almost equally spread between suppressed (37.9%), dominant (30.5%), and balanced (31.6%) categories (χ2 P < 0.001; Fig. 3F and table S19). This pattern is consistent with stable triads being most frequently located in proximal regions (C), whereas dynamic triads tend to locate in distal ends of chromosomes (R1 and R3) (χ2 P < 0.001; table S20). These results demonstrate expression asymmetry across wheat chromosomes, whereby the high-recombination distal ends of chromosomes have triads that exhibit higher homoeolog expression bias, are more dynamic across tissues, and have higher expression variation between cultivars than triads in the low-recombination proximal regions (Fig. 3G). This asymmetry is also reflected in the contrasting distributions of histone marks and DNA methylation along chromosomes (10). The difference in epigenetic marks identified in leaves makes it tempting to speculate that epigenetic marks may also be associated with triad expression variation across multiple tissues.

We next investigated if divergence in spatial expression patterns (as measured by the mean distance statistic described above) was coupled with 5′ promoter and protein sequence divergence in syntenic triads (13). The 1.5–kilobase (kb) promoters of dynamic triads more frequently contained TEs (88.3 versus 79.2%), which were closer to the translation start site (1113 versus 1234 bp away) but shorter (median, 220 versus 259 bp), than those in stable triads, leading to equivalent TE densities (all comparisons, Kruskal-Wallis P < 0.001; fig. S20 and table S13). These closer, more frequent, and shorter TEs could potentially act as novel cis-regulatory elements (34) or influence epigenetic marks (35). These results indicate that the promoter TE landscape relates more closely to the variation in the relative expression of homoeologs across tissues than to a ubiquitous effect across all tissues (table S13). Although only subtle differences in sequence identity were found between stable and dynamic triads (85.5 versus 85.0%; P = 0.045) (Fig. 3H and table S21), dynamic triads had fewer conserved transcription factor (TF) binding site motifs across the three homoeologs (37% fewer; P < 0.001; fig. S21). Across coding sequences, we showed a stepwise decrease in conservation of both the nucleotide and protein identities from stable (average, 97.2% coding sequence and 97.3% protein) to dynamic (95.0 and 93.4%) triads (both P < 0.001; table S21). We compared nonsynonymous (Ka) with synonymous (Ks) substitution rates between homoeologs and observed that dynamic triads had significantly higher Ka/Ks than stable triads (0.33 versus 0.21; Mann Whitney P < 0.001; Fig. 3H and table S22). This higher ratio suggests that triads with greater divergence in spatial expression patterns are under more relaxed selection pressure, as seen for duplicated genes in humans (9), but not in soybean (36) and carp (37). This conclusion is supported by the observation that nonsyntenic triads, which had greater expression divergence (10.5% larger mean distance; Mann-Whitney P < 0.001), also had significantly higher Ka/Ks (0.42; Mann-Whitney P < 0.001) compared with syntenic triads (table S22). The above relationships were consistent when using different percentage cutoffs to define stable and dynamic triads (5 and 25%), as well as in the developmental time course of the cultivar Azhurnaya (tables S21 and S22). These results show positive coupling of divergence in spatial expression patterns with divergence in TE and cis-regulatory elements in promoters and sequence divergence in coding sequence among wheat homoeologs. It is possible that divergence in spatial expression patterns, alongside relaxation of selection pressure, can lead to functional innovation through homoeolog neo- or subfunctionalization.

Coordinated expression of homoeolog triads

Our analyses provide a framework to describe the relative expression of individual homoeologs between discrete triads in space and time. To understand how this coordination of homoeolog spatiotemporal expression may influence biological processes, we developed a series of coexpression networks to provide insight into tissue-specific developmental and stress-related processes.

We constructed four separate tissue-specific coexpression networks from nonstress RNA-seq samples from grain (n = 119 samples), leaf (n = 245), root (n = 45), and spike (n = 128), using all genes expressed at more than 0.5 TPM in the given tissue (13). These networks were composed of 51 to 78 modules and contained 42.3 to 88.0% of all expressed genes in each tissue (fig. S22, table S23, and data S1). We found that across all tissue networks, homoeologs from 37.4% of the syntenic triads were in the same coexpression module, suggesting a highly coordinated expression pattern for these triads (χ2 P < 0.0001 with respect to random triads; table S24). However, the majority of triads (62.6%) had at least one homoeolog outside the same module.

To quantify whether homoeologs outside the module had similar or divergent expression patterns, we calculated a threshold based on the pairwise distance between homoeologs (13). We found that 29.6% of syntenic triads had a divergent pattern, wherein the expression of at least one homoeolog exceeded the distance threshold in the tissue network (Fig. 4A and fig. S23). Conversely, 33% of triads had a similar pattern, wherein all pairwise distances between homoeologs were lower than the threshold, suggesting a subtler variation in a single homoeolog. These values showed significant variation between tissue networks, ranging from 7% divergent triads in the leaf network to more than 38% divergent triads in the root and grain networks (Fig. 4A). Nonsyntenic triads had a higher proportion of divergent triads in all tissue networks compared with syntenic triads (mean, 35.1 versus 29.6%; χ2 P < 0.001; table S24). Using the same criterion as before (triad mean distance between tissues and global average position), we identified the 10% most stable and dynamic syntenic triads for each tissue-specific network. We found that dynamic triads were more frequently in divergent modules than stable triads for all four tissue networks (P < 0.001; Fig. 4B and fig. S24). These results are consistent with the homoeolog expression bias analyses and support the idea that although many triads are expressed in a coordinated spatiotemporal pattern (with the same or similar profile), almost 30% of syntenic and 35% of nonsyntenic triads have a divergent expression profile. Transcriptional divergence occurs both immediately upon polyploidization and after polyploidization (figs. S12 to S14) and may represent initial steps toward neo- or subfunctionalization of wheat homoeologs.

Fig. 4 Homoeolog coexpression patterns in tissue networks.

(A) Triad assignment to same, similar, and divergent modules in tissue coexpression networks. (B) Stable, middle, and dynamic triad assignment to same, similar, and divergent modules in the root network. (C) Neighbor-joining phylogenetic tree of homologs for the Arabidopsis MADS_II gene AGL21. Wheat orthologs from chromosome group 2 are assigned to the root-specific module 61 (red), whereas chromosome group 6 orthologs are assigned to modules 1 and 13 (blue and purple).

Exploiting development and stress networks for biological discovery

To explore the potential for biological discovery, we first compared modules between networks to identify tissue-specific gene networks. Across the four networks, 73.2% of modules had significant overlap (Fisher’s exact test, P < 0.05) with modules in all four networks, with the root having the fewest conserved modules (61.1%) and the spike having the most (86.2%) (data S2). In the root, there were three modules that were not found in any other tissue, with the largest of these (root module 61; 82 genes) enriched for root-related plant ontology (PO) terms (e.g., root procambium, P = 3.3 × 10–5, and central root cap of primary root, P = 4.5 × 10–5; table S25). We hypothesized that genes encoding TFs controlling processes related to these PO terms would also be coexpressed within module 61. We found that four of the 10 genes encoding TFs in this root-specific module had known functions related to root development in Arabidopsis or rice (38, 39) (table S26). Three of these TFs belonged to one homoeolog triad in the MADS_II family, and one of their Arabidopsis orthologs (AGL21) has been shown to regulate lateral root development through auxin accumulation (39). To understand the target genes of these TFs in wheat, we conducted a complementary network analysis using genie3, which predicted target genes of TFs across all 850 samples (13). Target genes of the three TFs were enriched for cell wall processes and lignification, consistent with their putative role in the differentiation zone where lateral roots emerge (tables S27 and S28). Closely related paralogs on chromosome group 6 in wheat were not located in root module 61; rather, they were in modules 1 and 13 (Fig. 4C). These modules were conserved in all other tissue networks, implying a more general function for genes within them. Supporting this hypothesis, the rice ortholog of the chromosome 6 paralogs (OsMADS57) has been shown to play a role in tillering (40).

A key challenge for wheat breeding is the selection of cultivars with tolerance to multiple stresses. Therefore, we focused on stress responses in seedlings and young vegetative plants, for which 10 independent studies with 12 distinct abiotic and disease stresses were available (table S1). We constructed gene coexpression networks for abiotic and disease stresses separately, including control samples from the same studies to allow for links between disease status and gene expression (13). We integrated the two networks to identify modules that might be common to both abiotic and disease responses. We found 84 pairs of modules between the two networks that had significantly overlapping gene content and were significantly correlated with both an abiotic and a disease stress (tables S29 to S31). The most significant overlap was between disease module 12 and abiotic module 2 (P =1.3 × 10–94), which shared 355 genes (Fig. 5A). These two modules had similar enrichment for GO-slim terms relating to signal transduction and response to stimulus (table S32), suggesting that they might perform similar biological functions.

Fig. 5 Overlapping modules within abiotic and disease stress networks.

(A) Number of genes in abiotic module 2 and disease module 12 and the overlap between modules. The number of transcription factors is indicated in parentheses. (B) Transcription factors found in both abiotic 2 and disease 12 (left) and the top five enriched GO terms of their targets, as identified by genie3 (right).

Among the 355 shared genes, there were 16 encoding TFs, six of which have orthologs in rice or Arabidopsis with proven roles in abiotic or disease stress, and a further three have orthologs differentially expressed during stress in these species (table S33). Furthermore, on the basis of the genie3 analysis, 11 of the 16 TFs have targets that are enriched in stress responses, and seven have targets that are enriched simultaneously in biotic and abiotic stress responses (Fig. 5B). Of the genes encoding these TFs, two homoeologs stood out as potential common regulatory components of abiotic and disease response: TraesCS5A01G237900 and TraesCS5B01G236400, which encode heat shock factor (HSF) TFs. These two HSF TF–encoding genes were in the top 10 most central genes within disease module 12 (table S34), as measured by intramodule connectivity, a value strongly correlated with the influence of a gene on a phenotype (41). The 387 predicted targets of the TFs encoded by these two genes were frequently allocated to module 12 of the disease network (39.5%) and module 2 of the abiotic stress network (28.0%) (table S35). The Arabidopsis ortholog of these genes, TBF1, was originally identified for its role in pathogen defense response (42) and has been shown to play a key role in the transition from growth to defense (43), while also positively regulating acquired thermotolerance (44). Recently, a “TBF1 cassette” including the promoter and 5′ leader region of TBF1 was used to engineer broad-spectrum disease resistance in both Arabidopsis and rice without a fitness cost (45). The fact that Arabidopsis TBF1 is functional in rice suggests that this regulatory mechanism may be conserved across species, making the wheat orthologs identified here promising targets for further studies. These and other highly connected genes (table S34) are strong candidates for controlling stress responses, and the functions of their orthologs support this hypothesis. These results demonstrate the power of the datasets and show that integrating gene networks from wheat, alongside phylogenetic relationships and knowledge of biological function in model species, can help identify candidate genes for further study in wheat.

Concluding statement

This study provides detailed insight into the spatiotemporal transcriptional landscape of polyploid wheat. We find evidence that the differences in relative expression among homoeologs observed in modern-day wheat may have been established both upon the polyploidization of wheat itself and during the subsequent 10,000 years of polyploidy; these differences may have been determined through epigenetic changes affecting both DNA methylation and histone modifications. We identified asymmetries along wheat chromosomes for a series of features relating to homoeolog expression bias with important implications for breeding. Our work provides a framework for the generation of hypotheses about biological function in polyploid wheat, which can now be experimentally tested using recent developments in sequenced mutant populations (46) and genome editing approaches (47). Ultimately, this knowledge will help researchers and breeders modulate allelic variation across homoeologs to improve quantitative traits in polyploid wheat. This is an urgent task for achieving global food security, given that wheat provides more than 20% of the protein and caloric intake (48) of humans.

Materials and methods

RNA-seq samples

We included 246 samples previously described (15) and complemented this with 283 RNA-seq samples which were deposited in the Short Reads Archive (SRA) between August 2015 and January 2017. An additional 321 RNA-seq samples from six studies were used for this analysis and are detailed in the supplementary materials (13).

Mapping of RNA-seq reads to reference

For all 850 samples, metadata was assigned as described in (15), with high and low-level factors for tissue, age, variety, and stress. Due to the relatively large number of low-level tissues (59), we further defined an intermediate level of tissues comprising 32 factors (average 26.5; median 12 replicates per factor) which was used for this study. We also assigned an intermediate level of stress comprising 15 factors (average 14.5; median 6 replicates per factor). We used kallisto v0.42.3 (14) to map the 850 RNA-seq samples to the Chinese Spring RefSeqv1.0+UTR transcriptome reference. We used default parameters previously shown to result in accurate homoeolog-specific read mapping in polyploid wheat (15) (fig. S1). We summarized expression levels from the transcript level to the gene level using tximport v1.2.0. We established the criteria that at least 1% of samples for a given gene all required to have expression values over 0.5 TPM for that gene to be considered expressed (initial 850 filter).

To confirm that kallisto enables homoeolog specific mapping (15) we analyzed expression of HC genes expressed >0.5 TPM in nulli-tetrasomic wheat lines from the publicly available study SRP028357 (49). The nulli-tetrasomic lines were missing an entire chromosome (1A, 1B, or 1D) which was replaced by a duplication of another homoeologous chromosome, e.g. Nulli1ATetra1B has 0 copies of 1A, 2 copies of 1B and 1 copy of 1D. We determined stringent homoeolog-specific mapping using a series of criteria detailed in the supplementary materials (13).

Analyses of expressed genes

Starting from the subset of genes considered expressed using the initial 850 filter criterion, we determined genes which were expressed in at least one tissue within the Azhurnaya developmental time course (209 samples; 22 intermediate tissues) and Chinese Spring no stress (123 samples; 15 intermediate tissues) datasets. For this analysis, we first calculated the average TPM expression of each gene in each of the intermediate tissue types (average expression per tissue). The number of samples that went into generating this average expression per tissue value varied for each intermediate tissue and are available in table S1. We considered a gene expressed when its average expression per tissue was > 0.5 TPM in at least one intermediate tissue. For both datasets we focused on HC gene models (10). Whilst expression data was also assessed for LC genes, we excluded these from the main analysis to avoid confounding effects from pseudogenes and low-quality gene models. Through this analysis we found evidence of expression for 83,741 (75.6%) HC genes in Azhurnaya and 82,567 (74.5%) HC genes in Chinese Spring.

Using the average expression per tissue values, we also determined the global expression of each gene across all tissues in which it was expressed (based on the >0.5 TPM criteria in the tissue). This generated an average value across tissues, rather than a geometric mean across all samples, to account for the variation in the number of samples per tissue. It also excludes tissues in which a gene is not expressed. This average across expressed tissues is referred to as either the “global analysis” or the “combined analysis (all tissues)” across the main text and in the supplementary materials and tables.

Relative expression levels of the A, B, and D subgenome homoeologs across triads

The analysis focused exclusively on the gene triads which had a 1:1:1 correspondence across the three homoeologous subgenomes, including 17,400 syntenic and 1074 nonsyntenic triads (total of 18,474 triads or 55,422 genes). Starting from the subset of genes considered expressed using the initial 850 filter criterion, we defined a triad as expressed when the sum of the A, B, and D subgenome homoeologs was > 0.5 TPM. This allowed us to include triads in which, for example, only a single homoeolog was expressed, and which could later be classified as a dominant triad. Using this criterion, we defined a total of 53,259 genes (17,753 triads) which were considered expressed (table S3). To standardize the relative expression of each homoeolog across the triad, we normalized the absolute TPM for each gene within the triad as followsEmbedded ImageEmbedded ImageEmbedded Imagewhere A, B, and D represent the gene corresponding to the A, B, and D homoeologs in the triad. The normalized expression was calculated for each one of the intermediate tissues and for the average across all expressed tissues (“combined analysis” as described previously). Fig. S6 shows an example of these calculations for the roots and the combined analysis across three triads. The values of the relative contributions of each subgenome per triad were used to plot the ternary diagrams using the R package ggtern (50).

Definition of homoeolog expression bias categories

The ideal normalized expression bias for the seven categories was defined as shown in table S37.

We calculated the Euclidean distance (rdist function from R 3.3.2) from the observed normalized expression of each triad to each of the seven ideal categories listed above. We assigned the homoeolog expression bias category for each triad by selecting the shortest distance. This was done for each of the intermediate tissue as well as for the average across all expressed tissues (combined analysis).

Analysis of the effects of polyploidy on homoeolog expression bias

We used RNA-seq data (25) which consisted of two datasets based on RNA-seq samples from the youngest leaf at fifth leaf stage. Dataset 1 (SHW1) included samples from tetraploid (BBAA) Triticum turgidum ssp. turgidum wheat accession AS2255, diploid Ae. tauschii (DD) accession AS60, and the synthetic hexaploid wheat (SHW1; BBAADD) resulting from the cross between the tetraploid and Ae. tauschii accessions. Dataset 2 (SHW2) consisted of tetraploid T. turgidum ssp. durum cv Langdon (BBAA), the same diploid Ae. tauschii (DD) accession AS60, and an independent synthetic hexaploid wheat (SHW2) derived from Langdon x AS60 (BBAADD). Note that AS2255 and Langdon are both T. turgidum ssp., but are defined as different subspecies based primarily on morphological features. These experiments recreate the polyploidization events that gave rise to modern bread wheat and the resulting SHW has the same genome composition as the CS and Azhurnaya datasets examined in this study.

We analyzed the RNA-seq from both datasets by mapping reads to the CS RefSeqv1.0 transcriptome using the same bioinformatics pipeline as before (see “Mapping of RNA-seq reads to reference” section). However, for the tetraploid datasets we used only the A and B subgenome transcripts as a reference, for the diploid D genome datasets we used only D subgenome transcripts, and for the SHW datasets we used the complete RefSeqv1.0 transcriptome as the reference, as in CS and Azhurnaya. To generate the expected hexaploid wheat transcriptome based on progenitor species we weighted the TPM values from the tetraploid by 2/3 and the AS60 TPM values by 1/3 to maintain a total TPM of 106 in the combined dataset. The in-silico hexaploid wheat generated from the weighted tetraploid and diploid TPM values (referred to hereafter as the “expected” in-silico dataset) allows the direct comparison with the observed TPM values in SHW. We defined the seven homoeolog expression bias categories for both the expected in-silico and the observed SHW transcriptomes using the same methods as for CS and Azhurnaya and compared the classification of triads between the observed and expected datasets (table S12). We next compared classifications to modern-day bread wheat CS and Azhurnaya. To enable a meaningful comparison across similar tissues from the Hao et al study (25) we used nine samples from the PAMP Triggered Immune Response dataset from CS and six samples from the Azhurnaya dataset (table S1). As before, we defined the seven homoeolog expression categories for the defined CS and Azhurnaya datasets and compared them with the SHW and the in-silico classifications (table S11).

DNA methylation plant material and library preparation

Plants were grown as described in the Chinese Spring tissues study. The frozen leaves from the five samples at 3-leaf stage (Zadok stage 13) were ground and divided as input for the preparation of both RNA-seq libraries (detailed in Chinese Spring tissues study) and whole genome bisulfite sequencing (WGBS) libraries. These samples enabled direct comparisons between the DNA methylation profile and homoeolog expression patterns in the same samples. WGBS libraries were constructed from purified nuclei prepared using the published methods (51). Input DNA was quantified using the Qubit high sensitivity DNA kit. A total of 500 ng of nuclear DNA was spiked with 270 pg of Lambda DNA to assess the conversion efficiency obtained using the EZ DNA Methylation-GoldTM Kit (Zymo research corp, Irvine, Ca, USA). WGBS libraries were prepared using the TruSeq DNA kit, (Illumina, Madison, WI) and 2 × 125 bp paired-end sequence reads was generated using the Illumina HiSeq 2500 v4 platform (Genome Quebec, Montreal, QC, Canada). The data was deposited as SRP133674.

DNA methylation data analysis

Sequence quality and adaptor removal was performed using Trim_galore_v0.4.1 (52). High quality paired-end sequence reads were aligned to the RefSeqv1.0 Chinese Spring genome using Bismark version 0.16.1 (53) ensuring the removal of duplicate reads and only retaining unique unambiguous alignments. The data were processed to exclude regions with low coverage using a binomial test. The methylation data was annotated using the gene feature coordinates provided by the RefSeqv1.0 Chinese Spring gene definitions. 5 kb flanking regions around the gene features were also extracted. The two flanking regions and the gene feature were each divided into 50 tiles (150 tiles in total) to summarize the observed methylation ratios. Data manipulation, statistical analysis and image generation were performed using the R language (54) utilizing the data.table (55), MethylKit v1.5.2 (56), genomation (57), and ggplot2 packages (58).

Comparison of RNA-seq sample classification with DNA methylation

For the five RNA-seq samples (from the same plants used for analyzing DNA methylation) we classified triads into the seven balanced, dominant, and suppressed categories using the same method as for previous analyses. We then classified homoeologs within dominant and suppressed triads into the “dominant” and “nondominant” homoeologs, and “suppressed” and “nonsuppressed” homoeologs. For example, in an A dominant triad the A subgenome is classified as “dominant” and B and D subgenomes are classified as “nondominant” homoeologs. The DNA methylation patterns of genes in each of these categories were plotted using the methods described above (DNA methylation data analysis). Differences in DNA methylation levels between categories were tested pairwise using the nonparametric Mann-Whitney t test using the wilcox.test() in R (fig. S17).

Histone modification analysis

To study the role of histone modifications we carried out ChIP-seq for three active marks (H3K36me3, H3K9ac, and H3K4me3) and one repressive mark (H3K27me3) (deposited under SRA accession number SRP126222). We used Chinese Spring at 3-leaf stage, however RNA-seq data were not collected from these exact plants. To calculate the homoeolog expression bias we used Chinese Spring samples from a separate experiment (PAMP-triggered immune response study) in which the same tissue was collected at a similar stage (3-leaf stage). Whilst combining data from two separate experiments may introduce some noise into the analysis, the ChIP-seq and RNA-seq data are from similar tissues, at a similar growth stage, in the same wheat variety, and are thus highly comparable. Nevertheless, this confounding factor should be considered when interpreting these results. ChIP assays, DNA library preparation, and sequencing were performed as in (10).

Histone data analysis

Raw FASTQ files were preprocessed with Trimmomatic v0.36 (59) to remove Illumina sequencing adapters, trim 5′ and 3′ ends with quality score below 5 (Phred+33) and discard reads shorter than 20 bp after trimming. Paired-ends reads were aligned against IWGSC RefSeq v1.0 assembly using bowtie2 v2.3.3 with --very-sensitive settings (60). Alignments with MAPQ < 10 were discarded and duplicate reads removed with Picard MarkDuplicates (http://broadinstitute.github.io/picard/). Triad expression category was calculated using Chinese Spring samples from a separate experiment (PAMP-triggered immune response study) using the same method as in previous analyses. As in the DNA methylation analysis we classified homoeologs within dominant and suppressed triads into the “dominant” and “nondominant” homoeologs, and “suppressed” and “nonsuppressed” homoeologs.

We calculated meta-gene profiles for each category by computing the read density of each histone mark over different triads categories using Deeptools (61) computeMatrix scale-regions and plotted it with plotProfile. To make a statistical comparison, for each histone mark we scored the number of reads overlapping with gene bodies using bedtools coverage –counts (62). Only reads fully mapping within gene bodies were considered. To account for different gene size we divided the read counts over each gene by its length. The distributions of reads density over different triads categories were compared with a nonparametric t test (Mann-Whitney U-test) using the function wilcox.test in R (fig. S11).

Variation in homoeolog expression bias across tissues (stable and dynamic triads)

To define the variation in homoeolog expression bias of each triad across the intermediate tissues we calculated the Euclidean distance between the triad’s global position (combined analysis) and each individual tissue in which the triad was considered expressed. We included only triads which were considered expressed in at least six tissues based on the combined analysis criteria outlined above. The average of these distances was defined as the “triad mean distance”. We ranked triads by their triad mean distance and the percentile was calculated byEmbedded Imagewhere CMD is the vector containing all the triad mean distance. The first and last deciles were classified as stable 10% and dynamic 10% triads, respectively. A similar approach was used to define the corresponding 5% and 25% extremes of the distribution. This analysis was conducted independently for the Chinese Spring no stress samples, the Azhurnaya developmental time course, and for each of the four tissue-specific networks. A visual representation is provided in fig. S19.

TE presence in gene promoters

We extracted all TEs that were annotated to fall at least partly within 1.5 kb and 5 kb upstream of the canonical ATG start-codon for all genes. We then split the TEs into the relevant gene lists covering homoeolog expression bias variation (stable 10%, middle 80%, and dynamic 10%) and homoeolog expression bias (balanced, dominant, nondominant, suppressed, and nonsuppressed) based on the “combined analysis (all tissues)” for CS. We used these lists to identify the proportion of genes and triads in each category which contained at least one TE in the promoter region.

Enrichment of TE families in gene promoters

Using the GFF file of TE coordinates, we extracted TEs present in the promoter regions of HC genes. We retrieved all TE copies that are entirely or partially present in the 5 kb upstream of the ATG start-codon of the canonical transcript for each gene. We then calculated the number of genes in each of the stable 10%, middle 80%, and dynamic 10% categories which contained specific TE families. We required the TE family to be present in at least 2% of the categorized genes for further analysis. We then found the deviation of this distribution from the expected 10-80-10 ratio using the χ2 test, P values adjusted with Benjamini-Hochberg. We calculated the median length of each TE family based on all instances of that TE across the genome. We found fifteen TE families deviated significantly from the expected 10-80-10 distribution (Benjamini-Hochberg P < 0.01). However, the majority of these TE families were present in less than 5% of the genes considered, and showed very small variation in the number of promoters containing the TE, suggesting that the statistical significance may not be biologically relevant.

TE density in gene promoters

We calculated the density of TEs within 5 kb upstream of genes by calculating the proportion of TE bases in sliding windows of 100 bp with a step size of 10 bp. The mean of each window was then calculated for both the stable 10%, middle 80%, and dynamic 10% triads and the subgenome dominance categories (balanced, dominant, nondominant, suppressed, and nonsuppressed). Mann Whitney tests with Benjamini-Hochberg adjusted P values were used to test for differences in TE density between categories across each window.

Transcription factor binding site identification

The 1.5 kb of sequence upstream of the canonical ATG start-codon was used to identify transcription factor binding sites (TFBS) present in promoters of HC triads. The FIMO tool from the MEME suite [v 4.11.4 (63)] was used with a position weight matrix (PWM) obtained from plantPAN 2.0 (64) to predict TFBS based on previously identified sites across multiple plant species. FIMO was run with a P value threshold of <1E-04 (default), --motif-pseudo set to 1E-08 as recommended for use with PWMs and a --max-stored-scores of 1,000,000 to account for the large size of the dataset. The background model was generated from all extracted promoter sequences using the MEME fasta-get-markov command. Details of the TFBS comparisons between homoeologs is presented in the supplementary materials (13).

WGCNA network construction

Coexpression networks were built for six separate sample sets: grain, leaf, spike, root, abiotic and disease (table S1) using the WGCNA R package (65). For each network, we selected HC genes which were expressed > 0.5 TPM in three or more samples. The count expression level of each gene was normalized using variance stabilizing transformation from DESeq2 (66) to eliminate differences in sequencing depth between studies. The soft power threshold was calculated as the first power to exceed a scale-free topology fit index of 0.9 for each network separately. The soft powers used were: leaf = 12, spike = 12, roots = 7, disease = 7. For the abiotic and grain network the 0.9 threshold was not crossed until 15 and 20 respectively, which may be due to strong differences between samples within these datasets, therefore the soft power threshold was selected according to the number of samples, resulting in abiotic = 7 and grain = 6. Signed hybrid networks were constructed blockwise using the function blockwiseModules() with a maximum block size of 46,000 genes. The correlation type used was biweight mid-correlation “bicor” and the maxPOutliers was set to 0.05 to eliminate effects of outlier samples. The topographical overlap matrices (TOM) were calculated by the blockwiseModules function using TOMType = “unsigned” and the minimum module size was set to 30. The parameter mergeCutHeight = 0.15 was used to merge similar modules.

Defining same, similar, and divergent expression patterns of triads

For triads which had homoeologs within different modules in the WGCNA networks we developed a threshold to determine whether the different modules had a “similar” or “divergent” expression pattern. We calculated the Euclidean distance between module eigengenes using the R package dist() and with these values we calculate the distances between the homoeologs in each triad. Triads where the pairwise distances were zero were in the same module. Triads where the pairwise distances were over zero were in different modules. For these triads in different modules when the pairwise distance between any two homoeologs was > 50% of the median maximum distance between eigengenes, the triad was classified as having a “divergent” expression pattern. In cases where the pairwise distance was over zero between at least one pair of homoeologs and the distance between all three pairs of homoeologs were =< 50% of the median maximum distance, the triad was classified as having a “similar” expression pattern. The median maximum distance between eigengenes was averaged across all four tissue networks to give a final threshold (50% of median maximum distance) of 0.937431. This analysis was carried out for 1:1:1 syntenic and 1:1:1 nonsyntenic triads expressed in each of the tissue networks (total triads = 9599 grain, 5378 leaf, 11,038 root, and 6173 spike). This excluded triads which had a putative transposable element (67 triads). Fig. S23 shows a graphic representation of this classification and the effect of altering the threshold in each of the four networks.

Module overlaps

Module overlap between networks was calculated using the R package GeneOverlap which calculates significant overlaps between modules using a Fisher’s exact test. Modules were considered to have significant overlaps when the FDR adjusted P value < 0.05.

Correlation to stress status

Modules within the abiotic and disease networks were tested for correlations to intermediate level stresses using the cor(function). The significant of correlations were calculated using the function corPvalueStudent() and corrected for multiple testing using p.adjust() using the Benjamini & Yekutieli method (67).

Genie3

HC genes expressed >5 TPM in at least one of the 850 sample were selected. Out of these 78,085 genes there were 3386 transcription factors (methods described above). Random forest regression was estimated for each gene based on the transcription factors as inputs using the genie3 package (68) in R (version 3.3.2) with default parameters (K=sqrt, nb.trees=1000, input.idx=list of transcription factors, importance.measure=IncNodePurity, seed=NULL). For each transcription factor, all predicted target genes (connectivity > 0.005) were extracted and functional enrichment within the target genes was determined using topGO (69) in R (version 3.3.2) with the following parameters (ontology = “BP”, nodeSize = 10, classic Fisher test P < 10–10). To summarize the results, the top three GO terms for each transcription factor, the P values for the strongest enrichment, and the direct blastx match in the Arabidopsis proteome (tair10) and well as the e-value and description were tabulated (table S28). The complete list of GO term enrichments of the biological process ontology for each transcription factor and the list of transcription factors associated with each GO term in the ontology of biological process are published in e!DAL (https://doi.ipk-gatersleben.de/DOI/53148abd-26a1-4ede-802b-c2635af6a725/0dd8224a-34fc-4e3b-8ab8-883d07e52bd2/2/1847940088).

Identifying highly connected hub genes

Hub genes within each module for the abiotic and disease stress networks were calculated using the WGCNA R package function signedKME(). This calculates the correlation between the expression patterns of each gene and the module eigengene. Genes which were more highly correlated to the eigengene were considered hub genes.

Supplementary Materials

www.sciencemag.org/content/361/6403/eaar6089/suppl/DC1

Additional Materials and Methods

Figs. S1 to S24

Tables S1 to S37

IWGSC Collaborator List

References (7097)

Data S1 and S2

References and Notes

  1. Additional materials and methods are available as supplementary materials.
Acknowledgments: We thank Bayer Crop Science staff members E. Caestecker and X. Wang for data analysis and B. Staelens, T. Debaecke, and A. Dobbelaere for plant growth and sampling. We also acknowledge the assistance of M. Burrell (NBI Computing) and A. Etuk (EI Digital Biology). Funding: This work was supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) through the Designing Future Wheat (BB/P016855/1), GEN (BB/P013511/1), and Plant Health (BB/P012574/1) ISPs; Triticeae Genomics for Sustainable Agriculture (BB/J003557/1); ERA-PG (BB/G024960/1); ERA-CAPS (BB/N005007/1); and an Anniversary Future Leaders Fellowship to P.B. (BB/M014045/1). This work was also supported by the International Wheat Yield Partnership (IWYP76); the German Federal Ministry of Food and Agriculture (2819103915); the German Ministry of Education and Research (031A536); DFG (SFB924); Genome Canada/Ontario Genomics (OGI-128); the Canadian Applied Triticum Genomics project (CTAG2) funded by Genome Canada, Genome Prairie, Western Grains Research Foundation, Saskatchewan Wheat Development Commission, Alberta Wheat Development Commission, and Saskatchewan Ministry of Agriculture; the National Research Council of Canada Wheat Flagship program; the WGRCI/UCRC partially funded by NSF (IIP-1338897); French Agence Nationale de la Recherche grants ANR-11-BSV5-0015 and ANR-16-TERC-0026-01; and the European Research Council. S.A.H. was supported by the John Innes Foundation. C.J. was supported by Région Auvergne and the European Regional Development Fund (SRESRI 2016). This research was also supported in part by the NBI Computing infrastructure for Science (CiS) group through the HPC resources. The submission of sequencing data was brokered by the COPO platform (https://copo-project.org), funded by the BBSRC (BB/L024055/1), and supported by CyVerse UK, part of the Earlham Institute National Capability in e-Infrastructure. Author contributions: R.H.R.-G., P.B., and C.U. conceived, designed, and coordinated the study. R.H.R.-G. and P.B. organized RNA-seq samples and assigned metadata. R.H.R.-G. carried out the mapping and developed methods to analyze homoeolog expression patterns. R.H.R.-G. and C.U. analyzed homoeolog expression patterns and variation between tissues, cultivars, syntenic and nonsyntenic triads, chromosomal partitions, and progenitor species. P.B. constructed WGCNA coexpression networks, analyzed homoeolog coexpression, and identified biological case studies in the networks. P.B. carried out differential expression analysis between tissues. D.Lan., K.F.X.M., and M.S. generated gene annotations, performed phylogenomics analysis (including gene family, ortholog and homoeolog inference, phylogenetic trees, and TF superfamily classification), and analyzed subgenome expression bias. A.P. led informatics development, and Y.K. generated pictograph drawings for the wheat eFP portal, supervised by A.G.S. and N.J.P. A.Br. ran and analyzed the genie3 network. S.A.H. analyzed Ka/Ks ratios. A.T.C., S.J.R., and A.G.S. performed and analyzed DNA methylation profiles. D.Lat. performed ChIP-seq experiments, and L.C. performed bioinformatic analysis of ChIP-seq data, supervised by M.B. and A.Be. E.P. and C.J. analyzed histone marks. S.A.H., J.B., R.H.R.-G., and L.V. carried out the promoter analysis. T.F. illustrated wheat development for Fig. 1. E.P. identified chromosome partitions. F.C. identified and S.A.H. and J.B. analyzed transposable elements. J.B. conducted promoter cis-regulatory element analysis. M.D., J.J., F.v.E., S.J.R., A.T.C., H.S., B.S., D.X., C.J.R., B.C., B.B.H.W., R.A., V.T., R.D., C.J.P., and A.G.S. provided RNA-seq samples. P.B. and C.U. wrote the manuscript. R.H.R.-G., S.A.H., J.B., A.Br., D.Lan., A.T.C., L.C., C.J., E.P., and T.F. contributed text and figures for the manuscript. L.V., M.D., J.J., F.v.E., Y.K., J.B., H.S., B.S., C.J.R., R.A., V.T., R.D., F.C., C.J.P., N.J.P., A.G.S., and M.S. provided comments on the manuscript. All authors read and approved the final submission. Competing interests: The authors declare no conflicts of interest. Data and materials availability: All code used for the analyses of the datasets can be found at https://github.com/Uauy-Lab/WheatHomoeologExpression, and data files are deposited in https://grassroots.tools. Sequencing reads were deposited with NCBI under accession codes PRJEB25639, PRJEB23056, PRJNA436817, SRP133837, PRJEB25640, and PRJEB25593 for RNA-seq; SRP133674 for whole-genome bisulfite sequencing; and SRP126222 for ChIP-seq. The full phylogenetic tree presented in Fig. 4C is available at http://itol.embl.de/shared/borrillp. Details of each dataset can be found in the methods and the supplementary materials.
View Abstract

Navigate This Article