Supplementary Materials

Core and region-enriched networks of behaviorally regulated genes and the singing genome

Osceola Whitney, Andreas R. Pfenning, Jason T. Howard, Charles A Blatti, Fang Liu, James M. Ward, Rui Wang, Jean-Nicolas Audet, Manolis Kellis, Sayan Mukherjee, Saurabh Sinha, Alexander J. Hartemink, Anne E. West, Erich D. Jarvis

Download Supplement
  • Materials and Methods
  • Figs. S1 to S12
  • References

Additional Data

Tables S1 to S22
Table S1. Annotations of oligos and corresponding transcripts on the songbird 44K microarray. Listed items include oligo ID (column A), Duke cDNA ID (B), ESTIMA ID (C), NCBI Accession # (D), ENSEMBL ID when available (E), the source (F) of the original DNA sequence (0 = Wada et al 2006 (24); 1 = Li et al 2007 (35); 2 = Replogle et al 2008 (68)), whether the RNA is coding [0] or non-coding [1] (G), and the oligo nucleotide sequence synthesized on the array (H). Also shown are the final gene symbols (I) and gene description (J) used for analyses. The evidence that led to that final symbol and description is summarized (K). All symbols from the various methods of annotation are listed as well (L). Evidence for the annotations is in columns M through V. o, oligo; p, pasa-defined annotations of clustered transcript sequence reads against the zebra finch genome. If a transcript representing a gene does not have a functionally identified name, then its pasa or clone ID alone was used as the gene name in columns I and J. The features (oligo and control spots on the array) appear in the order listed adjacent to column A (available online).

Table S2. List of genes used to validate baseline expression with in situ hybridizations. The expression values (columns G-J) are the normalized levels from the microarray experiment (SM6). The in situ hybridization results (columns K-N) are scored as true positive (TP), false positive (FP), false negative (FN), or true negative (TN). The sources of in situ hybridizations (column E) are Li et al (35), Lovell et al (36), Velho et al (102), George et al (77), Wada et al (24), Kubikova et al (34) or this study (available online).

Table S3. List of genes used to validate singing-regulated expression by in situ hybridization and RT-PCR. The table has five sections: 1) Gene Information; 2) Verification Summary; 3) Microarray Information; 4) In situ Hybridization Information; and 5) RT-PCR Information. Verifications are of three varieties: in situ hybridization only (39 transcripts, cyan colored), in situ hybridization and RT-PCR (4 transcripts, blue), and RT-PCR only (33 transcripts). The gene information includes the specific cDNA clone ID that was tested (column C) and the corresponding transcript variant (column D). The verification summary section brings together the in situ hybridization data and the RT-PCR data and was used to calculate the final true positive (TP, green) and false positive (FP, red) rates as in Fig. S2D. The table also shows false negative (FN, orange) and true negative (TN, light green) findings (columns F-I). These same classifications are used in the in situ hybridization (AB-AE) and the RT-PCR (T-W) sections. The values in the microarray section are FDR q-values, with significant differences highlighted in dark green for each song nucleus at a given time point relative to silent controls (K-N; time points in column R and Table S8). In both the in situ hybridization and the RT-PCR experiments, gene expression was often measured at more than one time point. The sources of in situ hybridizations (column S) are Li et al (25), Wada et al (24), Jarvis and Nottebohm (17) or this study (available online).

Table S4. Transcripts detected in song nuclei in this study. Listed are the 24,498 groups of expressed transcripts detected in song nuclei above the background spike-in controls in at least 12% of the microarray samples (i.e. n=4 or more animals). Unique Transcript ID (column A) is the name given to the group of transcripts from the same gene that are expressed similarly; the suffix s indicates a transcript variant. Other columns show group IDs (B), gene symbols (C), gene description (D), ENSEMBL ID (E), zebra finch chromosome (F), and the number of oligos that contributed to the group of transcripts (G) with their oligo IDs (H). For example, there are 8 oligos on the array that recognize transcripts from the glioblastoma amplified sequence (GBAS) gene, and these recognize transcript variants with 4 distinct patterns (GBAS-s1 to s4) (available online).

Table S5. Transcripts differentially expressed in song nuclei at baseline. Listed are the 5,167 transcripts detected as differentially expressed among song nuclei at baseline (non-singing) activity using a correlation and distance measure (SM8). The unique transcript ID (A) identifies the specific transcript, and the symbol (B) identifies the gene to which it maps (available online).

Table S6. Functional enrichment data for differentially expressed genes at baseline. Rows are color-coded according to baseline-region cluster membership (column A; as in Fig. 2A). The functional categories (B) are based on searches of gene categories from different sources, including gene ontology (C; source details in Table S7). The enriched sets of genes for each category are identified in column E. Rows in bold text are the categories that show the strongest enrichment for each regional cluster, based on number of genes (D), p-values based on hypergeometric tests (F and G), and percent of genes from that cluster showing such enrichment (H) (available online).
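The hypergeometric enrichment p-values reported in this and later tables can be illustrated with a minimal Python sketch. It assumes the standard one-sided (upper-tail) test of overlap between a cluster and a functional category; the function name, argument names, and the background definition are hypothetical and are not taken from the study's own pipeline.

```python
from math import comb

def hypergeom_enrichment_p(population, category, cluster, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population: number of background genes (assumed definition)
    category:   background genes annotated with the functional category
    cluster:    genes in the cluster being tested
    overlap:    cluster genes that carry the category annotation
    """
    max_overlap = min(category, cluster)
    total = comb(population, cluster)
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(
        comb(category, k) * comb(population - category, cluster - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Toy numbers only: 10 background genes, 5 in the category,
# a cluster of 5 genes, all 5 of which are in the category.
p = hypergeom_enrichment_p(10, 5, 5, 5)
```

A small p-value indicates that the observed overlap is larger than expected if cluster membership were independent of the category annotation.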

Table S7. Gene expression sets from prior studies used for comparative enrichment analyses. Listed are the names we have given to the gene sets, their description, numbers, whether the experiments were conducted in vivo or in cell culture, and the literature and PubMed ID source of the data (103-117) (available online).

Table S8. Transcripts differentially expressed in song nuclei after singing. Listed are the 2,740 transcripts detected as differentially expressed among song nuclei after singing. The unique transcript ID (column A) identifies the specific transcript, and the symbol (column B) identifies the gene to which it maps. The linear model FDR q-values are shown for each song nucleus (E-F), as well as the temporal cluster to which the transcripts belong for each nucleus (J-M). NA, not applicable; xReg, regulated by singing. The data are sorted from most to least significant for combinations of the four song nuclei. The core 20 transcripts detected as singing-regulated in all four song nuclei are highlighted in green, and the remainder of the core, detected in three song nuclei or on the border of detection in four, is in yellow (available online).

Table S9. Differential singing-regulation of alternative transcript variants. Listed are the differentially singing-regulated alternative transcript variants (s1, s2, s3, …) detected by oligos specific to those variants. These variants include those that are alternatively spliced, alternatively started, and alternatively polyadenylated. Of the 2,740 singing-regulated transcripts, 390 were differentially regulated alternative variants from 82 genes. For example, we detected UNC5A transcript variant s1 (UNC5A-s1) as regulated by singing in Area X, but transcript variant s2 (UNC5A-s2) as regulated in LMAN. Group ID is the ENSEMBL gene ID based on genome mapping; the singing-regulated brain region cluster is from the temporal clusters (available online).
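For readers working with the downloaded table, variant IDs such as UNC5A-s1 can be grouped by gene with a short Python sketch. It assumes that every variant ID follows the "SYMBOL-sN" pattern seen in the examples (UNC5A-s1, GBAS-s1); the function name and the placeholder ID "GENEX" are hypothetical, and IDs without a variant suffix are skipped.

```python
import re
from collections import defaultdict

# Assumed ID format: gene symbol, then "-s", then the variant number.
VARIANT_ID = re.compile(r"^(?P<gene>.+)-s(?P<variant>\d+)$")

def group_variants(transcript_ids):
    """Map each gene symbol to the sorted variant numbers found for it."""
    genes = defaultdict(list)
    for transcript_id in transcript_ids:
        match = VARIANT_ID.match(transcript_id)
        if match:  # IDs without a "-sN" suffix are ignored
            genes[match.group("gene")].append(int(match.group("variant")))
    return {gene: sorted(variants) for gene, variants in genes.items()}

# "GENEX" is a made-up ID with no variant suffix, included to show skipping.
variants_by_gene = group_variants(["UNC5A-s2", "UNC5A-s1", "GBAS-s4", "GENEX"])
```

Counting the keys of the returned dictionary gives the number of genes with at least one listed variant.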

Table S10. Functional enrichment data for singing-regulated genes. (A) Enrichment in regional singing-regulated clusters of transcripts. (B) Enrichment in temporal singing-regulated clusters of transcripts. Rows are colored according to cluster membership (column A). The functional categories (column B) are based on searches of gene categories from different sources, including gene ontology (column C; source details in Table S7). The enriched sets of genes for each category are in column E. The p-values (columns F and G) are based on hypergeometric tests. The % of list (column H) is the percent of transcripts relative to the total number of regulated transcripts for a given region or temporal cluster (available online).

Table S11. Proportions of transcripts among temporal singing-regulated clusters. For each of the 20 temporal clusters (column A), the table shows its region specificity (columns D-H). Cluster size (C) is the number of transcripts that make up that temporal cluster. The percentage for each song nucleus (D-G) is the percentage of transcripts in each temporal cluster that come from that song nucleus. We treated the percentages of transcripts across the song nuclei as a vector. That vector was compared against a vector representing every combination of regions using Euclidean distance to determine the regions enriched (H). For example, the vector for the tan cluster (0.43, 0.24, 0.2, 0.13) was closest to the vector representing all regions (0.25, 0.25, 0.25, 0.25) as opposed to the vector for Area X (1, 0, 0, 0). As a result, the five clusters that show a strong representation in all regions have at least 10% of their transcripts from each region (available online).
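The vector comparison described for Table S11 can be sketched in Python as follows. The sketch assumes that each combination of regions is represented by a vector with equal weight on its member regions and zero elsewhere, which matches the two examples given (all regions and Area X alone) but is not otherwise confirmed by the legend. The region order after Area X and the function name are also assumptions.

```python
from itertools import combinations
from math import dist

# Area X first, as in the legend's (1, 0, 0, 0) example; the order of the
# other three song nuclei is an assumption.
REGIONS = ("Area X", "HVC", "LMAN", "RA")

def closest_region_set(proportions):
    """Return the region combination whose template vector is nearest
    (Euclidean distance) to the cluster's vector of proportions."""
    n = len(REGIONS)
    best = None
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            # Assumed template: equal weight on each region in the subset.
            template = [1 / size if i in subset else 0.0 for i in range(n)]
            distance = dist(proportions, template)
            if best is None or distance < best[0]:
                best = (distance, tuple(REGIONS[i] for i in subset))
    return best[1]

# Tan cluster from the legend: nearest to the all-regions template.
tan_regions = closest_region_set((0.43, 0.24, 0.2, 0.13))
```

Under these assumptions the tan cluster is assigned to all four regions, in agreement with the example in the legend.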

Table S12. Statistical results for hypergeometric tests of overlap of transcripts from the baseline region-enriched, singing region-enriched, and singing temporal-enriched clusters. (A) Correlations between baseline region-, singing region-, and singing temporal-enriched clusters. (B) Correlations between baseline region- and singing region-enriched clusters. (C) Correlations between singing region- and singing temporal-enriched clusters. (D) p-values for baseline region- and singing region-enriched correlations. (E) p-values for baseline region- and singing temporal-enriched correlations. Red text denotes cases in which the baseline region pattern is correlated with the region-specific pattern from the same song nuclei or a subgroup of them (available online).

Table S13. Transcription factor motif enrichment data for temporal patterns of singing-regulated gene clusters. Each row represents an association between a temporal singing-regulated cluster and a transcription factor motif (column D) from the database in column C (Table S14). Listed is the number of genes in the cluster for which an ENSEMBL annotation was found (column E) as well as the number of those genes that were identified as having the motif overrepresented in their non-coding regulatory region (column F) by a given method (column B). The significance of the association was quantified with four statistics: simulation p-value, hypergeometric p-value, locus length aware hypergeometric test p-value, and hypergeometric value (columns G-J) (available online).

Table S14. TRANSFAC and CUSTOM scanned binding motifs. (A) Listed are 118 binding motif names from the TRANSFAC database that corresponded to transcription factors that were differentially expressed in song nuclei at baseline or during singing. (B) 19 TRANSFAC motifs of TFs we hypothesized to be regulated by neural activity or plasticity. (C) 101 motifs used from the JASPAR database (available online).

Table S15. Enriched motifs in temporal clusters of behaviorally regulated transcripts expanded to individual transcripts. Listed are the transcription factor (column A) to target gene (column B) relationships predicted by the transcription factor binding site scans that were supported with enrichments between the TF's motif target gene set and the target gene's temporal singing-regulated cluster (column C) as shown in Fig. S3. The specific brain region enriched expression of the target gene (column D) is also listed (available online).

Table S16. Top 100 genes most affected by CaRF knockdown. Listed are the top 100 transcripts identified as most differentially expressed between all samples with the scrambled control versus CaRF knockdown in mouse cultured cortical neurons. (A) The subgroup ID identifies the specific transcript and the symbol. This is a separate set from the zebra finch IDs. (B) Corresponding Affymetrix probe ID. (C-E) Gene annotations. (F-I) Statistical values calculated for each transcript (available online).

Table S17. Pathway enrichment and gene ontology analysis of CaRF affected genes. (A) Gene sets built from MSigDB pathways were compared using GSEA (53) to the ranked list of genes affected by CaRF knockdown in the absence of membrane depolarization. Genes were ranked by signal-to-noise ratios using the GSEA default. Listed are the name of the enriched pathway, the highest enrichment score, the nominal p-value, false discovery rate (FDR), family-wise error rate (FWER), and size or number of genes found in the set. Thresholds for inclusion were p < 0.05 and q < 0.25. (B) Results of a gene ontology analysis (90) of the top 250 genes affected by CaRF knockdown in the absence of membrane depolarization, also ranked by GSEA according to signal-to-noise ratio (p < 0.05) (available online).
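The signal-to-noise ranking mentioned for Table S17 can be illustrated with a simplified Python sketch. It uses the basic form of the metric, the difference of the two group means divided by the sum of the two group standard deviations, and omits the adjustments that GSEA applies to very small standard deviations. The function names, the toy gene labels, and the assignment of groups are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(group_a, group_b):
    """Basic signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b).

    Simplified: no floor is applied to small standard deviations.
    """
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))

def rank_genes(expression):
    """Order genes from highest to lowest signal-to-noise ratio.

    expression: {gene: (values_in_group_a, values_in_group_b)}
    """
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy values only; "g1" and "g2" are placeholder gene labels.
ranking = rank_genes({
    "g1": ([5, 6, 7], [1, 2, 3]),
    "g2": ([1, 2, 3], [5, 6, 7]),
})
```

Genes at the top of the ranking are higher in the first group, and genes at the bottom are higher in the second group.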

Table S18. Top 100 genes whose activity-dependence is most affected by CaRF knockdown. Listed are the top 100 transcripts identified as most differentially regulated by KCl membrane depolarization between mouse cultured cortical neurons infected with the scrambled control virus and those infected with the CaRF knockdown virus. (A) The subgroup ID identifies the specific transcript and the symbol. This is a separate set from the zebra finch IDs. (B) Corresponding Affymetrix probe ID. (C-E) Gene annotations. (F-H) Statistical values calculated for each transcript (available online).

Table S19. Membrane depolarization- and CaRF-regulated genes that overlapped with singing-regulated genes. Listed are 55 genes that were regulated by singing and enriched in zebra finch Area X and HVC, and that showed membrane depolarization and CaRF regulation in cultured cortical mouse neurons. Of these, 9 have a putative CaRF binding site in the zebra finch genome (#1 in column D). P-values are FDR-adjusted (available online).

Table S20. Quality control and sample annotation for H3K27ac ChIP-Seq. Each row corresponds to a different sample taken from the RA or Area X regions in silent and singing birds. Input comes from DNA with the H3K27ac antibody (DNA) or from whole cell extract (WCE, Column E). Highlighted green values represent quality control measures within a reasonable level of tolerance while red values represent low quality (available online).

Table S21. H3K27ac near genes that are differentially expressed across brain regions at baseline. (A-D) Listed are 3,397 transcripts that are differentially expressed at baseline and have at least one H3K27ac peak that maps to them. (E-H) The expression t-value, p-value, adjusted p-value, and log-fold change for the expression in RA relative to Area X. (I) The "expGrp" classifies the genes as Area X enriched, RA enriched at p < 0.01, or neither. (J) Mean log-fold difference between RA and Area X for all peaks that map to the gene corresponding to that transcript. (K, L) The most significant peak mapping to that transcript and its log-fold difference and category in Area X vs. RA (available online).

Table S22. H3K27ac near genes that are differentially induced across brain regions during singing. (A-D) Listed are 346 transcripts that are classified as late-response singing-regulated genes and have at least one H3K27ac peak that maps to them. (E-H) The expression t-value, p-value, adjusted p-value, and log-fold change are shown for the match to the identified LRG profile in RA relative to Area X. (I) The "expGrp" classifies the genes as Area X enriched, RA enriched, or enriched in neither region during song production (p < 0.05). (J) The mean log-fold difference between RA and Area X for all peaks that map to the gene corresponding to that transcript. (K, L) For the most significant peak mapping to that transcript, listed are the log-fold difference in Area X vs. RA and the peak category. (M) The expression category of the gene at baseline, where "Ax" is enriched in Area X at baseline, "Ra" is enriched in RA at baseline, "diff" refers to genes that are differentially regulated in one or both of the nidopallial song nuclei, HVC and LMAN, and "none" refers to transcripts that are not detected as differentially expressed at baseline. Highlighted in blue are transcripts that are not differentially expressed at baseline but are singing-regulated in Area X and have an H3K27ac peak in their genes at baseline. Highlighted in red is the converse relationship for RA (available online).