In Silico Genetics: Identification of a Functional Element Regulating H2-Eα Gene Expression

See allHide authors and affiliations

Science  22 Oct 2004:
Vol. 306, Issue 5696, pp. 690-695
DOI: 10.1126/science.1100636


Computational tools can markedly accelerate the rate at which murine genetic models can be analyzed. We developed a computational method for mapping phenotypic traits that vary among inbred strains onto haplotypic blocks. This method correctly predicted the genetic basis for strain-specific differences in several biologically important traits. It was also used to identify an allele-specific functional genomic element regulating H2-Eα gene expression. This functional element, which contained the binding sites for YY1 and a second transcription factor that is probably serum response factor, is located within the first intron of the H2-Eα gene. This computational method will greatly improve our ability to identify the genetic basis for a variety of phenotypic traits, ranging from qualitative trait information to quantitative gene expression data, which vary among inbred mouse strains.

Commonly available inbred mouse strains can be used to genetically model many traits that vary in the human population, including those associated with disease susceptibility. We have previously shown that chromosomal regions regulating phenotypic traits could be computationally predicted by correlative analysis of phenotypic data obtained from inbred mouse strains and the extent of allele sharing within genomic regions (1). However, this computational method had several limitations. It identified 60-Mb chromosomal regions that contained hundreds of genes, and the predictions were assessed by relative (percentile ranking), rather than absolute (P-value), statistical criteria.

Single-nucleotide polymorphism (SNP) and allelic information in our mouse SNP database (2) were used to produce a highresolution map of the haplotypic block structure of the mouse genome covering 16 commonly used inbred mouse strains (3). This haplotype map was used to develop a computational method for genetic analysis of phenotypic traits in mice (4). Haplotypic blocks that best correlate with phenotypic trait data obtained from inbred strains were identified. As a first example, we tested the ability of the haplotype-based computational mapping method to predict the chromosomal location of the K locus of the major histocompatibility complex (MHC) located on murine chromosome 17 (∼33 Mb). The known properties of the MHC K locus of 16 inbred strains were used as input phenotypic data for this analysis (5) (Table 1). The categorical K locus data (b, k, d, u, v) were transformed into points in multidimensional metric space, such that each K locus category was equidistant from any other category, and then computationally analyzed for correlation with the haplotypic blocks (Fig. 1A). One haplotypic block best correlated with the phenotypic data. The predicted haplotypic block was on chromosome 17 [32.778 to 32.805 Mb, NCBI (National Center for Biotechnology Information) Build 30] and contained five genes (H2-Dma, Psmb9, Tap1, Psmb8, and Tap2). These genes are known to determine the MHC class I and class II phenotypes (6, 7). Consistent with this prediction, the K, Aβ, Aα, and Eβ phenotypes were identical among the 16 mouse strains tested (Table 1). The same computational prediction profile was obtained when any of the other class II phenotypes were used as input phenotypic data (computational profile identical to that in Fig. 1A).

Fig. 1.

Computational mapping of MHC phenotypic data onto haplotypic blocks. The correlation for each haplotypic block is represented as a bar, and the haplotypic blocks are arranged from proximal (centromeric) to distal (telomeric) for all 20 chromosomes. The variable width of each chromosome indicates the number of haplotypic blocks on that chromosome. The height of each bar is the inverse of the score measuring the correlation between the phenotypic data and the haplotypic block. The insets show predictions for haplotypic blocks on chromosome 17 for each phenotype. The relative location of the MHC genes on chromosome 17 is shown in the block diagram at the top of the figure. (A) Computational prediction of the location of the MHC class I-K locus. The MHC class I-K phenotypic data were analyzed for correlation with strain groupings within haplotypic blocks. The predicted haplotypic block was located in the MHC region on chromosome 17 at 33.78 to 32.81 Mb and contained both the MHC K and class II genes. (B) Computational analysis of MHC class III S locus phenotypic data. The predicted haplotypic block is located within the MHC region and contains the S-locus gene. (C) Computational analysis of the MHC class Ib Qa2 locus phenotypic data. The three predicted haplotypic blocks are all located on chromosome 17 within the region 33.32 to 33.82 Mb. They contain the Qa2 locus.

Table 1.

Phenotypic data used for computational haplotype-based genetic analysis. The MHC class Ia, Ib, II, and III phenotypes; hepatic AH response; and pulmonary H2-Eα gene expression data are shown. n/a, not available.

H2 Haplotypes
Strain Class Ia Class II Class III Class Ib H2-Eα gene expression AH response
K Aβ Aα Eβ S Qa-2
129/SvJ b b Null b a 5 -
A/J k k k d a 42 +
A/HeJ k k k d a 23 +
AKR/J k k k k b 44 -
B10.D2-H2/oSnJ d d d d b 1055 +
BALB/cJ d d d d b 1063 +
BALB/cByJ d d d d b n/a n/a
C3H/HeJ k k k k b 36 +
C57/6J b b Null b a 8 +
DBA/2J d d d d b 1343 -
LG/J d d d d b n/a n/a
LP/J b b Null b a n/a -
MRL/MpJ k k k k b 38 n/a
NZB/BinJ d d d d b n/a -
NZW/LacJ u u u z b n/a -
SM/J v v v v a n/a +

We next determined if the computational method could correctly identify haplotypic blocks associated with two other MHC phenotypes, the MHC class III S locus and the class Ib Qa2 locus (5). In both cases, the computationally identified haplotypic block with the highest correlation contained the MHC gene that corresponded with the phenotype analyzed (Fig. 1, B and C, and Table 2). The correct identification of Qa2 locus by the computational method was particularly encouraging, because there were only two different categorical phenotypes (a and b) at this locus (Table 1). This analysis of four MHC phenotypes demonstrates that the computational method can analyze input phenotypic data and accurately discriminate among colinear, neighboring haplotypic blocks in the mouse genome.

Table 2.

Haplotype-based computational predictions generated for five phenotypic data sets. For each phenotype, the location of the haplotypic block, which contains the gene determining the phenotype, along with its computational score are shown. The score for each haplotypic block was calculated as a function of the correlation between the phenotypic data and strain grouping within the haplotypic block. A smaller score indicates a better match, and 0 indicates a perfect match.

Phenotype Predicted locus Score Chr. MB
H2 K H2-K 0 17 32.78-32.81
H2 A α/β, E α/β H2-A α/β, H2-E α/β 0 17 32.94-32.94
H2 S H2-S 0.16 17 33.26-33.32
H2 Qa-2 H2-Qa2 0 17 33.32-33.42
H2-Eα gene expression H2-Eα 0.11 17 32.57-33.26
AH response Ahr 0 12 29.66-29.69

We next tested the ability of the computational mapping method to identify genetic loci regulating the aromatic hydrocarbon (AH) response among inbred mouse strains. Genetic variation in this metabolic response pathway can affect the response to drugs and environmental toxins and is also likely to affect cancer susceptibility (8, 9). The aromatic hydrocarbon receptor (AHR) is the ligand-binding component of the intracellular protein complex that regulates this response. After ligand binding, this receptor translocates to the nucleus where it forms a heterodimeric protein complex that stimulates the expression of genes involved in drug and xenobiotic metabolism. The AH response was previously characterized by measuring the amount of hepatic aromatic hydrocarbon hydroxylase activity induced after intraperitoneal injection of 3-methylcholanthrene in 42 inbred mouse strains (10). Thirteen of these strains were characterized in our mouse SNP database (2). The AKR/J, 129/SvlmJ, LP/J, NZB/BinJ, NZW/LacJ, and DBA/2J strains were AH nonresponsive, whereas the A/J, A/HeJ, C57BL/6J, BALB/cJ, B10.D2-H2/oSnJ, SM/J, and C3H/HeJ strains were AH responsive. The AH response phenotypes were treated as a qualitative binary phenotype (either + or –) for the computational analysis. The pattern of genetic variation within only two haplotype blocks exactly matched the pattern of AH responsiveness (calculated score = 0) among the inbred strains (Fig. 2, A and B). Within these two blocks, all six nonresponsive strains shared one haplotype, whereas the seven responder strains had one of two other haplotypes. Both of these haplotypic blocks were within the Ahr locus on mouse chromosome 12 (29.7 Mb). None of the other haplotypic blocks exactly matched the AH response pattern among the inbred strains. Furthermore, because the AH phenotype was measured in liver tissue, identification of the computationally predicted genes that were expressed in liver would also help to identify the causative genetic locus. Analysis of gene expression data indicated that only three of the nine haplotypic blocks whose strain groupings were best correlated with the phenotypic response were within genes (Ahr and Dnpep) expressed in liver tissue (Fig. 2B). In summary, haplotypic blocks within the Ahr locus were the only blocks having a perfect correlation with AH response phenotype, and the Ahr gene was expressed in liver tissue. Therefore, Ahr was the computationally predicted gene candidate selected for further analysis.

Fig. 2.

Genetic regulation of aromatic hydrocarbon responsiveness. (A) Aromatic hydrocarbon responsiveness of 13 different mouse strains was used as input phenotypic data for computational mapping onto haplotypic blocks. Two haplotypic blocks had a perfect correlation with the AHR response phenotype, and both were within the Ahr locus on chromosome 12 (indicated by the arrow). (B) The AH response phenotype for the 13 strains and a list of haplotypic blocks with the best correlation with the pattern of AH response among the inbred strains are shown. For each predicted block, the calculated score for partitioning of the phenotypic data, chromosomal location, and the number of SNPs within a block and its gene symbol are shown. The haplotypes for each strain are represented by colored blocks that are presented in the same order as shown for the strain phenotypes. Strains sharing the same haplotype have the same colored block. A calculated partition score of 0 indicates an exact match between the distribution of phenotypic responses and haplotypic groupings among the inbred strains. An asterisk (*) indicates that the gene was not expressed in liver tissue. (C) Three haplotypes within the Ahr locus on chromosome 12 (29.7 Mb). Of 211 SNPs within the Ahr locus, the seven exonic SNPs (labeled as a to g) that change an amino acid are shown. (D) The relative location of amino acid changes in the AHR protein for the three haplotypic groups. Four of the amino acid changes (c, d, e, and g) distinguish group I from the other two haplotypic groups, including an amino acid change (g) that converts a stop codon in the group I protein to an Arg in group II and III. Three other amino acid changes (a, b, and f) differentiate the group II and group III proteins. However, only a single SNP (b), which converts an Ala (group I and II) to Val (group III) and is located within the PAC domain, distinguishes the responder and nonresponder strains.

Analysis of genetic variation within the Ahr locus identified the precise molecular basis for the differential AH response among inbred strains. The 211 SNPs within the Ahr locus (2) divided the inbred mouse strains into three haplotypic groups (Fig. 2C). The AH-responsive strains were in haplotypic groups I and II, whereas the nonresponsive strains were in group III. Sixteen SNPs were located in exons, and 7 (labeled a to g in Fig. 2, C and D) altered the amino acid sequence of the encoded protein. One polymorphism converted a stop codon in the group I strains (B10 and C57BL/6) to an Arg in all other strains, adding 43 C-terminal amino acids to the AHR protein expressed by the group II and III strains. Three amino acid changes differentiated the group II from the group III strains. Although these polymorphisms induced amino acid changes and were previously thought to be of importance to this phenotype (11), their strain distribution indicated that they were not responsible for differences in AH response phenotype. The precision of these computational predictions is demonstrated by the fact that only one SNP distinguished the AH responsive and nonresponsive strains. This polymorphism (labeled b in Fig. 2, C and D) converted Ala444 in the AH-responsive (group I and II) strains to Val444 in the AH-unresponsive (group III) strains. This SNP was located within a PAS Associated C-terminal (PAC) motif that contributes to the folding of the ligand binding and dimerization region of this protein (12, 13). Consistent with the prediction of this computational genetic method, an Ala444 → Val substitution in an expressed recombinant AHR protein shifted the in vitro AH response from a responder to a nonresponder state (11, 14).

Gene expression profiles across inbred mouse strains provide a useful intermediate phenotype that can be analyzed to understand how phenotypic traits are genetically regulated. In the same manner as phenotypic trait information, strain-specific gene expression data can be computationally mapped onto haplotypic blocks to identify genetic loci that potentially regulate differential gene expression. For example, the level of H2-Eα gene expression in mouse lung, which is an MHC class II gene, differed by more than 20-fold across 10 inbred mouse strains examined using oligonucleotide microarrays (Fig. 3). Logarithm-transformed H2-Eα gene expression data were computationally analyzed using our haplotype-based computational mapping method. The haplotypic block within the first intron of the H2-Eα gene had the strongest correlation with H2-Eα gene expression. A Bonferroni-adjusted P-value of 0.001 for this prediction (table S1) supported the possibility that a cis-acting functional element within the H2-Eα gene could regulate its basal expression.

Fig. 3.

Cis-genetic regulation of H2-Eα gene expression in mouse lung. (A) The level of pulmonary H2-Eα gene expression is shown as the natural logarithm of the average of three independent measurements for each of 10 mouse strains examined. (B) Computational mapping of H2-Eα gene expression data onto haplotypic blocks. The haplotypic block with the highest prediction contained the H2-Eα gene.

To identify the cis-acting element, we analyzed 23 SNPs within this 1.0-kb haplotypic block (Fig. 4A). The nucleotide sequence within this haplotypic block was scanned to identify known recognition elements for transcription factors in the TRANSFAC database (15). Then, combinations of SNPs that were located within or near transcription factor recognition elements and that distinguished haplotypes within this block were selected. By this method, three haplotype-determining SNPs within this block, all located within a 30–base pair (bp) sequence at nucleotide positions +974, +975, and +999 downstream from the H2-Eα transcription start site, were selected for analysis. We examined the effect of these polymorphisms on reporter gene expression in a B lymphoma cell line. Two constructs with insertion of the H2-Eα promoter, first exon segments, and first intron segments were prepared (Fig. 4B). Construct II has the 10- to 1029-bp region of H2-Eα inserted, with the intronic region between positions +475 and +925 deleted. Genomic segments for construct II were amplified from three different inbred strains; each strain had a different haplotype resulting from the three polymorphisms discussed above. The intronic region containing these three polymorphic sites is not contained in construct I. Though similar to construct II, construct I contains only the 10- to 957-bp region of H2-Eα, with the intronic region between positions +475 and +925 deleted. The BALB/cJ haplotype in the first intron sequence resulted in a threefold enhancement of reporter gene expression, relative to plasmids containing the same regions amplified from the other two haplotypes (Fig. 4C). In fact, the AKR- and C57/BL6-derived intronic sequences did not enhance expression of construct II plasmids relative to the H2-Eα promoter and exonic sequence present in construct I. These results demonstrate that an allele-specific regulatory element is located between nucleotides +958 and +1029 in the first intron of the H2-Eα gene.

Fig. 4.

(A) The relative location of the haplotypic block containing a cis-acting functional genomic element in the H2-Eα gene. The filled boxes are exons and the exon number is indicated. The bold line below indicates the position of the computationally identified haplotypic block. The sequence of the oligonucleotides used in the gel-shift experiments are shown, and alleles at the three polymorphic sites among the inbred mouse strains are indicated in brackets. The YY1 and a possible SRF recognition element are underlined. The conversion of the CA to TG alleles eliminates the YY1 binding site. The CA..G allele–containing haplotype has the YY1 and SRF binding sites and also had high H2-Eα gene expression. (B) H2-Eα gene structure and reporter gene constructs. Construct II contains the 10- to 1029-bp region of H2-Eα, with a deletion of the 475- to 925-bp region. The shorter construct (I) contains the 10- to 925-bp region of H2-Eα, with deletion of the region between 475 and 925 bp. The yellow arrow indicates the location of the polymorphic regulatory element containing YY1 and SRF binding sites. (C) The haplotype-specific effect of the H2-Eα cis-acting element on reporter gene expression. The level of luciferase reporter gene expression in the A20 B lymphoma cell line after transfection of the indicated plasmids is shown. Luciferase assays were performed using two different construct I plasmids, which had genomic DNA obtained from either the AKR or Balb/cJ strains. Three construct II plasmids were prepared, and each had genomic DNA obtained from AKR, Balb/c, or 129 mice. These three strains represent each of the three different haplotypes identified within the first intron of the H2-Eα gene. The luciferase expression level was normalized for transfection efficiency by cotransfection with a β-galactosidase plasmid. (D) Allele-specific formation of protein-DNA complexes requires a YY1 recognition element. Three different 32P-labeled oligonucleotides, each with different alleles at three polymorphic sites, were incubated with nuclear extracts from lung tissue obtained from AKR (lane A), Balb/c (lane B), or C57 (lane C) mice. The arrowheads indicate the protein-DNA complexes. (E) YY1 is part of the allele-specific protein-DNA complex. EMSAs were performed using a 32P-labeled 45-bp CA..G oligonucleotide, containing a YY1 recognition element, and nuclear extracts of lung tissue were prepared as in (D). The extracts were preincubated with an antibody to YY1. The filled arrow indicates the position of the YY1 DNA complex, and the open arrow indicates the position of this complex after incubation with the antibody to YY1.

To further characterize the regulatory element within this intronic region, we performed electrophoretic mobility–shift assay (EMSA) experiments using three 45-bp oligonucleotides. Each oligonucleotide had a different SNP allele combination that corresponded to one of the three H2-Eα haplotypes at positions +974, +975, and +999 (Fig. 4A). The EMSA experiments demonstrated that DNA-protein complexes were formed after incubation of two different (CA...T and CA...G) oligonucleotides with nuclear extracts prepared from lung tissue. In contrast, the (TG...G) oligonucleotide did not form detectable protein complexes (Fig. 4D). The adjacent SNPs (CA/TG) were located within a Yin Yang 1 (YY1) transcription factor binding site, whereas the third SNP (T/G) was within a serum response factor (SRF) binding site. The CA-to-TG conversion eliminated the YY1 binding site, whereas the G-to-T change converts an SRF binding site to a potential recognition element for another Activator Protein 1 (AP1) transcription factor. YY1 binding to the CA-allele oligonucleotide was confirmed by “supershift” experiments performed by preincubation of the nuclear lysate proteins with an antibody to YY1. The mobility of the upper band was altered in the presence of the antibody, indicating that it was a complex of YY1 bound to the oligonucleotide (Fig. 4E). YY1 has been shown to be required for association of SRF with its serum response element (SRE) (16). Of note, only mouse strains with the CA...G haplotype exhibited high H2-Eα gene expression in the lung. Taken together, the EMSA, reporter gene, and gene expression results indicate that a 45-bp functional element within the H2-Eα gene locus regulates its basal expression level in an allele-specific manner. Within this functional element, alleles that enable the binding of YY1 and a second transcription factor, which is probably SRF, are required for high H2-Eα gene expression in the lung.

These examples demonstrate that a variety of genetically regulated phenotypic traits, ranging from qualitative trait information to quantitative gene expression data that vary among inbred strains, can be computationally analyzed by this method. The haplotypic blocks exhibiting a strong correlation with phenotypic data provide a valuable starting point that enables pathway analysis or mechanistic studies to be performed. As shown in the analysis of the H2-Eα gene expression and AH response data, a very precise hypothesis about the genetic basis for the phenotypic difference among mouse strains can be generated using this computational method. In these examples, the SNPs responsible for the phenotypic differences were predicted.

A number of factors affect the performance of this method. As shown here, it performs extremely well when the phenotypic data reflect the genetic variation within a haplotypic block in our SNP database. However, if haplotypic information for a key genetic locus or phenotypic information for a sufficient number of strains is not available, its performance is substantially reduced. When phenotypic data for only a few inbred strains are available, this method usually produces a large number of top predictions with comparable scores, and most are false-positives. We have empirically determined that phenotypic data from eight or more strains are needed to produce statistically significant predictions (table S1). The statistical power and SNP coverage issues will diminish as the numbers of inbred strains and SNPs in the database increase. A haplotypic map with more than 300,000 SNPs covering between 40 and 50 commonly used inbred mouse strains should be available within 2 years. At present, it is unlikely that this computational method can analyze traits controlled by multiple genetic loci, each with a small effect size. However, coverage of additional inbred strains should enable this computational method to have substantial power to identify genetic loci for a wider range of phenotypic traits, including those with increased underlying genetic complexity, than is currently possible.

Supporting Online Material

Materials and Methods

Table S1


References and Notes

View Abstract

Navigate This Article