Founder Effects in the Assessment of HIV Polymorphisms and HLA Allele Associations


Science  16 Mar 2007:
Vol. 315, Issue 5818, pp. 1583-1586
DOI: 10.1126/science.1131528


Escape from T cell–mediated immune responses affects the ongoing evolution of rapidly evolving viruses such as HIV. By applying statistical approaches that account for phylogenetic relationships among viral sequences, we show that viral lineage effects rather than immune escape often explain apparent human leukocyte antigen (HLA)–mediated immune-escape mutations defined by older analysis methods. Phylogenetically informed methods identified immune-susceptible locations with greatly improved accuracy, and the associations we identified with these methods were experimentally validated. This approach has practical implications for understanding the impact of host immunity on pathogen evolution and for defining relevant variants for inclusion in vaccine antigens.

HIV escapes the immune system of the human host by mutation, and thus mutational patterns may identify regions important for host immune recognition. Cytotoxic T lymphocytes (CTLs), part of the adaptive immune response, recognize short peptide fragments called epitopes, cleaved from viral proteins and presented on the surface of infected cells by human leukocyte antigens (HLAs). Rapid emergence of sequence variation within some HIV epitopes provides clear evidence for host-driven immune selection during infection (1–4). HLAs are highly polymorphic and can only present viral epitopes that have the appropriate amino acid composition to enable binding, and so people carrying the same HLA allele have a shared potential to recognize the same epitopes. The viral sequence is, therefore, expected to show patterns of mutations that correlate with the host's HLAs. A population study in Perth, Australia (5), indicated that 43% of positions in HIV Pol protein had polymorphisms potentially associated with HLA A and B alleles.

Although the Moore et al. study used polymorphisms elsewhere in the HIV protein as well as HLA alleles as explanatory variables in the multiple regression analysis, it did not explicitly control for viral lineage founder effects: Relatively closely related sets of viruses among contemporary viral sequences share some amino acids simply by virtue of common descent and should not be treated as independent. We have now developed a methodology to account for these genetic relationships. Newer sequence data from Perth (6), consisting of intermittent fragments spanning different genomic regions of HIV-1 from 234 individuals along with high-resolution HLA genotyping results, were used to explore the importance of lineage effects (6, 7). Because reliable phylogenetic analysis requires long continuous stretches of sequence data, we restricted our analyses to the longest continuous sequence length available from a reasonably large set of subjects, a 1732–base pair (bp) region starting in p17 gag and ending in pol from 96 sequences (82 subtype B, 2 D, 8 C, and 4 circulating recombinant form CRF01 sequences).

We first analyzed all 234 available sequences over this 1732-bp stretch using methods similar to those of (5) and including HLA class II alleles, so that we could compare our methods directly to those previously employed. We found 346 associations between host HLA and HIV mutations, with an uncorrected P-value of <0.05. Correcting for multiple tests, 80 out of 346 associations had a q-value (8) of <0.2 (that is, an estimated 80% of these are true positive correlations). Of these 80 associations, 32 were with HLA C*1701, and viral subtyping of the sequences bearing the associated substitutions revealed that all were subtype C virus, in this primarily subtype B cohort. Importantly, subtype C is prevalent in southern Africa, and HLA C*1701 is common only in Africa (9). Thus, any substitutions distinguishing the C and B subtypes would appear to be associated with C*1701. Sixty out of 80 associations could similarly be explained by an association of the HLA class I allele with the subtype of the virus (Table 1, Fig. 1A) and, consequently, most likely explained by the demographic and geographical structure of the HIV epidemic rather than immune pressure.
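The multiple-testing step above can be illustrated with a short sketch. The q-values in the text come from the method cited as (8); the code below instead uses the simpler Benjamini-Hochberg step-up adjustment as a stand-in, and the function name `bh_adjust` and the P-values are hypothetical toy inputs, not data from this study.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order.

    Assumption: this is a stand-in for the q-value method cited as (8),
    not the procedure used in the study.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P-values (made up for illustration only)
toy_p = [0.001, 0.008, 0.039, 0.041, 0.60]
toy_q = bh_adjust(toy_p)

# Keep the tests whose adjusted value falls below the 0.2 cutoff
significant = [p for p, q in zip(toy_p, toy_q) if q < 0.2]
print(toy_q)
print(significant)
```

With a cutoff of 0.2, roughly one in five of the retained associations is expected to be a false positive, which is the sense in which the text describes an estimated 80% as true positives.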

Fig. 1.

Phylogenetic trees illustrating associations due to subtypes (A), lineages within the subtypes (B), and HLA-driven escape (C). Maximum likelihood is used to infer the tree and the sequences at internal nodes. The probability of the amino acid studied is indicated by the numbers 0 through 9 or the letter X: 0 represents < 0.05, 1 between 0.05 and 0.15, etc., and X > 0.95. The color of the symbol indicates the most likely amino acid. Subjects with the HLA under consideration are followed by a magenta line; those without the HLA, a dark gray line; and controls with unknown HLA, a light gray line. (A) The first contingency table shows that leucine and C*1701 are significantly associated if the tree is not considered, but the second table shows that counting only the changes from valine in the immediate parent in the tree to a nonvaline state at the leaf abrogates the signal. (B) Clusters resulting from a few ancestral mutational events, rather than many independent events, drive a false association with HLA A*0301 when the tree is not taken into account. (C) Tree-based analysis supports a causative influence of HLA: Presumed escape mutations from asparagine to other amino acids are independent events arising specifically when the HLA is present. Abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

Table 1.

Summary of the 60 HIV mutation-host HLA associations driven by different subtypes. The number of associated sequence positions for each HLA is given in the “No. of associations” column. Populations with high frequencies of a given HLA are noted (14–16) and are as expected given the subtypes: CRF01 is common in Asia, C in Africa, and B among Europeans, North Americans, and Australians. The HIV molecular immunology database was searched for HLA-related epitopes that span the site of interest; the only one found was A2 epitope TLQEQIGW (17). Associations embedded in potential epitopes 8 to 12 amino acids long based on HLA anchor motifs are also noted. Trees used for reanalysis of all 80 HLA–base pair associations can be found at

HLA | Subtype/CRF | HLA-enriched populations | No. of associations | No. of known epitopes | No. of potential epitopes | Motifs
C*1701 | C | African | 32 | 0 | 4 | .A......L
DRB1*1401 | CRF01 | Oriental | 18 | 0 | - | Unknown
C*0102 | CRF01 | Oriental | 1 | 0 | 0 | ..AL......L
C*0702 | B | Caucasoid | 1 | 0 | 0 | .........YFL
A*0201 | B | Caucasoid | 1 | 1 | 1 | .LM......VL
A*0207 | CRF01 | Oriental | 3 | 0 | 1 | .L......L
C*0701 | B | Caucasoid | 2 | 0 | 0 | .RHK......Y
Total | | | 60 | 1 | 7 |

Although confounding effects due to mixed subtypes are relatively easy to detect, viral lineage effects that give rise to within-subtype phylogenetic clusters are also relevant. We therefore devised two strategies to account for the fact that viral sequences related by phylogeny are not statistically independent samples. First, a maximum-likelihood phylogeny was used to infer the mutations that lead to the observed viral sequences, and these inferred mutations were tested for correlation with the HLA of the host carrying the sequence (Fig. 1). We also used a likelihood-ratio test to locate sequence positions where postulating HLA pressure, in addition to the inferred phylogenetic structure, better explains the data. These two statistics yielded similar predictions (Table 2), and both support lineage founder effects, rather than HLA-mediated immune escape, for all 60 associations in Table 1.
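The first strategy, and the two contingency tables described for Fig. 1A, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the tuple layout `(parent_aa, leaf_aa, has_hla)`, the helper names, and all counts are hypothetical, and the parent states are assumed to have already been inferred from a maximum-likelihood tree by separate software.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def count_table(records):
    """Build (a, b, c, d) counts from (has_hla, has_event) pairs."""
    a = sum(1 for hla, ev in records if hla and ev)
    b = sum(1 for hla, ev in records if hla and not ev)
    c = sum(1 for hla, ev in records if not hla and ev)
    d = sum(1 for hla, ev in records if not hla and not ev)
    return a, b, c, d


# Toy leaves as (parent_aa, leaf_aa, has_hla); every value here is made up.
leaves = (
    [("L", "L", True)] * 6      # one lineage that inherited L; hosts carry the HLA
    + [("V", "V", True)] * 2    # unchanged V in hosts with the HLA
    + [("V", "L", False)] * 1   # a single independent V-to-L change
    + [("V", "V", False)] * 20  # unchanged V in hosts without the HLA
)

# Uncorrected test: does the leaf carry L, ignoring the tree?
naive_p = fisher_exact_two_sided(
    *count_table([(hla, leaf == "L") for _, leaf, hla in leaves])
)

# Tree-aware test: among leaves whose parent was V, did the leaf change?
tree_p = fisher_exact_two_sided(
    *count_table([(hla, leaf != "V") for parent, leaf, hla in leaves if parent == "V"])
)

print(naive_p, tree_p)
```

In this toy set the six sequences that share L by descent produce a strong uncorrected association, while counting only changes relative to the inferred parent leaves no signal, which mirrors the pattern the text attributes to founder effects.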

Table 2.

Phylogenetic analyses of HIV mutation-host HLA associations not explained by HIV subtype. Levels of tree-based support for HLA immune escape–driven associations are as follows: strongly supported, moderately supported, weakly supported, or not supported. The HLA allele, relevant amino acid (aa), and HXB2 position of the variable nucleotides and amino acids are listed. The first four statistics indicate the level of support in the tree for an amino acid changing in people with the HLA allele and include (i) the P-value in the tree including all 96 sequences, (ii) the P-value in the tree excluding possible intrasubtype recombinant sequences, (iii) support found in 80% or more of bootstrap trees, and (iv) a q-value based on screening all HLAs by all positions for tree-based evidence for associations. The P- and q-values for the likelihood-ratio tree-based statistic are given next, and then the P- and q-values for the uncorrected associations found in the full set of 234 sequences. If an HLA-appropriate epitope spans the position, it is noted; anchor residues are in bold, and the HLA-correlated amino acid is underlined. The epitopes shaded in gray were tested experimentally by interferon-γ ELIspot (fig. S1). The two strongest HLA-nucleotide associations were with B*4001 and were embedded in the same codon; thus, the impact of the E to not-E association with B*4001 is the same for both. There were not enough A*3101-positive subjects in the 96 sequences used for the baseline tree to obtain meaningful results, so we added back HIV sequence fragments from A*3101 individuals and made an additional tree from a shorter alignment 653 nucleotides long.

HLA | aa | HXB2 position | Protein | P-value tree 1 | P-value nonrec | 80% bootstrap | q-value tree 1 | P-value tree 2 | q-value tree 2 | P-value cohort | q-value cohort | Known epitope (ref), reactivity | HLA motif | Potential epitope
Correlations strongly supported as escape by evidence for selection in the tree (q < 0.2)
B*4001 E 2233 p6 0.00002 0.00001 ≤0.0001 0.04 1 × 10⁻⁵ <0.001 0.0001 0.04 KELYPLTSL .E......L
B*4001 E 2235 p6 NA NA NA NA NA <0.001 0.066 NA KELYPLTSL .E......L
A*3101 K 2113 p1 0.0003 NA NA <0.001 NA <0.001 0.061 NA KIWPSYKGR .........R
A*3101 R 1996 p7 0.0004 NA NA <0.001 NA <0.001 0.055 NA LARNCRAPRK .........R
B*4002 N 1903 p2 0.0004 0.0003 ≤0.0004 0.19 0.0003 0.37 0.0008 0.13 AEAMSQVTNS* .E......IAVL
B*1501 I 2529 Pol 0.0005 0.006 ≤0.092 0.19 6 × 10⁻⁵ 0.14 4 × 10⁻⁶ 0.005 TQIGCTLNF .AST......FWY
A*3101 K 1978 p7 0.0013 NA NA 0.0008 NA 4 × 10⁻⁵ 0.022 NA .........R CGKEGHTAR
Correlations moderately supported as escape by evidence for selection in the tree (q ≥ 0.2, q < 0.9)
B*5701 T 1513 p24 0.0007 NA NA 0.220 0.0003 1 × 10⁻⁷ 0.000 TSTLQEQIGW .AST......FWY
B*1501 V 2481 Pol 0.0057 0.0006 ≤0.034 0.524 0.01 0.76 0.0011 0.158 .QL......YF VLVGPTPVN
B*0702 G 1858 p24 0.0174 0.03 ≤0.059 0.694 0.03 0.88 0.0004 0.089 GPGHKARVL .P......L
C*1203 V 1957 p7 0.0246 0.0156 ≤0.028 0.742 0.01 0.77 0.0010 0.153 .A......FWY NQRKIVKCF
A*0101 V 1216 p24 0.0336 0.0355 ≤0.034 0.796 0.02 0.85 0.0002 0.066 .DE......Y
DQB1*0301 P 985 p17 0.0461 0.0204 ≤0.046 0.849 0.03 0.89 0.0002 0.061 Unknown
Correlations weakly supported and not contradicted as escape
A*2301 N 2361 Pol 0.0942 0.0587 ≤0.096 0.927 0.03 0.89 0.0002 0.061 Unknown
B*4402 N 2361 Pol 0.0963 1.0 ≤0.11 0.927 0.009 0.75 0.0016 0.197 EEMNLPGRW .E......YF
B*4002 E 1981 p7 0.1042 0.07 ≤0.10 0.927 0.14 0.94 0.0007 0.115 .E......IAVL KEGHTARNCRA
DRB1*0401 D 1723 p24 0.1101† 0.3351 ≤0.25 0.927 0.22 0.97 0.0003 0.081 FLV.......NQST
C*0602 I 1228 p24 0.1828 0.3684 ≤0.37 0.965 0.10 0.95 0.0002 0.061 ............LIVY HQAISPRTL
Evidence for escape contradicted due to sublineages in the tree
DQB1*0502 A 1225 p24 0.2329 0.0669 ≥0.11 0.975 0.12 0.95 0.0013 0.171 Unknown
A*0301 A 949 p17 0.5779† 1.0 ≥0.20 0.991 0.72 0.99 0.0002 0.061 LVM......KYFR ALETSEGCR

* This is a B*4501 epitope; it is the same supertype as B*4002, so may cross-present.

† These P-values are not for escape, as are all others, but for reversion to the susceptible form in the absence of the HLA allele.

Twenty associations remained that were not subtype driven (Table 2). In two cases, the uncorrected signal arises from a few clusters of viral sequences, rather than many independent events (Fig. 1B and Table 2). In both cases, the relevant within-clade clusters were not themselves the consequence of HLA-mediated selection, because they were also found in trees generated from only silent substitutions, which should be independent of immune pressure. Seven of the original 20 associations were strongly supported as HLA-mediated escape or reversion, after consideration of the phylogeny (Table 2 and Fig. 1C), and 13 additional associations, missed by the simple analysis, were also identified (table S1). These associations were also supported by trees that excluded potential intrasubtype recombinants and by a bootstrap analysis (Table 2).

To validate our findings, we scanned for associations embedded in known epitopes or in epitopes predicted on the basis of motifs that could act as anchors for HLA binding. Notably, of the 62 cases where the phylogeny is indicated as the underlying cause of the associations (Tables 1 and 2), only one site was embedded in a previously defined HLA epitope. Forty-three of these 62 associations were with HLAs that have described anchor motifs; only 8 out of 43 were embedded in potential epitopes suggested by these motifs. In contrast, 6 out of 7 associations validated by our phylogenetic methods (Table 2) are embedded in HLA-appropriate epitopes, in accord with expectation given a q-value cutoff of 0.2, indicating the strong specificity of the approach. Four out of 7 associations were previously defined experimentally, and the other three were embedded in predicted epitopes (Table 1). Elispot screening of individuals with the appropriate HLA experimentally confirmed that two of the other three associations were embedded in epitopes (fig. S1). Intriguingly, in several cases, the HLA-associated variants were the escape form in some individuals, as predicted, but were the more susceptible form of the epitope in other individuals (fig. S1). These variants differed in positions that affect recognition by T cell receptors.
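The motif-based scan for potential epitopes can be sketched as below. This is an illustration only, under several assumptions: a motif such as .E......L is read as an anchor at peptide position 2 and another at the C terminus, the function name `potential_epitopes` and its parameters are hypothetical, and the flanking residues around the Table 2 epitope KELYPLTSL in the toy protein are invented.

```python
def potential_epitopes(protein, site, pos2_anchors, cterm_anchors,
                       min_len=8, max_len=12):
    """Return (start, peptide) windows that span `site` and match both anchors.

    `site` is a 0-based index into `protein`. Anchors are sets of allowed
    residues at peptide position 2 and at the C terminus (an assumed reading
    of motifs written like '.E......L').
    """
    hits = []
    for length in range(min_len, max_len + 1):
        # Earliest and latest window starts that still cover the site
        first = max(0, site - length + 1)
        last = min(site, len(protein) - length)
        for start in range(first, last + 1):
            peptide = protein[start:start + length]
            if peptide[1] in pos2_anchors and peptide[-1] in cterm_anchors:
                hits.append((start, peptide))
    return hits


# Toy protein: the epitope KELYPLTSL with made-up flanking residues
toy_protein = "GSQKELYPLTSLRSG"
associated_site = 4  # 0-based index of the E residue in the toy protein

hits = potential_epitopes(toy_protein, associated_site,
                          pos2_anchors={"E"}, cterm_anchors={"L"})
print(hits)
```

A site that yields no window for its associated HLA would count as not embedded in a potential epitope, which is the kind of tally the text reports for phylogeny-driven versus tree-validated associations.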

If founder events are indeed an important confounding influence, methods that do not correct for the phylogeny should also yield associations between silent mutations (that do not alter the amino acid sequence) and host HLA. In the 51 codons in the HIV Gag protein that encoded invariant amino acids, we found 10 variable but silent positions associated with HLAs with q < 0.20, a rate statistically indistinguishable from the amino acid–altering mutation–HLA correlations (Fisher's exact P = 0.64). The lowest P-values among the silent mutations were with HLA-C*1701 and DRB1*1401 and were subtype-driven, as in the amino acid–altering cases (Table 1). In contrast, we found no spurious correlations between silent substitutions and HLA using our tree-corrected method.

We also applied our phylogenetic analyses to a data set provided by the group in Perth that closely approximates the sequence data used in the original Moore et al. study (5) (GenBank accession numbers DQ409341 to DQ409813), together with serologically defined HLA alleles. These 446 sequences consisted of 406 intact subtype B, 23 C, 16 A, and one D sequence, suggesting the potential for subtype-driven associations. Our uncorrected methods yielded 28 correlations between amino acids and HLA types with q-values < 0.2. Nine of these had a q-value < 0.2 based on phylogeny-corrected methods, and 14 were due to lineage effects. We also reexamined the 12 associations reported in (5) to have survived correction for multiple tests. The seven among these noted to be embedded in or proximal to known CD8 T cell epitopes in the original study [the red- and blue-boxed associations in the supplemental figure of Moore et al. (5)] were validated with our tree-based methods, in contrast to the other five (the black-boxed associations).

Associations between an HLA allele and a subtype consensus amino acid may be the consequence of immune selection in a population with a relatively high frequency of the presenting HLA (1, 5, 8). Leslie et al. (10) explored this idea in a detailed characterization of two common variants that confer CTL escape. We reevaluated these data in a phylogenetic context and found that the amino acids associated with CTL escape were likely to have been in the founder virus of the subtype, because the escape amino acid also dominates all phylogenetically related subtypes, regardless of the frequency of the associated HLA allele (Fig. 2). Furthermore, the relative frequency of the escape and nonescape forms has stayed constant over time, suggesting an equilibrium situation rather than selection over time favoring the escape form (fig. S2). Thus, the interpretation of these data is also complicated by the phylogeny. Because in this case the associations were within clearly defined CTL epitopes, they may represent immunologically selected stable mutations present in the founder population.

Fig. 2.

A full-length genome maximum-likelihood tree superimposing the patterns of escape mutations featured in (10) against an evolutionary backdrop of M-group viruses. Maximum-likelihood ancestral reconstructions suggest that the escape forms of these two epitopes have dominated the subtypes in question since their origin. In the Nef epitope (A), glycine dominates subtype C and all subtypes found in cluster 1; the probability that the ancestral state of the C clade was glycine is 0.740. In the Integrase epitope (B), valine dominates subtype B, and all surrounding subtypes in cluster 2; the probability that the ancestor of subtype B carried the escape mutation is 0.999.

Correlation studies based on previous methods can lead to both false-positives and false-negatives. Many of the signatures of HLA-associated immune escape from previous population-level studies of chronic HIV infection are not supported, and incorporating phylogenetic corrections improves the accuracy of their identification. Furthermore, though immune escape is often observed in individuals (1, 2) and will affect the population frequency of the escape variants, contemporary amino acid frequencies also depend critically on founder effects.

Many confounding factors can obscure identification of HLA-mediated immune escape in population studies: HLA alleles often cross-present the same epitopes, amino acids may be embedded in multiple overlapping epitopes (11), an escape variant in one person may be susceptible in another (fig. S1), and compensatory changes to maintain fitness may confound associations (4, 12, 13). Finally, more data will likely reveal many more associations. Thus, one should not interpret our results as evidence that immune pressure is a weak force in HIV evolution. Rather, we demonstrate that phylogenetic effects need to be accounted for in locating associations resulting from immune selection pressure in population-level studies. If vaccine antigen designs ultimately incorporate immune susceptibility and escape patterns, accuracy in defining these patterns is essential.

The methods developed here are general and can be applied to any search for phenotypic correlations with sequence data. The question of HLA-driven immune escape is of vital importance to the HIV field; by taking genetic lineages into account, a very different interpretation of apparent HLA associations emerges. Contemporary immune selection is merely one contributing factor in a complex array of competing selective forces shaping currently circulating global virus populations.

Supporting Online Material

Materials and Methods

Figs. S1 and S2

Table S1

Alignments and Phylogenetic Trees
