Recombination in the Hemagglutinin Gene of the 1918 "Spanish Flu"


Science  07 Sep 2001:
Vol. 293, Issue 5536, pp. 1842-1845
DOI: 10.1126/science.1061662


When gene sequences from the influenza virus that caused the 1918 pandemic were first compared with those of related viruses, they yielded few clues about its origins and virulence. Our reanalysis indicates that the hemagglutinin gene, a key virulence determinant, originated by recombination. The “globular domain” of the 1918 hemagglutinin protein was encoded by a part of a gene derived from a swine-lineage influenza, whereas the “stalk” was encoded by parts derived from a human-lineage influenza. Phylogenetic analyses showed that this recombination, which probably changed the virulence of the virus, occurred at the start of, or immediately before, the pandemic and thus may have triggered it.

The 1918 “Spanish flu” pandemic was the most severe recorded outbreak of acute human disease and was also infamous because it killed an unusually high number of young adults (1, 2). Fragments of the genomic RNA of the 1918 virus were recently recovered from preserved tissues of three of its victims, and complete sequences for three genes, including the hemagglutinin (HA) gene, were reported (3–5). These sequences confirmed that the 1918 Spanish flu was caused by an influenza A virus of the H1N1 subtype, but they did not reveal why the virus was so virulent (3–7).

The virulence of influenza A viruses is largely determined by their HA. Mutations in the HA gene have produced highly pathogenic strains, and the major pandemics of 1957 and 1968 were largely caused by the introduction of antigenically novel HA genes from bird-infecting influenzas (8–11). It has been suggested that the 1918 pandemic was similarly caused by the introduction of genes from an avian strain (6, 12), but this theory was not supported when sequences from the virus were obtained (3–5). Phylogenetic analyses showed that the 1918 virus was most closely related to H1 influenzas from mammals and suggested that progenitors of the virus had infected mammals for several years before 1918, implying that some additional event must have triggered the pandemic (3–7). New virulent variants of some other viruses have been generated by homologous recombination (13–15), but no evidence of this kind of genetic change has been found before in influenza virus populations (16, 17). Here, we report that the 1918 HA gene was a recombinant, and that the start of the 1918 pandemic and the recombination event were probably linked.

Complete HA gene sequences were analyzed from 30 H1-subtype isolates from the three main lineages (3): the lineages of isolates mostly from people, pigs, and birds. Sequences were aligned (18) and gaps (two codons) removed, producing an alignment 1695 nucleotides long. The mature HA protein consists of the NH2-terminal HA1 and COOH-terminal HA2 polypeptides; the first 1026 nucleotides of our alignment encoded the HA1 and the remainder the HA2.

Every possible combination of three sequences from the aligned set was examined by the sister-scanning method (19) using, as outlier, a fourth sequence generated by local randomization. Four HA gene sequences were identified as likely recombinants—those of the 1918 influenza (A/South Carolina/1/18) and three Iowa-cluster sequences: A/swine/Iowa/15/30 (Iowa), A/Alma Ata/1417/84 (Alma Ata), and A/swine/St-Hyacinthe/148/90 (St-Hyacinthe). Different regions of these genes contained dominant signals that were conflicting (Fig. 1) but significant (Z scores >3.0) when compared with several combinations of HA sequences from isolates from pigs and humans. Two possible recombination sites were found in the 1918 sequence and three in the Iowa-cluster sequences; all of these, except one of the sites in the Iowa cluster, were also found using a maximum likelihood (ML) method for detecting recombination (14) and were shown to be statistically significant in Monte Carlo tests using the ML method (20). Recombination, rather than convergence caused by selection, was shown to be the cause of the conflicting relatedness signals, because sister-scanning analyses using only synonymous or third-codon differences also gave statistically significant signals (Fig. 1, C and E). The conflicting signals were found only when one reference sequence was from a swine-lineage isolate and the other from a human-lineage isolate, and not in comparisons involving sequences from avian isolates, which form an outgroup to the other lineages. Split decomposition analysis (21) supported the conclusion that the older lineages had not evolved by simple bifurcating speciation (22).
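The idea of scoring a relatedness signal against a column-randomized null can be illustrated with a minimal sketch. This is not the sister-scanning method (19) itself, which works on four sequences including a randomized outlier; it is a simplified three-sequence version under assumed conventions. The function names, the support statistic (sites where the query matches one reference but not the other), and the toy sequences are all hypothetical.

```python
import random
import statistics

def support_count(query, ref_a, ref_b):
    """Count sites where the query matches ref_a but not ref_b."""
    return sum(1 for q, a, b in zip(query, ref_a, ref_b) if q == a and q != b)

def shuffle_columns(seqs, rng):
    """Shuffle the nucleotides within each alignment column among the sequences."""
    columns = [list(col) for col in zip(*seqs)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]

def z_score(query, ref_a, ref_b, replicates=200, seed=0):
    """Z score of the observed support count against a column-shuffled null."""
    rng = random.Random(seed)
    observed = support_count(query, ref_a, ref_b)
    null = []
    for _ in range(replicates):
        q, a, b = shuffle_columns([query, ref_a, ref_b], rng)
        null.append(support_count(q, a, b))
    sd = statistics.pstdev(null)
    if sd == 0:
        return 0.0  # no informative sites, so no signal can be scored
    return (observed - statistics.mean(null)) / sd

# Hypothetical toy window: the query is identical to ref_a and differs
# from ref_b at every site.
toy_query = "ACGT" * 5
toy_ref_a = "ACGT" * 5
toy_ref_b = "CGTA" * 5
toy_z = z_score(toy_query, toy_ref_a, toy_ref_b)
print(f"Z = {toy_z:.2f}")
```

In this toy window every site supports the query and ref_a as the closest pair, so the score lies well above the 3.0 threshold quoted in the text. A window in which all three sequences are identical returns 0.0.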

Figure 1

Conflicting phylogenetic signals in HA gene nucleotide sequences. (A) Percentage of identity between the HA gene of the 1918 influenza (A/South Carolina/1/18) and the HA genes of a human-lineage influenza (A/Kiev/59/79; dotted plot) and a swine-lineage influenza (A/swine/Wisconsin/1/61; unbroken plot). Variable positions were scored using a window of 200 positions that was moved in steps of 20 positions. (B) Z scores calculated by the sister-scanning method (19) for the identity scores of the same sequences. Scores were calculated after Monte Carlo randomization within columns in the alignment. (C) Z scores calculated only from identities at synonymous sites in the HA genes of the 1918 influenza and a human-lineage isolate (A/Suita/1/89; dotted plot) and a swine-lineage isolate (A/swine/Wisconsin/1/61; unbroken plot). (D) Z scores calculated for identities at all variable sites between the HA gene of the Iowa influenza (A/swine/Iowa/15/30) and the HA genes of the isolates A/Kiev/59/79 (dotted plot) and A/swine/Wisconsin/1/61 (unbroken plot). (E) Z scores calculated for identities at variable third-codon positions between the HA gene of the Alma Ata influenza (A/Alma Ata/1417/84) and those of the isolates A/Kiev/59/79 (dotted plot) and A/swine/Wisconsin/1/61 (unbroken plot). P values, shown above the likely recombination sites, were calculated by comparing likelihood ratios obtained from the same sequences with ratios obtained from 200 simulated sequence data sets that were not recombinant but were of equal length and similar composition and heterogeneity (14, 20).
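The windowed identity plot in panel A can be approximated with a short sketch. It assumes two pre-aligned, gap-free sequences of equal length and, as a simplification, scores every position rather than only the variable ones. The function name and the toy sequences are hypothetical, and the toy example uses a smaller window so that the crossover is easy to see.

```python
def windowed_identity(query, reference, window=200, step=20):
    """Percent identity between two aligned sequences in sliding windows.

    Returns a list of (window_start, percent_identity) tuples with
    0-based window starts.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        matches = sum(1 for a, b in zip(q, r) if a == b)
        results.append((start, 100.0 * matches / window))
    return results

# Hypothetical toy data: a mosaic query whose first half matches the
# "swine" reference and whose second half matches the "human" reference.
toy_swine = "C" * 40
toy_human = "A" * 40
toy_mosaic = "C" * 20 + "A" * 20

swine_profile = windowed_identity(toy_mosaic, toy_swine, window=10, step=5)
human_profile = windowed_identity(toy_mosaic, toy_human, window=10, step=5)
for (start, to_swine), (_, to_human) in zip(swine_profile, human_profile):
    print(f"window at {start:2d}: swine {to_swine:5.1f}%  human {to_human:5.1f}%")
```

The two profiles cross where the mosaic switches parents, which is the kind of conflicting signal the figure displays for the real sequences.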

The results indicated that the 1918 sequence was generated by recombination between two parental HA genes that were detectably different and were related to the extant human-lineage and swine-lineage HA genes. About 150 nucleotides at the 5′ terminus and 775 nucleotides at the 3′ terminus of the 1918 HA gene were derived from the human-lineage parental gene, and the sequence between these terminal regions was derived from the swine-lineage parental gene (Fig. 1, A and B). These regions of the HA gene encode structurally distinct domains of the HA: The swine-lineage sequence encodes the globular domain of the HA1 polypeptide that includes the major antigenic sites, the host cell receptor-binding site, and almost all the glycosylation sites (23), whereas the human-lineage sequence encodes the NH2- and COOH-terminal parts of the HA1 and all the HA2 polypeptide, which together form the stalk that anchors the HA to the lipid outer layer of the virion.
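The approximate coordinates of the three regions follow from the figures given in the text (a 1695-nucleotide alignment, about 150 human-lineage nucleotides at the 5′ terminus, about 775 at the 3′ terminus, and an HA1-encoding region of 1026 nucleotides). The sketch below simply carries out that arithmetic; the input values are approximate in the source, so the boundaries it prints are indicative only, and the names are hypothetical.

```python
ALIGNMENT_LENGTH = 1695   # nucleotides in the alignment (from the text)
HA1_LENGTH = 1026         # first 1026 nucleotides encode HA1 (from the text)
FIVE_PRIME_HUMAN = 150    # approximate human-lineage 5' region (from the text)
THREE_PRIME_HUMAN = 775   # approximate human-lineage 3' region (from the text)

def lineage_segments(total=ALIGNMENT_LENGTH,
                     five=FIVE_PRIME_HUMAN,
                     three=THREE_PRIME_HUMAN):
    """Return (label, start, end) tuples using 1-based inclusive coordinates."""
    swine_start = five + 1
    swine_end = total - three
    return [
        ("human-lineage 5' region", 1, five),
        ("swine-lineage central region", swine_start, swine_end),
        ("human-lineage 3' region", swine_end + 1, total),
    ]

segments = lineage_segments()
for label, start, end in segments:
    print(f"{label}: {start}-{end} ({end - start + 1} nt)")

# Portion of the HA1-encoding region that lies downstream of the swine segment.
ha1_tail = HA1_LENGTH - segments[1][2]
print(f"HA1 nucleotides within the 3' human-lineage region: {ha1_tail}")
```

On these figures the swine-lineage segment spans roughly positions 151 to 920 and lies entirely within the HA1-encoding region, while the 3′ human-lineage region covers the last part of HA1 and all of HA2, in line with the domain assignment described above.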

Likely early events in the evolution of the mammalian H1-subtype HA genes revealed by the recombination analyses are summarized in Fig. 2. An ancestral avian influenza H1 HA gene (3, 6) became established in mammals and diverged into two lineages before 1918. These produced the 1918 HA gene by recombination in a mixed infection. We found no evidence of recombination in any human-lineage HA genes, which indicated that they probably evolved from one of the parental lineages and not directly from the 1918 gene. The 1918 HA gene probably died out after the pandemic, because no other HA gene with its pattern of affinities is known.

Figure 2

A scheme of the likely course of events in the early evolution of the HA genes of the 1918 influenza and the human and swine lineages after the original bird-to-mammal host switch (triangle). Recombination events between HA genes are represented by circles and near-horizontal lines that join the parental lineages. Shaded bars represent the HA genes, with darker regions for human-lineage sequences and paler regions for swine-lineage sequences. The 1918 lineage and an early swine lineage probably became extinct (dotted lines).

The HA genes of the Iowa cluster contain at least two, and probably three, recombination sites (Fig. 1, D and E). Two of these sites are located at positions close to those in the 1918 sequence, which suggests that they were inherited; the third is close to the HA1-HA2 boundary. This third site probably marks a recombination event in which the Iowa-cluster HA gene was produced when the HA1-encoding sequence from a 1918 lineage gene was joined by recombination with a HA2-encoding sequence from a distinct swine-lineage gene. This is plausible because Iowa, the oldest swine isolate, is believed to have descended from the 1918 virus (11, 24), and Alma Ata and St-Hyacinthe are descendants of Iowa (25, 26).

It has been argued that the 1918 virus probably emerged from birds into mammals immediately before the pandemic, because there is little evidence of change in the 44 amino acid residues that probably contribute to the four antigenic sites in the 1918 HA1 globular domain (3, 6). Our results suggest, however, that the antigenic sites have changed slowly, because the entire globular domain was derived from a strain that infected pigs; antigenic sites also change relatively slowly when influenzas infect this host (11).

Phylogenetic trees were inferred from the sequences by an ML method (27). To avoid topological errors resulting from recombination (28), we inferred separate phylogenetic trees (Fig. 3) for the two longest regions of the sequences that did not include recombination sites, namely, regions within the sequences encoding the globular and HA2 stalk domains. We excluded the HA sequences of avian isolates, which were used previously as outliers to root HA trees (3, 11), for two reasons: the position of the branch joining the avian sequences to a tree varied widely depending on the parameters of the ML model, and there was clear evidence that the sequences from bird and mammal isolates had been subjected to different modes of selection, in that they have distinct transition/transversion ratios (22) and synonymous/nonsynonymous ratios.

Figure 3

Phylogenies of the nucleotide sequences encoding a part of the globular domain of the HA1 and a part of the HA2 domain of mammalian H1 influenzas. ML trees were inferred from aligned nucleotide positions 310 to 870 (A) and 1070 to 1650 (B) by the quartet puzzling method (27) using the Tamura-Nei formula to model substitution (32). Transition/transversion ratios, purine/pyrimidine transition ratios, and nucleotide frequency parameters were estimated from the data through several rounds of optimization, as was an eight-parameter gamma distribution of the rates of change for variable sites. Both trees are rooted at the midpoint between the nodes for the Iowa 30 and NWS 33 sequences. Bootstrap values calculated from 500 samples by maximum parsimony are shown for the major branches (gray boxes). Dates of isolation are shown next to taxa names. The code names of the isolates and the GenBank accession codes for their HA gene sequences are as follows: A/South Carolina/1/18 (AF117241), A/swine/Iowa/15/30 (AF091308), A/Alma Ata/1417/84 (S62154), A/swine/St-Hyacinthe/148/90 (U11703), A/swine/Wisconsin/1/61 (AF091307), A/swine/Illinois/63 (X57493), A/swine/New Jersey/11/76 (K00992), A/swine/Ehime/1/80 (X57494), A/swine/St-Hyacinthe/106/91 (U11857), A/swine/Nebraska/1/92 (S67220), A/swine/Wisconsin/457/98 (AF222034), A/WSN/33 (J02176), A/NWS/33 (U08903), A/PR/8/34 (NC_002017), A/swine/Cambridge/39 (D00837), A/Tokyo/3/67 (U38242), A/Mongolia/153/88 (Z54287), A/Fort Monmouth/1/47 (U02464), A/Leningrad/54/1 (M38312), A/Kiev/59/79 (M38353), A/USSR/90/77 (K01330), A/CHR/157/83 (X17221), A/Mongolia/231/85 (Z54286), A/Suita/1/89 (D13573), A/Mongolia/162/91 (Z54289), A/swine/Scotland/4104/94 (AF085413).

In the ML trees, the sequences grouped into the swine and human lineages, as expected (Fig. 3). In the globular domain tree (Fig. 3A), the 1918 partial sequence was always on the swine-lineage side of a midpoint root, and in the HA2 tree it was always on the human-lineage side of this root, regardless of how the midpoint was estimated. In the globular domain tree, the ML patristic distance from the 1918 sequence to the closest swine-lineage sequence (0.072) was half of that to the closest human-lineage sequence (0.144), but in the HA2 tree, the 1918 sequence was closer to a human-lineage sequence (0.061) than to a swine-lineage sequence (0.076). Uncorrected evolutionary distances, calculated directly from pairs of sequences, also indicated a change in the relationships (22). These results were consistent with our findings that the 1918 HA was not the parent of both HA lineages and that the two parts of the 1918 gene came from different parts of distinct and older genes.

In the globular domain tree (Fig. 3), the 1918 sequence was placed on the “trunk” connecting the human and swine lineages; in the HA2 tree, it was joined to the trunk by a very short branch. The length of that branch indicated that the HA2-encoding sequence of the 1918 virus differed from the predicted ancestral (trunk) sequence at only 0.4% of sites. Genes of H1-subtype influenza viruses infecting mammals have accumulated changes at 0.6 to 1.2% per year since the 1930s (11), but rates as high as 3.9 to 7.9% per year have occurred immediately after a host switch (29, 30). Our analysis indicated that between 1918 and 1933 the HA2 nucleotide sequences of the human lineage changed by about 0.4% per year, although the rates were so variable that both data sets failed an ML molecular clock test. Because the progenitors of the 1918 virus probably switched hosts from birds to mammals sometime after 1900 (1, 6), it is likely that the 1918 HA gene changed at a rate of 0.4 to 8.0% per year after it was generated. Thus, using the predicted sequence difference of 0.4% and the likely range of rates, we estimate that the recombination-to-preservation time was less than 1 year.
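The time estimate is a division of the predicted sequence difference by the assumed rate of change. A minimal sketch of that arithmetic, using only the figures quoted above (a 0.4% difference and rates of 0.4 to 8.0% per year), is shown here; the function and variable names are hypothetical.

```python
def elapsed_years(divergence_percent, rate_percent_per_year):
    """Time needed to accumulate a given divergence at a constant rate."""
    if rate_percent_per_year <= 0:
        raise ValueError("rate must be positive")
    return divergence_percent / rate_percent_per_year

DIVERGENCE = 0.4   # % of sites differing from the predicted trunk sequence
SLOW_RATE = 0.4    # % per year, lower end of the range in the text
FAST_RATE = 8.0    # % per year, upper end of the range in the text

longest = elapsed_years(DIVERGENCE, SLOW_RATE)    # slowest rate, longest time
shortest = elapsed_years(DIVERGENCE, FAST_RATE)   # fastest rate, shortest time
print(f"Estimated interval: {shortest:.2f} to {longest:.2f} years")
```

The slowest rate in the range gives an upper bound of about 1 year, and the fastest gives a few weeks, which is the basis for placing the recombination event no more than about a year before the virus was preserved.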

The victims from whom the 1918 influenza sequences were obtained died in the major “second wave” of the pandemic in late September and October 1918 (2, 3); thus, the 1918 HA gene was probably generated in late 1917 or early 1918. The “first wave” of the pandemic was in early 1918 (2), but the first outbreaks may have been in late 1917. Hence, the start of the pandemic coincided with a recombination event that might produce the phenotypic novelty required to trigger a pandemic. This coincidence suggests a causal link.

Recombination, like point mutation and reassortment, produces novel virus variants and can result in increased virulence (13–15). Because the HA gene is the major virulence determinant (3–11), recombination in this gene may have similarly altered the 1918 virus. The parental H1 HA genes would have been progressively altered by point mutation after their divergence; we estimate that they differed at up to 30 amino acid positions at the time of the recombination, and that the 1918 HA differed from each of its parents at about half as many positions. Recombination may have altered the antigenicity of the HA so that the immunity of those who had survived earlier infections was ineffective. Similarly, the membrane-fusion or receptor-binding function of the HA protein may have changed (3, 31), and this may have given the 1918 virus an unusual tissue specificity, such that it spread from the upper respiratory tract to the lungs. Experiments comparing reconstructed 1918 and parental HA proteins may distinguish between these possibilities.

Our analysis suggests that the two parental lineages were probably mammal-adapted and capable of mammal-to-mammal transmission, and yet they did not generate a pandemic. It is possible that the recombination event triggered the pandemic not only by altering HA structure or function, but also by permitting the virus to outcompete these parents or to be the first of these H1-subtype influenzas to switch hosts from some other mammal into humans.

* To whom correspondence should be addressed. E-mail: mark.gibbs{at}

