Transcription Start Regions in the Human Genome Are Favored Targets for MLV Integration


Science  13 Jun 2003:
Vol. 300, Issue 5626, pp. 1749-1751
DOI: 10.1126/science.1083413


Factors contributing to retroviral integration have been intractable because past studies have not precisely located genomic sites of proviruses in sufficient numbers for significant analysis. In this study, 903 murine leukemia virus (MLV) and 379 human immunodeficiency virus–1 (HIV-1) integrations in the human genome were mapped. The data showed that MLV preferred integration near the start of transcriptional units (either upstream or downstream) whereas HIV-1 preferred integration anywhere in the transcriptional unit but not upstream of the transcriptional start. Defining different integration site preferences for retroviruses will have important ramifications for gene therapy and may aid in our understanding of the factors directing the integration process.

Retroviruses have been used as efficient gene-delivery vehicles in many gene-therapy trials. Historically, the integration events of retroviruses were believed to be random, and the chance of accidentally disrupting or activating a gene was considered remote. Recently, 2 of 11 children treated for a rare blood disease with an MLV-based gene-therapy vector developed leukemia at least in part because of independent insertions of the MLV provirus near the same growth-promoting gene, LMO2 (1, 2). Thus, the safety of these treatments has become a primary consideration, and the assumption of random integration is in doubt. Although in vitro integration models have identified many important factors for integration site selection, such as nucleosomal structure and DNA binding proteins (3–8), integration site selection in vivo still remains poorly understood, and no consensus sequences have been determined in the primary flanking sequences of target-site DNA. Before the sequence of the human genome was available, it was impossible to obtain an accurate global picture of retroviral integration events. Early in vivo studies have produced conflicting results, with some reporting that transcriptionally active regions are favored for retroviral integration (9, 10) and others reporting that transcriptionally active regions are disfavored (11). Recently, Schroder et al. mapped over 500 integrations of HIV-1 in the human genome and reported that HIV-1 integration favored genes (12). Whether this preference is specific for HIV-1 or applies to other retroviruses, particularly MLV, which is a widely used gene-therapy vector, was not known.

We designed a high-throughput method to rapidly clone the genomic regions adjacent to proviral integrations (13). HeLa cells were infected with pseudotyped MLV or HIV-1 and harvested 48 hours postinfection. Similarly, wild-type HIV-1 was also used to infect human H9 cells. Genomic DNA was isolated from each sample and subjected to linker-mediated, nested polymerase chain reaction (PCR) using a combination of long terminal repeat and linker-specific primers. We performed high-throughput cloning and sequence analysis of the cloned PCR products. For MLV, out of 2304 total sequencing reads, we obtained 1379 sequences meeting our validity criteria (13). Of those 1379 reads, we were able to unambiguously map 903 different integration sites in the human genome (14). Only 16 integration sites were sequenced more than once and none more than twice, suggesting we were well below saturation of our integration site library. With the use of the same method, we also mapped 244 integrations of infectious HIV-1 in the human H9 cell line and 135 integrations of recombinant pseudotyped HIV-1 in HeLa cells.

For the initial comparison of MLV and HIV-1 integration sites, we defined an integration as having landed in a gene only if it was between the transcriptional start and transcriptional stop boundaries of one of the 18,214 RefSeq (15, 16) genes mapped to the human genome. RefSeq genes are curated on the basis of known mRNA transcripts and do not rely on gene prediction programs, thus avoiding any potential computational bias. Our data showed that 62% (152/244) of HIV-1 integrations in H9 cells landed in RefSeq genes and 50% (67/135) of pseudotyped HIV-1 integrations in HeLa cells landed in RefSeq genes. Because these differences were not statistically significant, we combined them to show that 58% of the HIV-1 integrations into the human genome landed in RefSeq genes. We also analyzed the HIV-1 data from Schroder et al. (12) with the use of our criteria (13). The analysis showed that 61% of HIV-1 integrations landed in RefSeq genes, which is consistent with our results.
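The "landed in a gene" criterion above can be illustrated with a minimal sketch. The `Gene` record, the function names, and the coordinate convention (start less than or equal to end, regardless of strand) are assumptions made for illustration and are not taken from the study's own pipeline.

```python
from typing import NamedTuple, Sequence, Tuple

class Gene(NamedTuple):
    # Hypothetical record for one RefSeq transcript.
    chrom: str
    tx_start: int  # lower genomic coordinate of the transcribed region
    tx_end: int    # higher genomic coordinate of the transcribed region

def lands_in_gene(chrom: str, pos: int, genes: Sequence[Gene]) -> bool:
    """True if the site lies between the start and stop boundaries of any gene."""
    return any(g.chrom == chrom and g.tx_start <= pos <= g.tx_end for g in genes)

def fraction_in_genes(sites: Sequence[Tuple[str, int]], genes: Sequence[Gene]) -> float:
    """Fraction of (chrom, pos) integration sites that land inside a gene."""
    hits = sum(lands_in_gene(chrom, pos, genes) for chrom, pos in sites)
    return hits / len(sites)

# Pooling the two HIV-1 data sets reported above (counts from the text).
pooled_hiv_percent = 100 * (152 + 67) / (244 + 135)
```

The linear scan is adequate for a toy example; with 18,214 genes and hundreds of sites an interval index would be the more natural choice.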

For the MLV integrations, 34% of the integrations (309/903) landed in RefSeq genes. This is significantly different from the result from a set of 10,000 computer-simulated random integrations, of which only 22.4% landed in RefSeq genes (χ² test, P < 0.0001). It is also significantly different from that of HIV-1 (χ² test, P < 0.0001) (Table 1). We demonstrated that the identified MLV and HIV-1 integrations were not biased by the linker-mediated–PCR technique used to isolate them (13).
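As a rough check on the comparisons above, the sketch below runs a Pearson χ² test on 2 × 2 tables built from the reported counts. It assumes a contingency-table test without continuity correction, and it assumes that 22.4% of the 10,000 simulated integrations corresponds to 2240 hits; the text does not state which variant of the test was used, so the statistic here is illustrative only.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom. With 1 df the
    upper-tail probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, erfc(sqrt(stat / 2))

# (in RefSeq genes, not in RefSeq genes)
mlv = (309, 903 - 309)
hiv = (152 + 67, 379 - (152 + 67))       # pooled H9 and HeLa HIV-1 data
random_sim = (2240, 10_000 - 2240)       # assumed count: 22.4% of 10,000

stat_rand, p_rand = chi2_2x2(*mlv, *random_sim)
stat_hiv, p_hiv = chi2_2x2(*mlv, *hiv)
```

Under these assumptions both P values fall well below the 0.0001 threshold quoted in the text.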

Table 1.

MLV and HIV-1 integration site distributions. The total numbers of mapped integrations were 903 and 379 for MLV and HIV-1, respectively. The HIV-1 data are pooled integration data from pseudotyped and infectious HIV-1. The random data are from a set of 10,000 computer-simulated random integrations.

Percentage of integrations

Region                                      MLV       HIV-1    Random
Within RefSeq genes                         34.2*,†   57.8*    22.4
Within 5 kb upstream of genes               11.2*,†   2.9      2.1
Within 5 kb downstream of genes             3.4       4.5      2.1
Within ±5 kb of transcription start sites   20.2*,†   10.8*    4.3
Within ±1 kb of CpG islands                 16.8*,†   2.1      2.1

* P < 0.0001 compared to random integration with the use of a χ² test.
† P < 0.0001 compared to HIV-1 integration with the use of a χ² test.

Because the two sets were clearly different, we analyzed whether the promoter regions of genes were favored target sites for MLV and HIV-1 integration. No accurate coordinates for the promoter regions of RefSeq genes are available, so we analyzed integrations that landed in various window sizes on either side of the +1 start site for RefSeq genes. As shown in Fig. 1A, the smaller the window size surrounding the transcriptional start site, the higher the density of observed MLV integrations. The number becomes too small to draw valid conclusions when the window size is smaller than 1 kb. In contrast, the percentage of HIV-1 integration sites that landed in the 5 kb upstream regions of RefSeq genes is indistinguishable from that of random placements (Fig. 1B).

Fig. 1.

(A) With the use of the 903 mapped integrations, we selected windows of varying sizes from 1 to 10 kb upstream and downstream of the transcriptional start site for all RefSeq genes. The total number of MLV integrations in each window was counted, and an average integration rate per kb was calculated. The dashed line represents the expected number of integrations/kb (SOM Text). (B) Graph showing the percentage of total integrations for MLV and HIV-1 in three separate regions of the RefSeq transcripts: 5 kb upstream, the transcript itself (each transcript is divided into eight equal sections regardless of length), and 5 kb downstream.
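The window analysis described for Fig. 1A can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes a single chromosome, ignores strand, counts a site once if it lies within the window of its nearest start site, and divides by the window width (upstream plus downstream) to obtain a rate per kb. The function name and the toy coordinates are hypothetical.

```python
from bisect import bisect_left

def density_per_kb(sites, tss_positions, window_kb):
    """Integrations within +/- window_kb of the nearest start site, per kb of window.

    Assumes tss_positions is non-empty and all coordinates share one chromosome.
    """
    tss = sorted(tss_positions)
    half_width = window_kb * 1000
    count = 0
    for pos in sites:
        i = bisect_left(tss, pos)
        # The nearest start site is either just left or just right of pos.
        nearest = min(abs(pos - tss[j]) for j in (i - 1, i) if 0 <= j < len(tss))
        if nearest <= half_width:
            count += 1
    return count / (2 * window_kb)

# Toy data: two start sites and five integration sites.
toy_tss = [10_000, 50_000]
toy_sites = [10_500, 9_200, 14_000, 30_000, 50_100]
toy_profile = {w: density_per_kb(toy_sites, toy_tss, w) for w in (1, 5, 10)}
```

In the toy data the density falls as the window widens, which is the qualitative pattern the caption describes for MLV.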

We found that the MLV integrations distributed evenly upstream and downstream of the transcriptional start site (Fig. 1A). This is very different from HIV-1 integrations, which highly favor the entire length of the transcriptional regions but not the regions upstream of the transcriptional start (Fig. 1B). We do not observe any preferences for the regions just downstream of the RefSeq transcripts for either MLV or HIV-1 integrations (Fig. 1B).

Regions of the genome enriched for the dinucleotide CpG, called CpG islands, are thought to be commonly associated with the transcriptional start sites in the vertebrate genome (17, 18). We found that 16.8% (152/903) of the MLV integrations landed in the region ±1 kb from the 27,704 documented human CpG islands, which is eight times higher than the value of 2.1% for random integrations. However, only 2.1% of HIV-1 integrations landed in the region ±1 kb from the same CpG islands.

To determine whether MLV-targeted genes are transcriptionally active in HeLa cells, we used a publicly available gene-expression database (19). Two independent sets of microarray data based on HeLa cell mRNA were analyzed (GSM2145 and GSM2177). Of the 196 integrations that were within ±5 kb of transcription start sites of RefSeq genes, 79 were represented on the arrays. The median expression level for these 79 genes was about 1.8-fold higher than that of all the genes on the arrays (1911/1288 in GSM2145 and 1052/487 in GSM2177, Mann-Whitney test, P < 0.0001). We confirmed this with two additional sets of unpublished microarray data on HeLa cells (20). These data suggest that MLV prefers to integrate into the transcription start sites of more actively transcribed genes.

In contrast to the findings by Schroder et al. (12), we did not observe any integration hot spots for HIV-1, nor did we find any for MLV integrations. This may be explained by our HIV-1 data set not being large enough to observe such a hot spot, or possibly HIV-1–based viruses behave differently in the cell lines we used. Apparently MLV does not have an obvious local preference at the resolution of 903 integrations in HeLa cells.

There are several advantages to the method we used for mapping integrations. Our study required no selection for specific phenotypes, such as antibiotic resistance, that could bias the sample. The linker-based amplification is simple and rapid, and, by using a frequently cutting enzyme like MseI, the amplicons are small (70 bp on average), helping to avoid possible amplification biases. Much of our data was generated with the use of recombinant retroviruses that are not able to replicate after infection; thus, we are only looking at the initial integration events and not subsequent reinfections. In addition, because we are looking at the MLV proviral integrations in a human cell line, this method helps us to evaluate more carefully any potential dangers that may come from the use of MLV for gene-therapy treatments on humans.

The overall MLV preference for the region surrounding the transcriptional start site and a preference for only the transcribed region of genes for HIV-1 integration in the human genome provide insights into the mechanisms of proviral integrations and important implications for their use as gene-therapy vectors. The different integration profiles suggest that there are fundamental mechanistic differences influencing site preferences for the two viruses, perhaps indicating specific interactions each virus has with different cellular cofactors. They also suggest that the risk factors for the use of MLV- and HIV-1–based vectors for gene therapy will not be identical. The two children who developed leukemia in the MLV trials both had an integration near the oncogene LMO2 (1, 2). We observed a preference for actively transcribed genes with MLV; however, even assuming no bias for individual genes, the data are troubling. In the X-linked severe combined immune deficiency syndrome clinical trials, >5 × 10⁶ cells with MLV integrations were injected into each child (21, 22). Assuming that 20% of integrations are near transcriptional start sites, there will be 1 million integrations distributed among the 18,214 RefSeq genes or an average of 55 integrations into the 5′ region of the LMO2 locus per treatment. Evaluation of the sites of integration of HIV-1–based vectors compared to those of MLV vectors will be necessary to fully understand the risk factors and advantages of different retroviral gene-therapy systems.
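The estimate of about 55 integrations near LMO2 per treatment follows from simple arithmetic, reproduced in the sketch below. It uses the lower bound of 5 × 10⁶ cells and assumes one integration per cell and an even spread across all RefSeq genes, which is the "no bias for individual genes" simplification stated in the text; the function and parameter names are illustrative.

```python
def expected_integrations_per_gene(cells, fraction_near_tss, n_genes):
    """Average number of start-site-proximal integrations per gene.

    Assumes one integration per cell and a uniform spread across genes.
    """
    near_tss = cells * fraction_near_tss  # integrations near any start site
    return near_tss / n_genes

near_tss_total = 5e6 * 0.20  # lower bound on cells, 20% near start sites
per_gene = expected_integrations_per_gene(5e6, 0.20, 18_214)
```

Because the cell count is a lower bound and MLV favors actively transcribed genes, the figure for an expressed gene could plausibly be higher than this uniform average.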

Supporting Online Material

Materials and Methods

SOM Text

Table S1

Fig. S1

References and Notes
