Decoding Human Cytomegalovirus


Science  23 Nov 2012:
Vol. 338, Issue 6110, pp. 1088-1093
DOI: 10.1126/science.1227919


The human cytomegalovirus (HCMV) genome was sequenced 20 years ago. However, like those of other complex viruses, our understanding of its protein coding potential is far from complete. We used ribosome profiling and transcript analysis to experimentally define the HCMV translation products and follow their temporal expression. We identified hundreds of previously unidentified open reading frames and confirmed a fraction by means of mass spectrometry. We found that regulated use of alternative transcript start sites plays a broad role in enabling tight temporal control of HCMV protein expression and allowing multiple distinct polypeptides to be generated from a single genomic locus. Our results reveal an unanticipated complexity to the HCMV coding capacity and illustrate the role of regulated changes in transcript start sites in generating this complexity.

The herpesvirus human cytomegalovirus (HCMV) infects the majority of humanity, leading to severe disease in newborns and immunocompromised adults (1). The HCMV genome is ~240 kb with estimates of between 165 and 252 open reading frames (ORFs) (2, 3). These annotations likely do not capture the complexity of the HCMV proteome (4) because HCMV has a complex transcriptome (5, 6), and genomic regions studied in detail reveal noncanonical translational events, including regulatory (7) and overlapping ORFs (8–11). Defining the full set of translation products, both stable and unstable (the latter with potential regulatory/antigenic function) (12), is critical for understanding HCMV.

To identify the range of HCMV-translated ORFs and monitor their temporal expression, we infected human foreskin fibroblasts (HFFs) with the clinical HCMV strain Merlin and harvested cells at 5, 24, and 72 hours after infection using four approaches to generate libraries of ribosome-protected mRNA fragments (Fig. 1A and table S1). The first two measured the overall in vivo distribution of ribosomes on a given message; infected cells were either pretreated with the translation elongation inhibitor cycloheximide or, to exclude drug artifacts, lysed without drug pretreatment (no-drug). Additionally, cells were pretreated with harringtonine or lactimidomycin (LTM), two drugs with distinct mechanisms, which lead to strong accumulation of ribosomes at translation initiation sites and depletion of ribosomes over the body of the message (Fig. 1A) (13–15). A modified RNA sequencing protocol allowed quantification of RNA levels as well as identification of 5′ transcript ends by generating a strong overrepresentation of fragments that start at the 5′ end of messages (fig. S1) (16).

Fig. 1

Ribosome profiling of HCMV-infected cells. (A) Ribosome occupancies after various treatments (illustrated to left); cycloheximide (CHX), no-drug, harringtonine (Harr), and LTM together with mRNA profiles of the UL25 gene at 72 hours after infection. An arrow marks the mRNA start. (B and C) Ribosome occupancy profiles for (B) UL38 and (C) UL10 genes that contain internal initiations. The gray area symbolizes a low-complexity region.

The ability of these approaches to provide a comprehensive view of gene organization is illustrated for the UL25 ORF: A single transcript start site is found upstream of the ORF (Fig. 1A, mRNA panel). Harringtonine and LTM mark a single translation initiation site at the first AUG downstream of the transcript start (Fig. 1A, Harr and LTM). Ribosome density accumulates over the ORF body ending at the first in-frame stop codon (Fig. 1A, CHX and no-drug). In the no-drug sample, excess ribosome density accumulates at the stop codon (Fig. 1A, no-drug) (14).
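The ORF definition used in this example (initiation at the first AUG downstream of the transcript start, termination at the first in-frame stop codon) can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and are not taken from the study's analysis pipeline.

```python
# Illustrative sketch only: names and the toy sequence are assumptions.
STOP_CODONS = {"UAA", "UAG", "UGA"}

def orf_from_start(mrna, start):
    """Return the ORF from `start` through the first in-frame stop codon.

    Returns None when no in-frame stop codon is found.
    """
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return None

def first_aug_orf(mrna):
    """Return the ORF that begins at the first AUG of the transcript."""
    start = mrna.find("AUG")
    if start == -1:
        return None
    return orf_from_start(mrna, start)

# Toy transcript: 6-nt leader, AUG, two sense codons, UAA stop, 3-nt trailer.
toy_transcript = "GGCACCAUGGCUCUGUAAGGC"
print(first_aug_orf(toy_transcript))
```

Passing a different index to `orf_from_start` gives the reading frame that would follow from an internal or near-cognate initiation site.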

Examination of the full range of HCMV translation products, as reflected by the ribosome footprints, revealed many putative previously unidentified ORFs: internal ORFs lying within existing ORFs either in-frame, resulting in N-terminally truncated translation products (Fig. 1B), or out of frame, resulting in entirely previously unknown polypeptides (Fig. 1C); short uORFs (upstream ORFs) lying upstream of canonical ORFs (Fig. 2A); ORFs within transcripts antisense to canonical ORFs (Fig. 2B); and previously unidentified short ORFs encoded by distinct transcripts (Fig. 2C). For all of these categories, we also observed ORFs starting at near-cognate codons (codons differing from AUG by one nucleotide), especially CUG (Fig. 2D).
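As a small illustration of the near-cognate definition given above (codons differing from AUG by one nucleotide), the following Python sketch enumerates the nine such codons. The function name is hypothetical.

```python
def near_cognate_codons(reference="AUG"):
    """List all codons that differ from `reference` at exactly one position."""
    bases = "ACGU"
    variants = set()
    for pos, ref_base in enumerate(reference):
        for base in bases:
            if base != ref_base:
                variants.add(reference[:pos] + base + reference[pos + 1:])
    return sorted(variants)

# Three positions times three alternative bases gives nine codons, including CUG.
print(near_cognate_codons())
```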

Fig. 2

Many ribosome footprints do not correspond to previously annotated ORFs. (A) Ribosome occupancy profiles for the leader region of the UL139 gene. (B) Ribosome occupancy profiles of plus and minus strands (red and blue, respectively) for the UL91 gene. (C) mRNA and ribosome occupancy profiles for a previously unidentified short ORF. (D) Ribosome occupancies around a short ORF that initiates at a CUG codon. (E) Ribosome occupancy profiles for RNA β2.7. (Top) The annotated MS/MS spectra of two distinct peptides originating from ORFL6C and ORFL7C.

HCMV expresses several long RNAs lacking canonical ORFs, including β2.7, an abundant RNA, which inhibits apoptosis (17). In agreement with β2.7’s observed polysome association (18), multiple short ORFs are translated from this RNA (Fig. 2E and fig. S2), and the corresponding proteins for two of these ORFs were detected by means of high-resolution MS (Fig. 2E). Although the translation efficiency of these ORFs is low, four of them are highly conserved across HCMV strains (table S2). We found three similar polycistronic coding RNAs (including RNA1.2 and RNA4.9), and two short proteins encoded by these RNAs were confirmed with MS (fig. S3).

To define systematically the HCMV-translated ORFs using the ribosome profiling data, we first annotated HCMV splice junctions, identifying 88 splice sites (table S3). We then exploited the harringtonine-induced accumulation of ribosomes at translation start sites so as to identify ORFs using a support vector machine (SVM)–based machine learning strategy (14, 19). We observed a strong enrichment for AUG (33-fold) and near cognate codons in the translation initiation sites identified with this analysis (Fig. 3A). Visual inspection of the ribosome profiling data confirmed the SVM-identified ORFs and suggested an additional 53 putative ORFs (table S4). The large majority (86%) of the SVM-identified ORFs, and all of the manually identified ones, were identified by means of SVM analysis of an independent biological replicate (table S5 and fig. S4). The observed initiation sites were not caused by harringtonine because LTM treatment also induced ribosome accumulation at the vast majority (>98%) of these positions (Fig. 3B).
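The fold-enrichment measure reported for start codons can be sketched as the frequency of a codon among predicted initiation sites divided by its frequency in a genomic background. The Python example below is a minimal illustration under that assumption; the function name and the toy counts are hypothetical, and it does not reproduce the SVM-based site calling itself.

```python
from collections import Counter

def codon_fold_enrichment(site_codons, background_codons):
    """Fold enrichment per codon: (site frequency) / (background frequency).

    Codons absent from the background are skipped because the ratio
    is undefined for them.
    """
    site_counts = Counter(site_codons)
    bg_counts = Counter(background_codons)
    n_sites = len(site_codons)
    n_bg = len(background_codons)
    enrichment = {}
    for codon, count in site_counts.items():
        bg = bg_counts.get(codon, 0)
        if bg == 0:
            continue
        # Equivalent to (count / n_sites) / (bg / n_bg), with one division.
        enrichment[codon] = (count * n_bg) / (n_sites * bg)
    return enrichment

# Toy data (hypothetical): codons at called start sites vs. a background sample.
sites = ["AUG"] * 6 + ["CUG"] * 2 + ["GCU"] * 2
background = ["AUG"] * 2 + ["CUG"] * 4 + ["GCU"] * 14
print(codon_fold_enrichment(sites, background))
```

In this toy input AUG makes up 60% of called sites but 10% of the background, so it is sixfold enriched; the values have no relation to the enrichment reported in Fig. 3A.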

Fig. 3

Annotating the HCMV-translated ORFs. (A) Fold enrichment of AUG and near-cognate codons at predicted sites of translation initiation compared with their genomic distribution. (B) Ribosome footprint occupancy after LTM treatment at each start codon (relative to the median density across the gene) is depicted for the previously annotated ORFs (blue) and newly identified ORFs (red; empty red for ORFs that were removed). The occupancy at a codon five positions downstream of the start codon is depicted as a control (green). (C) Venn diagram summarizing the HCMV-translated ORFs. Fifty-three ORFs were initially identified through manual inspection. (D) Length distribution of newly identified ORFs (red) and previously annotated ORFs (blue). (E) Position of 30-nt ribosome footprints relative to the reading frame in the newly identified ORFs (red) and previously annotated ORFs (blue). (F) MRC-5 cells were mock-treated or infected with TB40-US33A-hemagglutinin (HA), and protein lysates were analyzed with Western blotting with indicated antibodies. (G) HeLa cells were transfected with GFP fusion proteins together with an ER marker (KDEL-mCherry) or stained with MitoTracker Red (Invitrogen, Grand Island) and imaged by means of confocal microscopy.

In total, we identified 751 translated ORFs that were supported by both the LTM and harringtonine data (tables S5 and S6 and file S1). The footprint density measurements for these ORFs were reproducible between biological replicates (figs. S5 and S6). Of these ORFs, 147 were previously suggested to be coding (Fig. 3C). We did not find strong evidence of translation for 24 previously annotated ORFs (table S7), although these proteins may well be expressed under different conditions.

Many newly identified ORFs are very short (245 ORFs ≤ 20 codons) (Fig. 3C) and are found upstream of longer ORFs. We also identified 239 short ORFs (21 to 80 codons) (Fig. 3D). Last, we identified 120 ORFs that are longer than 80 amino acids. These are primarily ORFs that contain splice junctions or alternative 5′ ends of previous annotations.

Several lines of evidence support the validity of the ORFs we identified. First, as seen for the previously annotated ORFs, newly identified ORFs showed a significant [P < 10^-70; Kolmogorov-Smirnov (K-S) test] excess of ribosome footprints at the predicted stop codon (Fig. 1A and fig. S7). Because our ORF predictions were based on translation initiation sites found in the harringtonine and LTM samples, the observation that these accurately predicted downstream stop codons in an untreated sample provides independent support for our approach. Second, ribosome-protected footprints displayed a 3-nucleotide (nt) periodicity that was in phase with the predicted start site both globally (Fig. 3E) and in specific ORFs that contain internal out-of-frame ORFs (fig. S8). Third, brief inhibition of translation initiation using an eIF4A inhibitor Pateamine A (20) led to depletion of ribosome density from the body of the large majority of the predicted ORFs (fig. S9), indicating that the ribosomes were engaged in active elongation. The newly identified ORFs also exhibited a distribution of expression levels similar to that of previously annotated canonical ORFs (fig. S10). Last, many of the newly identified ORFs are conserved in other HCMV strains (table S2).
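The 3-nt periodicity check can be illustrated by tallying footprint positions by reading frame relative to a predicted start codon. The Python sketch below assumes each footprint has already been reduced to a single genomic coordinate (for example, an offset-corrected ribosome position); that preprocessing, the function name, and the toy coordinates are assumptions and are not described in the text.

```python
def frame_fractions(footprint_positions, start):
    """Fraction of footprints falling in frames 0, 1, and 2 relative to `start`.

    `footprint_positions` holds one coordinate per footprint; how that
    coordinate is derived from a read is assumed to be handled upstream.
    """
    counts = [0, 0, 0]
    for pos in footprint_positions:
        counts[(pos - start) % 3] += 1
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [c / total for c in counts]

# Toy coordinates (hypothetical): most footprints are in phase with the start at 100.
positions = [100, 103, 106, 109, 110, 112, 115, 116]
print(frame_fractions(positions, start=100))
```

A strong excess in one frame is the signature expected for a translated ORF. For an internal out-of-frame ORF, running the same tally against its own start codon would show the excess in a frame shifted relative to the enclosing ORF.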

High-resolution tandem mass spectrometric measurements on virally infected cells by using stringent criteria and manual validation (files S2 and S3) (16, 21) unambiguously detected 53 previously unidentified proteins out of the 96 genomic loci that are not overlapping with annotated ORFs and contain at least one specific previously unidentified protein that is longer than 55 amino acids (table S8). For classes of new ORFs that were difficult to monitor with MS (truncated forms of longer proteins or short proteins), we used a tagging approach. For two N-terminally truncated proteins (derived from UL16 and UL38), we confirmed the appearance of alternative shorter transcripts and detected the expected full length and truncated tagged protein products (fig. S11). The truncated protein derived from UL16 was also observed in the context of the native virus (fig. S12), and we confirmed a splice variant of UL138 by using an antibody (fig. S12). For five short ORFs (including two initiated at near cognate start sites), we fused the ORFs in frame to a green fluorescent protein (GFP)–coding region in their otherwise native transcript context. We identified protein products of the expected sizes and confirmed that we correctly identified the translation start sites (fig. S13). We also showed that one of these short proteins (US33A-57aa), which was not identified with MS but was recently predicted by means of transcript analysis to be coding (6), is expressed in the context of the native virus (Fig. 3F and fig. S12). Additionally, we focused on the very short, near cognate driven uORFs that lie directly upstream of UL119 and US9, whose inclusion changes during infection as a result of changes in the 5′ end of the transcripts. We found that these uORFs modulated the translation efficiency of a downstream reporter gene (fig. S14).

Last, we examined the subcellular localization for 18 newly identified ORFs (11 of which were detected by means of mass spectrometry) (table S9) using transient expression of GFP-tagged proteins. We detected 15 proteins, 10 of which showed specific subcellular localization patterns: six in mitochondria, three in the endoplasmic reticulum (ER), and one in the nucleus (Fig. 3G and fig. S15). Immunoprecipitation and MS experiments on two of these GFP-tagged proteins, ORF359W (ER localized) and US33A (mitochondrially localized), identified a few specific interacting proteins. Western blot analysis confirmed the interactions with TAP1 (ORF359W) and the mitochondrial inner membrane transport TIM machinery (US33A) (fig. S16).

HCMV genes are expressed in a temporally regulated cascade. Our data provide an opportunity to monitor viral protein translation throughout infection. Most of the viral genes, including newly identified ORFs, showed tight temporal regulation of protein synthesis levels; 82% of ORFs varied by at least fivefold. Hierarchical clustering of viral coding regions by their footprint densities during infection (a measure of the relative translation rates) revealed several distinct temporal expression patterns (fig. S17).
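One simple way to express the "varied by at least fivefold" criterion is the ratio of the highest to the lowest footprint density across the three time points. The Python sketch below is an illustration under that assumption; the pseudocount, the function names, the ORF labels, and the density values are all hypothetical, and the exact metric used in the study is not specified here.

```python
def fold_variation(densities, pseudocount=1.0):
    """Max/min ratio of footprint densities across time points.

    The pseudocount (an assumption) avoids division by zero for ORFs
    with no footprints at some time point.
    """
    values = [d + pseudocount for d in densities]
    return max(values) / min(values)

def fraction_regulated(orf_densities, threshold=5.0):
    """Fraction of ORFs whose density varies by at least `threshold`-fold."""
    flagged = [
        name for name, densities in orf_densities.items()
        if fold_variation(densities) >= threshold
    ]
    return len(flagged) / len(orf_densities)

# Toy densities at 5, 24, and 72 hours after infection (hypothetical values).
toy = {
    "orf_a": [9, 49, 199],
    "orf_b": [99, 119, 149],
    "orf_c": [0, 4, 24],
    "orf_d": [19, 19, 39],
}
print(fraction_regulated(toy))
```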

As was seen previously for a limited number of genomic loci (8–11, 22), examination of viral transcripts during infection revealed a pervasive use of alternative 5′ ends that is critical to the tight temporal regulation of viral gene expression and production of alternate protein products during infection. For example, at the US18-US20 locus, 5 hours after infection there is one main transcript starting just upstream of US20 enabling US20 translation. At 24 hours after infection, a shorter version of the transcript is detected starting immediately upstream of US18, enabling its translation. A third previously unknown transcript isoform starting within the US18 coding sequence emerges at 72 hours after infection, resulting in translation of a truncated version of US18 (ORFS346C.1) at this time point (Fig. 4, A and B). Another example is detailed in fig. S18, and we identified reproducible temporal regulation of 5′ ends in 61 viral loci (encompassing ~350 ORFs) (figs. S19 and S20 and table S10), six of which we confirmed with Northern blot analysis (Fig. 4B and figs. S11 and S21). Thus, our studies reveal a pervasive mode of viral gene regulation in which dynamic changes in 5′ ends of transcripts control protein expression from overlapping coding regions. Just as alternative splicing (a process in which a single gene codes for multiple proteins) expands protein diversity, alternative transcript start sites may provide a broadly used mechanism for generating complex proteomes.

Fig. 4

A major source of ORF diversity during infection is alternative transcript starts. (A) The mRNA and ribosome occupancy profiles around the US18 to US20 loci at different infection times (marked left). Small arrows denote the different mRNA starts, and (top) the corresponding mRNAs are illustrated. (Bottom) An expanded view of the US18 locus at 72 hours after infection, including the harringtonine and LTM profiles (asterisks indicate the internal initiation). (B) Total RNA extracted at different time points during infection was subjected to Northern blotting for ORFS346C.1.

The genomic era began with the sequencing of the bacterial DNA virus, phi X, in 1977 (23) and the mammalian DNA virus, Simian virus 40 (24), the following year. Since then, extraordinary advances in sequencing technology have enabled the determination of a vast array of viral genomes. Deciphering their protein coding potential, however, remains challenging. Here, we present an experimentally based analysis of translation of a complex DNA virus, HCMV, by using both next-generation sequencing and high-resolution proteomics. It is possible that many of the short ORFs we have identified are rapidly degraded and do not act as functional polypeptides. Nonetheless, these could still have regulatory function or be an important part of the immunological repertoire of the virus as major histocompatibility complex (MHC) class I bound peptides are generated at higher efficiency from rapidly degraded polypeptides (25). Our work yields a framework for studying HCMV by establishing the viral proteome and its temporal regulation, providing a context for mutational studies and revealing the full range of HCMV functional and antigenic potential.

Supplementary Materials

Materials and Methods

Figs. S1 to S22

Tables S1 to S10

Files S1 to S3

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: We thank O. Mandelboim, D. Wolf, M. Trilling, A. Lauring, S. Karniely, and Weissman lab members for critical reading of the manuscript; C. Chu for assistance with sequencing; and J. Pelletier for providing Pateamine A. N.S.-G. is supported by a Human Frontier Science Program postdoctoral fellowship. This work was supported by the Howard Hughes Medical Institute (J.S.W.) and the Max Planck Society (M.M.). The Gene Expression Omnibus accession number for the data is GSE41605.