Research Article

Characterization of a Novel Coronavirus Associated with Severe Acute Respiratory Syndrome


Science  30 May 2003:
Vol. 300, Issue 5624, pp. 1394-1399
DOI: 10.1126/science.1085952

Abstract

In March 2003, a novel coronavirus (SARS-CoV) was discovered in association with cases of severe acute respiratory syndrome (SARS). The sequence of the complete genome of SARS-CoV was determined, and the initial characterization of the viral genome is presented in this report. The genome of SARS-CoV is 29,727 nucleotides in length and has 11 open reading frames, and its genome organization is similar to that of other coronaviruses. Phylogenetic analyses and sequence comparisons showed that SARS-CoV is not closely related to any of the previously characterized coronaviruses.

Several hundred cases of severe atypical pneumonia of unknown etiology were reported in Guangdong Province of the People's Republic of China beginning in late 2002. After similar cases were detected in patients in Hong Kong, Vietnam, and Canada during February and March 2003, the World Health Organization (WHO) issued a global alert for the illness, designated “severe acute respiratory syndrome” (SARS). In mid-March 2003, SARS was recognized in health care workers and household members who had cared for patients with severe respiratory illness in Hong Kong and Vietnam. Many of these cases could be traced through multiple chains of transmission to a health care worker from Guangdong Province who visited Hong Kong, where he was hospitalized with pneumonia and died. By late April 2003, over 4300 SARS cases and 250 SARS-related deaths were reported to WHO from over 25 countries around the world. Most of these cases occurred after exposure to SARS patients in household or health care settings. The incubation period for the disease is usually from 2 to 7 days. Infection is usually characterized by fever, which is followed a few days later by a dry nonproductive cough and shortness of breath. Death from progressive respiratory failure occurs in about 3% to nearly 10% of cases (1–4).

In response to this outbreak, WHO coordinated an international collaboration that included clinical, epidemiologic, and laboratory investigations, and initiated efforts to control the spread of SARS. Attempts to identify the etiology of the SARS outbreak were successful during the third week of March 2003, when laboratories in the United States, Canada, Germany, and Hong Kong isolated a novel coronavirus (SARS-CoV) from SARS patients. Unlike other human coronaviruses, it was possible to isolate SARS-CoV in Vero cells. Evidence of SARS-CoV infection has now been documented in SARS patients throughout the world. SARS-CoV RNA has frequently been detected in respiratory specimens, and convalescent-phase serum specimens from SARS patients contain antibodies that react with SARS-CoV. There is strong evidence that this new virus is etiologically linked to the outbreak of SARS (5–7).

The coronaviruses (order Nidovirales, family Coronaviridae, genus Coronavirus) are a diverse group of large, enveloped, positive-stranded RNA viruses that cause respiratory and enteric diseases in humans and other animals. At ∼30,000 nucleotides (nt), their genome is the largest found in any of the RNA viruses. There are three groups of coronaviruses; groups 1 and 2 contain mammalian viruses, whereas group 3 contains only avian viruses. Within each group, coronaviruses are classified into distinct species by host range, antigenic relationships, and genomic organization. Coronaviruses typically have narrow host ranges and are fastidious in cell culture. The viruses can cause severe disease in many animals; and several viruses, including infectious bronchitis virus, feline infectious peritonitis virus, and transmissible gastroenteritis virus, are important veterinary pathogens. Human coronaviruses (HCoVs) are found in both group 1 (HCoV-229E) and group 2 (HCoV-OC43) and are responsible for ∼30% of mild upper respiratory tract illnesses (8–10).

Sequence analysis of a limited region of the replicase (rep) gene suggested that SARS-CoV was distinct from all other coronaviruses (5–7). In this report, we compare the sequence of the entire genome of SARS-CoV (Urbani strain) to the genomic sequences of other coronaviruses.

Genome organization. The sequence of the entire genome of SARS-CoV (GenBank accession number AY278741) was obtained by several approaches (11). During completion of this manuscript, other laboratories determined the genomic sequences of three additional strains of SARS-CoV. These nucleotide sequences vary at only 24 positions (table S3).

The genome of SARS-CoV is a 29,727-nucleotide, polyadenylated RNA, and 41% of the residues are G or C (the range for published complete coronavirus genome sequences is 37 to 42%). The genomic organization is typical of coronaviruses, having the characteristic gene order [5′-replicase (rep), spike (S), envelope (E), membrane (M), and nucleocapsid (N)-3′] and short untranslated regions at both termini (Fig. 1A and table S1). The SARS-CoV rep gene, which comprises approximately two-thirds of the genome, is predicted to encode two polyproteins (encoded by ORF1a and ORF1b) that undergo cotranslational proteolytic processing. There are four open reading frames (ORFs) downstream of rep that are predicted to encode the structural proteins S, E, M, and N, which are common to all known coronaviruses. The gene encoding hemagglutinin-esterase, which is present between ORF1b and S in group 2 and some group 3 coronaviruses (8), was not found.
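The G+C figure quoted above is a simple base count. As a minimal sketch (the function name and the toy sequence are illustrative assumptions, not part of the reported analysis), the percentage could be computed from any nucleotide string as follows; ambiguous bases are skipped here, which is a choice of this sketch.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G or C residues among unambiguous bases."""
    # Keep only unambiguous nucleotides (assumption: N and other codes are ignored).
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous nucleotides")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical toy sequence, not taken from the SARS-CoV genome.
toy_sequence = "ATGCGCATTA"
print(gc_percent(toy_sequence))
```

Applied to the full genomic sequence, a function of this kind would be expected to give a value near the 41% reported above.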

Fig. 1.

Genome organization and mRNA mapping of SARS-CoV. (A) Overall organization of the 29,727-nt SARS-CoV genomic RNA. The 72-nt leader sequence is represented by a small orange square at the 5′ terminus of the genome and the subgenomic mRNAs (below). Predicted ORFs 1a and 1b, encoding the nonstructural polyproteins, and those encoding the S, E, M, and N structural proteins are indicated. The vertical position of the boxes indicates the phase of the reading frame. (B) Expanded view of the structural protein coding region and predicted mRNA transcripts. Known structural protein coding regions (blue boxes) and reading frames X1 to X5, encoding potential nonstructural proteins longer than 50 amino acids (gray boxes), are indicated. Lengths and map locations of the 3′-coterminal mRNAs, as predicted by identification of conserved transcription-regulating sequences, are indicated. (C) Northern blot analysis of SARS-CoV mRNAs. Poly(A)+ RNA was separated on a formaldehyde-agarose gel, transferred to a nylon membrane, and hybridized with a digoxigenin-labeled riboprobe overlapping the 3′ untranslated region. Signals were visualized by chemiluminescence. Sizes of the SARS-CoV mRNAs were calculated by interpolation from a log-linear fit of those of the molecular mass marker. Lane 1, SARS-CoV mRNA; lane 2, Vero E6 cell mRNA; lane 3, molecular mass marker (sizes in kilobases).
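The size interpolation mentioned in the legend for panel C can be sketched as an ordinary least-squares fit of log10(size) against migration distance. The marker distances and sizes below are hypothetical values chosen for illustration, and the function names are assumptions of this sketch; the legend does not specify the fitting software that was used.

```python
import math


def fit_log_linear(distances_mm, sizes_kb):
    """Least-squares fit of log10(size) = a + b * distance; returns (a, b)."""
    n = len(distances_mm)
    logs = [math.log10(size) for size in sizes_kb]
    mean_x = sum(distances_mm) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances_mm)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_mm, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_size_kb(distance_mm, intercept, slope):
    """Interpolate the size of a band from its migration distance."""
    return 10 ** (intercept + slope * distance_mm)


# Hypothetical marker lane: migration distance (mm) and known size (kb).
marker_mm = [10.0, 20.0, 30.0, 40.0]
marker_kb = [8.0, 4.0, 2.0, 1.0]

a, b = fit_log_linear(marker_mm, marker_kb)
print(round(estimate_size_kb(25.0, a, b), 2))
```

Larger RNAs migrate more slowly, so the fitted slope is negative; a band lying between two marker bands receives a size between theirs on the logarithmic scale.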

Coronaviruses also encode a number of nonstructural proteins that are located between S and E, between M and N, or downstream of N. These nonstructural proteins, which vary widely among the different coronavirus species, are of unknown function and are dispensable for virus replication (8). The genome of SARS-CoV contains ORFs for five potential nonstructural proteins that are more than 50 amino acids long in these intergenic regions (Fig. 1B, Table 1, and table S1). Two overlapping ORFs encoding predicted proteins of 274 and 154 amino acids (termed X1 and X2, respectively) are located between S and E. Three additional potential nonstructural genes, X3, X4, and X5 (encoding proteins of 63, 122, and 84 amino acids, respectively), are located between M and N. In addition to the five ORFs encoding the predicted nonstructural proteins described above, there are also two smaller ORFs between M and N, encoding predicted proteins of less than 50 amino acids (Table 1). Searches of the GenBank database (with BLAST and FastA) indicated that there is no significant sequence similarity between these potential nonstructural proteins of SARS-CoV and any other proteins (12). Note that there are ORFs encoding predicted proteins more than 50 amino acids long in the structural genes of SARS-CoV (such as N, S, and rep). Many short ORFs are present in the structural genes. They are unlikely to be expressed and, for simplicity, they are not shown in Fig. 1.
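The kind of scan that yields ORFs of more than 50 amino acids can be sketched as follows. This is a minimal forward-strand search under stated assumptions: an ORF starts at the first ATG in a frame and ends at the next in-frame stop codon, and coordinates are 1-based and exclude the stop codon. The paper does not describe its ORF-calling procedure in this section, so the function and its conventions are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa=51):
    """Return sorted (start, end, n_aa) tuples for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the stop codon is excluded
    (assumed convention). min_aa=51 keeps ORFs of more than 50 amino acids.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa >= min_aa:
                    orfs.append((start + 1, i, n_aa))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical toy sequence with one short ORF (Met followed by three Ala).
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_aa=3))
```

With the default threshold the toy sequence returns no ORFs, which mirrors how the short ORFs mentioned above drop out of Fig. 1 once a length cutoff is applied.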

Table 1.

Classification of ORFs encoding potential nonstructural proteins of SARS-CoV. [The table shows the differences in nomenclature used to describe ORFs encoding potential nonstructural proteins of SARS-CoV in this report and in the report by Marra et al. (30). These differences are in nomenclature only, and the seven nt sequence differences between these strains do not change the position or number of ORFs (table S2). Because the complete SARS-CoV sequences have been available for only a few weeks and will probably be analyzed in great detail in the upcoming months, any nomenclature proposed at this time should be considered preliminary. The nomenclature used for the nonstructural proteins X1 to X5 is expected to be clarified once experiments on the transcriptional expression of the SARS-CoV genome are reported.]

Genome location (nt)*    Protein (number of amino acids)    This report†       Marra et al. (30)‡
25,268 to 26,089         274                                X1                 ORF3
25,689 to 26,150         154                                X2                 ORF4
27,074 to 27,262         63                                 X3                 ORF7
27,273 to 27,638         122                                X4                 ORF8
27,638 to 27,769         44                                 <50 amino acids    ORF9
27,779 to 27,895         39                                 <50 amino acids    ORF10
27,864 to 28,115         84                                 X5                 ORF11
28,130 to 28,423§        98                                 See text           ORF13
28,583 to 28,792§        70                                 See text           ORF14

* Based on the sequence of the Urbani strain of SARS-CoV (GenBank accession no. AY278741.1).

† In this report, the ORFs encoding the predicted nonstructural proteins are designated as X1 to X5 and are numbered sequentially beginning at the 5′ terminus of the genome. Only ORFs encoding for predicted proteins longer than 50 amino acids are included in Fig. 1B. The locations and sizes of the ORFs encoding the predicted replicase protein, structural proteins, and nonstructural proteins are shown in table S2.

‡ In Marra et al. (30), all of the ORFs, including those encoding the predicted replicase protein and structural proteins, are numbered sequentially from the 5′ terminus of the genome. This table shows only ORFs encoding predicted nonstructural proteins.

§ These ORFs overlap the coding region of the N protein.
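The coordinates and protein lengths in Table 1 can be cross-checked arithmetically: for every row, the inclusive span divided by three equals the listed number of amino acids. The sketch below encodes the table values and performs that check; the function names are assumptions of this sketch, and whether the listed coordinates include a stop codon is not stated in the table, so only the arithmetic relation is tested.

```python
# (start, end, amino acids) transcribed from Table 1.
TABLE_1 = [
    (25268, 26089, 274),
    (25689, 26150, 154),
    (27074, 27262, 63),
    (27273, 27638, 122),
    (27638, 27769, 44),
    (27779, 27895, 39),
    (27864, 28115, 84),
    (28130, 28423, 98),
    (28583, 28792, 70),
]


def codons_spanned(start, end):
    """Number of codons in a 1-based inclusive nucleotide span."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    return length // 3


def reading_frame(start):
    """Frame (0, 1, or 2) of an ORF relative to genome position 1."""
    return (start - 1) % 3


def inconsistent_rows(rows):
    """Return the rows whose span does not match the listed protein length."""
    return [row for row in rows if codons_spanned(row[0], row[1]) != row[2]]


print(inconsistent_rows(TABLE_1))
print(reading_frame(25268), reading_frame(25689))
```

The first two rows overlap between nt 25,689 and 26,089 yet start in different frames, which is consistent with the description of X1 and X2 as overlapping ORFs.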

The coronavirus rep gene products are translated from genomic RNA, but the remaining viral proteins are translated from subgenomic mRNAs that form a 3′-coterminal nested set, each with a 5′ end derived from the genomic 5′ leader sequence. The coronavirus subgenomic mRNAs are synthesized through a discontinuous transcription process, the mechanism of which has not been unequivocally established (8, 13). The SARS-CoV leader sequence was mapped by comparing the sequence of 5′ RACE (rapid amplification of cDNA ends) (11) products synthesized from the N gene mRNA with those synthesized from genomic RNA. A sequence, AAACGAAC (genomic nucleotides 65 to 72), was identified immediately upstream of the site where the N gene mRNA and genomic sequences diverged. This sequence was also present upstream of ORF1a and immediately upstream of five other ORFs (Fig. 1, A and B, and table S1), suggesting that it functions as the conserved core of the transcription-regulating sequences (TRSs). The nucleotides required for TRS function must be identified experimentally.
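Locating every copy of the candidate TRS core in a genome is an exact-match search. A minimal sketch follows, assuming a DNA-alphabet sequence string; the function name and the toy sequence are hypothetical, and only the motif AAACGAAC comes from the text.

```python
def find_motif(seq, motif="AAACGAAC"):
    """Return 1-based inclusive (start, end) positions of every motif copy."""
    seq = seq.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append((pos + 1, pos + len(motif)))
        pos = seq.find(motif, pos + 1)  # step by one so overlapping copies are found
    return hits


# Hypothetical toy sequence containing two copies of the motif.
toy = "GG" + "AAACGAAC" + "TTT" + "AAACGAAC"
print(find_motif(toy))
```

An exact-match search of this kind identifies candidate sites only; as noted above, which nucleotides are actually required for TRS function has to be established experimentally.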

    The favored model for production of subgenomic mRNAs of coronaviruses proposes that discontinuous transcription occurs during synthesis of the negative strand (13). Subgenomic negative strands containing a complementary copy of the leader sequence at their 3′ termini serve as templates for synthesis of subgenomic mRNAs. In addition to the site at the 5′ terminus of the genome, the TRS conserved core sequence appears six times in the remainder of the genome. The positions of the TRS in the genome of SARS-CoV predict that subgenomic mRNAs of 8.3, 4.5, 3.4, 2.5, 2.0, and 1.7 kb, not including the poly(A) tail, should be produced (Fig. 1, A and B, and table S1). At least five subgenomic mRNAs were detected by Northern hybridization of RNA from SARS-CoV–infected cells, using a probe derived from the 3′ untranslated region (Fig. 1C). The calculated sizes of the five predominant bands correspond to the sizes of five of the predicted subgenomic mRNAs of SARS-CoV; we cannot exclude the possibility that other, low-abundance mRNAs are present. Full-length genomic RNA was not detected, probably because it is the least prevalent viral RNA in infected cells (8). The predicted 2.0-kb transcript was also not detected, which suggests that the consensus TRS at nt 27,771 to 27,778 is not used or that it is a low-abundance mRNA. By analogy with other coronaviruses (8), the 8.3-kb and 1.7-kb subgenomic mRNAs are predicted to be monocistronic, directing translation of S and N, respectively, whereas multiple proteins could be translated from the 4.5-kb (X1, X2, and E), 3.4-kb (M and X3), and 2.5-kb (X4 and X5) mRNAs. A consensus TRS is not found directly upstream of the ORF encoding the predicted E protein (14), and a monocistronic mRNA that would be predicted to code for E could not be clearly identified by Northern blot analysis. It is possible that the 3.4-kb band contained more than one mRNA species that were not resolved in the gel or that the monocistronic mRNA for E is a low-abundance message. Also, in some coronaviruses, the E protein is translated from the second ORF on a polycistronic mRNA (15, 16).
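The relation between a TRS position and the predicted size of its subgenomic mRNA can be illustrated with the one TRS whose coordinates are given in the text (nt 27,771 to 27,778). The sketch assumes a simplified junction model in which the 72-nt leader, whose last eight nucleotides are the TRS core, is joined to the genomic sequence immediately downstream of the body TRS; the function name is hypothetical and the poly(A) tail is excluded.

```python
GENOME_LENGTH = 29_727  # nt, from the text
LEADER_LENGTH = 72      # nt, leader ending in the TRS core at nt 65 to 72


def predicted_mrna_length(body_trs_end):
    """Leader plus everything downstream of the body TRS (simplified model)."""
    return LEADER_LENGTH + (GENOME_LENGTH - body_trs_end)


# TRS at nt 27,771 to 27,778: 72 + (29,727 - 27,778) = 2,021 nt.
print(predicted_mrna_length(27_778))
```

Under this assumption the result is 2,021 nt, in line with the predicted 2.0-kb transcript discussed above.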

    Phylogenetic analyses of the sequence of SARS-CoV. To determine the relationship between SARS-CoV and the previously characterized coronaviruses, we compared the predicted amino acid sequences for three well-defined enzymatic proteins encoded by the rep gene and the four major structural proteins of SARS-CoV with those from representative viruses for each of the species of coronavirus for which complete genomic sequence information was available (Fig. 2). The topologies of the resulting phylograms are remarkably similar (Fig. 2A). For each protein analyzed, the species formed monophyletic clusters consistent with the established taxonomic groups. In all cases, SARS-CoV sequences segregated into a fourth, well-resolved branch. These clusters were supported by bootstrap values above 90% [1000 replicates (17)]. Consistent with pairwise comparisons between the previously characterized coronavirus species (Fig. 2B), there was greater sequence conservation in the enzymatic proteins [3CLpro, polymerase (POL), and helicase (HEL)] than among the structural proteins (S, E, M, and N). These results indicate that SARS-CoV is not closely related to any of the previously characterized coronaviruses and forms a distinct group within the genus Coronavirus. SARS-CoV is approximately equidistant from all previously characterized coronaviruses, just as the existing groups are from one another. Detailed pairwise comparison by dot-plot analysis identified many regions of amino acid conservation within each protein (fig. S1), but the overall level of similarity between SARS-CoV and the other coronaviruses was low (Fig. 2B). No evidence for recombination was detected when the predicted protein sequences were analyzed with the program Sim-Plot (17, 18).

    Fig. 2.

    Phylogenetic analysis and pairwise identities of coronavirus proteins. Predicted amino acid sequences of SARS-CoV proteins were compared with those from reference viruses representing each species in the three groups of coronaviruses for which complete genomic sequence information was available [group 1(G1): human coronavirus 229E (HCoV-229E), af304460; porcine epidemic diarrhea virus (PEDV), af353511; transmissible gastroenteritis virus (TGEV), aj271965. Group 2 (G2): bovine coronavirus (BCoV), af220295; murine hepatitis virus (MHV), af201929. Group 3 (G3): infectious bronchitis virus (IBV), m95169]. Sequences for representative strains of other coronavirus species, for which partial sequence information was available, were included for some of the structural protein comparisons [group 1: canine coronavirus (CCoV), d13096; feline coronavirus (FCoV), ay204704; porcine respiratory coronavirus (PRCoV), z24675. Group 2: human coronavirus OC43 (HCoV-OC43), m76373, l14643, m93390; porcine hemagglutinating encephalomyelitis virus (HEV), ay078417; rat coronavirus (RtCoV), af207551]. (A) Sequence alignments and neighbor-joining trees were generated by the use of ClustalX 1.83 with the Gonnet protein comparison matrix. The resulting trees were adjusted for final output with treetool 2.0.1. (B) Uncorrected pairwise distances were calculated from the aligned sequences with the Distances program from the Wisconsin Sequence Analysis Package, version 10.2 (Accelrys, Burlington, MA). Distances were converted to percent identity by subtracting from 100. aa, amino acid.
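The conversion described for panel B, uncorrected pairwise distance subtracted from 100, can be sketched for a pair of aligned sequences as follows. This is not the Distances program itself: the function name is hypothetical, and the treatment of gaps (columns with a gap in either sequence are skipped) is an assumption of this sketch.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity as 100 minus the uncorrected distance (percent)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not scored
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    distance = 100.0 * mismatches / compared  # uncorrected p-distance in percent
    return 100.0 - distance


# Hypothetical toy alignment: five ungapped columns, one mismatch.
print(percent_identity("MKT-LV", "MRTALV"))
```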

    Predicted replicase gene products of SARS-CoV. Coronaviruses encode a chymotrypsin-like protease, 3CLpro, that is analogous to the main picornaviral protease 3Cpro (19). They also encode one (group 3) or two (groups 1 and 2) papain-like proteases, termed PLP1pro and PLP2pro, which are analogous to the foot-and-mouth disease virus leader protease Lpro. Overall, gene products of ORF1a are poorly conserved among different coronaviruses, except for these protease sequences (fig. S1). The predicted gene product of ORF1a of SARS-CoV appears to contain only one PLPpro domain at amino acids 1632 to 1847. The 3CLpro catalytic histidine and cysteine residues are fully conserved among all coronaviruses (SARS-CoV amino acids His3281 and Cys3385), but coronaviruses appear to lack the conserved catalytic acidic residue that is characteristic of other 3C-like proteases (19). The coronavirus replicase polyprotein is synthesized by a –1 ribosomal frameshift at a conserved “slippery” site (UUUAAAC) immediately upstream of a pseudoknot structure in the overlap of ORF1a and ORF1b. This polyprotein is autocatalytically processed to yield the mature viral proteases PLPpro and 3CLpro, the RNA-dependent polymerase (POL), the RNA helicase (HEL), and other proteins whose functions have not been well characterized. The predicted ribosomal frame shift at the SARS-CoV slippery site (nt 13,392 to 13,398) would result in translation of 7073 amino acids from a single start site.
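The effect of a –1 frameshift at the slippery heptamer can be illustrated on a toy sequence. In the sketch below, translation proceeds in the original frame through the last codon of the heptamer and then resumes one nucleotide back, so that the final nucleotide of the heptamer is read a second time. The toy sequence and function names are hypothetical, the standard genetic code is assumed, and neither the pseudoknot nor frameshift efficiency is modeled.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate from the first nucleotide until a stop codon or the end."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def translate_with_minus1_shift(rna, slippery="TTTAAAC"):
    """Translate with a -1 shift after the slippery heptamer (X XXY YYZ)."""
    seq = rna.upper().replace("U", "T")
    site = seq.find(slippery)
    # The first heptamer base must be the third position of a frame-0 codon.
    if site == -1 or site % 3 != 2:
        raise ValueError("no slippery site in the expected frame")
    end = site + len(slippery)
    upstream = translate(seq[:end])
    if len(upstream) * 3 < end:
        raise ValueError("stop codon occurs before the slippery site")
    # Resume one nucleotide back: the last heptamer base is read again.
    return upstream + translate(seq[end - 1:])


# Hypothetical toy message with a frame-0 stop just after the heptamer.
toy = "AUGGCUUUAAACGGGUAACAUGAAA"
print(translate(toy.replace("U", "T")))
print(translate_with_minus1_shift(toy))
```

Without the shift, the toy message terminates shortly after the heptamer; with the shift, translation reads past that stop codon in the –1 frame, which is the sense in which ORF1a and ORF1b are expressed as a single polyprotein.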

    Analysis of the predicted structural proteins of SARS-CoV. The structural proteins of coronaviruses (S, E, M, and N) function during host cell entry and virion morphogenesis and release (20). During virion assembly, N binds to a defined packaging signal on viral RNA, leading to the formation of the helical nucleocapsid. M is localized at specialized intracellular membrane structures, and interactions between the M and E proteins and nucleocapsids result in budding through the membrane. In some group 2 coronaviruses, the C terminus of M interacts with the nucleocapsid to form a core structure (21). The S protein is incorporated into the viral envelope, again by interaction with M, and mature virions are released from smooth vesicles (22). Bands corresponding to the predicted N and S proteins of SARS-CoV were visible in preparations of purified virions that were analyzed by SDS–polyacrylamide gel electrophoresis; however, the assignment of other proteins in virions awaits the availability of specific antibodies to identify these viral proteins (fig. S4).

    The S proteins of coronaviruses are large type-I membrane glycoproteins that are responsible both for binding to receptors on host cells and for membrane fusion. The S proteins of some coronaviruses are cleaved into S1 and S2 subunits. S proteins also contain important virus-neutralizing epitopes, and amino acid changes in the S proteins can dramatically affect the virulence and in vitro host cell tropism of the virus (23, 24). Because of the low level of similarity (20 to 27% pairwise amino acid identity) between the predicted amino acid sequence of the S protein of SARS-CoV and the S proteins of other coronaviruses (Fig. 2B and fig. S1A), the comparison of primary amino acid sequences does not provide insight into the receptor-binding specificity or antigenic properties of SARS-CoV.

    The S protein of SARS-CoV has 23 potential N-linked glycosylation sites (table S2). Functional motifs at the amino (N) and carboxyl (C) termini of the S protein that are conserved among the coronaviruses are also present in the predicted SARS-CoV S protein, although the S2 domain is more conserved than the S1 domain. The N terminus of the SARS-CoV S protein contains a short type-I signal sequence composed of hydrophobic amino acids that are presumably removed during cotranslational transport through the endoplasmic reticulum. The C terminus, consisting of a transmembrane domain and a cytoplasmic tail rich in cysteine residues, is highly conserved in SARS-CoV (Fig. 3). At 52 amino acids in length, the SARS-CoV S protein is predicted to have the shortest transmembrane domain and cytoplasmic tail of any coronavirus analyzed (Fig. 3) (range, 61 to 74 amino acids).

    Fig. 3.

    Conserved motifs in coronavirus S proteins. Alignment of the C-terminal region of the SARS-CoV and reference coronavirus S proteins was generated with ClustalX 1.83. Residues that match the SARS-CoV sequence exactly are boxed. The membrane-spanning domain and cytoplasmic tails are delineated with arrows. The amino acid sequence Y(V/I)KWPW(Y/W)VWL (26) is a conserved motif in all three coronavirus groups. The cysteine-rich region, which overlaps the membrane-spanning region and the cytoplasmic region, is also found in all coronavirus groups.

    The current paradigm of protein-mediated membrane fusion proposes the collapse of alpha-amphipathic regions in the C half of the coronavirus S protein into coiled coils, thus bringing a fusion peptide toward the transmembrane domain, resulting in cellular and viral membrane fusion. Two or three alpha-amphipathic regions are predicted for the C half of coronavirus S proteins. An alpha-amphipathic region of 116 amino acids was predicted with high confidence at positions 884 to 999 of the SARS-CoV S protein (fig. S2). Syncytia formation, however, is not a prominent feature of SARS-CoV infection of Vero cells (5). The SARS-CoV S protein lacks the basic amino acid cleavage site found in group 2 and group 3 coronaviruses (25), suggesting that the SARS-CoV S protein is probably not cleaved into S1 and S2 subunits.

    Although overall sequence conservation is low (Fig. 2B), the predicted E, M, and N proteins of SARS-CoV contain conserved motifs that are found in other coronaviruses. Consistent with the E proteins of other coronaviruses, the predicted E protein of SARS-CoV contains a hydrophobic domain (residues 12 to 37) flanked by charged residues and followed by a cysteine-rich region. The N-terminal domains of coronavirus M proteins are exposed on the viral surface, whereas the C terminus is inside the viral membrane. Most coronavirus M proteins, including the predicted M protein of SARS-CoV, contain three hydrophobic transmembrane domains in the N-terminal half of the protein, although some viruses have four. A highly conserved amino acid sequence [SwWSFNPE (26)], immediately following the third hydrophobic domain, is SMWSFNPE in the SARS-CoV M protein. The M proteins of coronaviruses are invariably glycosylated near the N terminus. Group 1 and group 3 coronaviruses are N-glycosylated, whereas those of group 2 viruses are O-glycosylated (27, 28). The predicted M protein of SARS-CoV has an NGT near its N terminus, suggesting that this protein is N-glycosylated at position 4.
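Potential N-linked glycosylation sites such as the NGT noted above are commonly predicted by scanning for the sequon N-X-S/T, where X is any residue except proline. The sketch below applies that commonly used rule; the criteria actually used for the counts reported in this paper are not stated here, and the peptide is a hypothetical toy string rather than the SARS-CoV M protein sequence.

```python
import re

# Lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glycosylation_sites(protein):
    """Return 1-based positions of Asn residues in N-X-S/T sequons (X not Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Hypothetical toy peptide: NGT at position 4 qualifies, NPS at position 13 does not.
toy_peptide = "MADNGTITVEELNPSA"
print(n_glycosylation_sites(toy_peptide))
```

A sequon match indicates only a potential site; whether a given asparagine is glycosylated has to be confirmed experimentally.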

    The predicted N protein of SARS-CoV is a highly charged basic protein of 422 amino acids (range for other coronaviruses, 377 to 454) with seven successive hydrophobic residues near the middle of the protein. Although the overall amino acid sequence homology among coronavirus N proteins is low (Fig. 2B), a highly conserved motif [FYYL-GTGP (26)] occurs in the N-terminal half of all coronavirus N proteins, including that of SARS-CoV. Other conserved residues occur near this highly conserved motif (fig. S3).

    Conclusion. The completion of the genomic sequence of SARS-CoV provides a first look at the molecular characteristics of this virus and clearly demonstrates that this virus has features typical of a coronavirus, while it also has features that distinguish it from all previously sequenced coronaviruses. Relative to other coronaviruses, no significant major genomic rearrangements or any examples of large insertions or deletions in the genes coding for the replicase, S, E, M, or N proteins were found. Like some other coronaviruses, SARS-CoV has several small nonstructural ORFs that are found between the genes for S and E and between the genes for M and N. SARS-CoV is a novel virus that is phylogenetically distinct from other characterized coronaviruses. The genetic distance between SARS-CoV and any other coronavirus in all gene regions implies that no large part of the SARS-CoV genome was derived from other known viruses. The SARS-CoV genomic sequence does not provide obvious clues concerning the potential animal origins of this pathogen.

    The genome of SARS-CoV has several unique features that could be of biological significance. The short anchor of the S protein, the specific number and location of small ORFs, and the presence of only one copy of the PLPpro provide a combination of genetic features that readily differentiate this virus from previously described coronaviruses. Of course, the significance of any of these features remains to be determined experimentally.

    Successful control of the global SARS epidemic will require the development of vaccines and antiviral compounds that effectively prevent or treat this disease, as well as rapid and sensitive diagnostic tests to monitor its spread. The availability of complete genomic sequences (table S3) (29) of SARS-CoV in just a few weeks after the discovery of the virus should have an immediate impact on disease control efforts by making it possible to develop improved diagnostic tests, vaccines, and antiviral agents. The sequence information will also make it possible to identify the origin and natural reservoir of this virus and to contribute to studies of the immune response to this virus and the pathogenesis of SARS-CoV–related disease. The stage is set for the international scientific community to respond and to rapidly develop the tools to control this emerging infectious disease.

    Supporting Online Material

    www.sciencemag.org/cgi/content/full/1085952/DC1

    Materials and Methods

    Figs. S1 to S4

    Tables S1 to S3
