Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape


Science  03 Feb 2021:
DOI: 10.1126/science.abf6950


Zoonotic pandemics, such as that caused by SARS-CoV-2, can follow the spillover of animal viruses into highly susceptible human populations. The descendants of these viruses have adapted to the human host and evolved to evade immune pressure. Coronaviruses acquire substitutions more slowly than other RNA viruses because of a proofreading polymerase. In the spike glycoprotein, we find that recurrent deletions overcome this slow substitution rate. Deletion variants arise in diverse genetic and geographic backgrounds, transmit efficiently, and are present in novel lineages, including those of current global concern. They frequently occupy recurrent deletion regions (RDRs), which map to defined antibody epitopes. Deletions in RDRs confer resistance to neutralizing antibodies. By altering stretches of amino acids, deletions appear to accelerate SARS-CoV-2 antigenic evolution and may, more generally, drive adaptive evolution.

SARS-CoV-2 emerged from a yet-to-be-defined animal reservoir and initiated a pandemic in 2020 (1-5). It has acquired limited adaptations, most notably the D614G substitution in the spike (S) glycoprotein (6-8). Humoral immunity to S glycoprotein appears to be the strongest correlate of protection (9), and recently approved vaccines deliver this antigen by immunization. Coronaviruses like SARS-CoV-2 slowly acquire substitutions due to a proofreading RNA-dependent RNA polymerase (RdRp) (10, 11). Other emerging respiratory viruses have produced pandemics followed by endemic human-to-human spread. The latter is often contingent upon the introduction of antigenic novelty that enables reinfection of previously immune individuals. Whether SARS-CoV-2 S glycoprotein will evolve altered antigenicity, or specifically how it may change in response to immune pressure, remains unknown. We and others have reported the acquisition of deletions in the amino (N)-terminal domain (NTD) of the S glycoprotein during long-term infections of often-immunocompromised patients (12-15). We have identified this as an evolutionary pattern defined by recurrent deletions that alter defined antibody epitopes. Unlike substitutions, deletions cannot be corrected by proofreading activity, and this may accelerate adaptive evolution in SARS-CoV-2.

An immunocompromised cancer patient infected with SARS-CoV-2 was unable to clear the virus and succumbed to the infection 74 days after COVID-19 diagnosis (15). Treatment included remdesivir, dexamethasone, and two infusions of convalescent serum. We designate the individual as Pittsburgh long-term infection 1 (PLTI1). We consensus sequenced and cloned S genes directly from clinical material obtained 72 days after COVID-19 diagnosis and identified two variants with deletions in the NTD (Fig. 1A).

Fig. 1 Deletions in SARS-CoV-2 spike arise during persistent infections of immunosuppressed patients.

(A) Top. Sequences of viruses isolated from PLTI1 (PT) and viruses from patients with deletions in the same NTD region. Chromatograms are shown for sequences from PLTI1, which include sequencing of bulk reverse transcription products (CON) and individual cDNA clones. Bottom. Sequences from other long-term infections from individuals AM (18), MA-JL (MA) (19), and an MSK cohort (M) with individuals 2, 3, 4, 6, 8, 11, and 13 (13). Letters (A and B) designate different variants from the same patient. (B) Sequences of viruses from two patients with deletions in a different region of the NTD. All sequences are aligned to reference sequence (REF) MN985325 (WA-1). Genetic analysis of patient isolates is in fig. S1.

These data from PLTI1 and a similar report (12) prompted us to interrogate patient metadata sequences deposited in GISAID (16). In searching for similar viruses, we identified eight patients with deletions in the S glycoproteins of viruses sampled longitudinally over a period of weeks to months (Fig. 1A and fig. S1A). For each, early time points had intact S sequences and later time points had deletions within the S gene. Six had deletions that were identical to, overlapping with, or adjacent to those in PLTI1. Deletions at a second site were present in viruses isolated from two other patients (Fig. 1B); reports on these patients have since been published (13, 14). Viruses from all but one patient could be distinguished from one another by nucleotide differences present at both early and late time points (fig. S1B). On a tree of representative contemporaneously circulating isolates, they form monophyletic clades, making a second community-acquired or nosocomially acquired infection unlikely (fig. S1C). The most parsimonious explanation is that these deletions arose independently due to a common selective pressure, producing strikingly convergent outcomes.
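The comparisons above depend on aligning each patient sequence to a reference and reading off the gapped stretches. The following is a minimal sketch of that step, not the authors' pipeline: it assumes the two rows of a pairwise alignment are already available as strings with `-` marking gaps, and the function name `find_deletions` and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def find_deletions(ref_aln: str, query_aln: str) -> List[Tuple[int, int]]:
    """Return (start, length) for each deletion in the query.

    Coordinates are 1-based positions on the ungapped reference.
    Both inputs are rows of the same pairwise alignment, '-' = gap.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")

    deletions: List[Tuple[int, int]] = []
    ref_pos = 0        # reference bases consumed so far
    run_start = None   # reference position where the open gap run began
    run_len = 0

    for r, q in zip(ref_aln, query_aln):
        if r == "-":
            # Insertion relative to the reference: no reference base is
            # consumed, and an open deletion run is left open (simplification).
            continue
        ref_pos += 1
        if q == "-":
            if run_start is None:
                run_start, run_len = ref_pos, 0
            run_len += 1
        elif run_start is not None:
            deletions.append((run_start, run_len))
            run_start = None

    if run_start is not None:  # deletion running to the end of the alignment
        deletions.append((run_start, run_len))
    return deletions


# Toy example (made-up sequences, not SARS-CoV-2 data)
toy_ref = "ATGCATGCAT"
toy_query = "ATG---GCAT"
toy_deletions = find_deletions(toy_ref, toy_query)
```

Comparing the returned intervals between early and late samples from one patient would show whether a deletion was acquired during the infection.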

We searched the GISAID sequence database (16) for additional instances of deletions within S glycoproteins. From a dataset of 146,795 sequences (deposited from 12/01/2019 to 10/24/2020) we identified 1,108 viruses with deletions in the S gene. When mapped to the S gene, 90% occupied four discrete sites within the NTD (Fig. 2A). We term these sites recurrent deletion regions (RDRs), numbering them 1-4 from the 5′ to 3′ end of the S gene. Deletions identified in patient samples correspond to RDR2 (Fig. 1A) and RDR4 (Fig. 1B). Most deletions appear to have arisen and been retained in replication-competent viruses. Without selective pressure, in-frame deletions should occur one third of the time. However, we observed a preponderance of in-frame deletions with lengths of 3, 6, 9, and 12 nucleotides (Fig. 2B). Among all deletions, 93% are in frame and do not produce a stop codon (Fig. 2C). In the NTD, >97% of deletions maintain the open reading frame. Other S glycoprotein domains do not follow this trend; for example, deletions in the receptor binding domain (RBD) and S2 preserve the reading frame 30% and 37% of the time, respectively.
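The reading-frame argument can be made concrete with a short calculation. The sketch below is illustrative only and uses made-up deletion lengths rather than the GISAID data: it counts deletions whose length is a multiple of three and computes how likely such an excess would be if each deletion were in frame with probability 1/3. It checks length alone, so it does not detect a stop codon created at the deletion junction. The names `frame_summary` and `binomial_tail` are hypothetical.

```python
from math import comb
from typing import Dict, Iterable


def frame_summary(lengths: Iterable[int]) -> Dict[str, float]:
    """Summarize how many deletion lengths preserve the reading frame."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no deletion lengths supplied")
    in_frame = sum(1 for n in lengths if n % 3 == 0)
    return {
        "n": len(lengths),
        "in_frame": in_frame,
        "fraction_in_frame": in_frame / len(lengths),
        "null_expectation": 1 / 3,  # expected fraction without selection
    }


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy lengths in nucleotides (assumed values for illustration)
toy_lengths = [3, 6, 9, 12, 1, 2, 3, 6]
summary = frame_summary(toy_lengths)
p_value = binomial_tail(int(summary["in_frame"]), int(summary["n"]), 1 / 3)
```

With real data, the same comparison could be run per domain (NTD, RBD, S2) to contrast the in-frame fractions reported above.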

Fig. 2 Identification and characterization of recurrent deletion regions in SARS-CoV-2 spike protein.

(A) Positional quantification of deleted nucleotides in S among GISAID sequences. We designate the four clusters recurrent deletion regions (RDRs) 1-4. (B) Length distribution of deletions. (C) The percentage of deletion events at the indicated site that either maintain the open reading frame or introduce a frameshift or premature stop codon (F.S./Stop). (D) Phylogenetic analysis of deletion variants (red branches) and genetically diverse non-deletion variants (black branches). Specific deletion clades/lineages are identified. Maximum likelihood phylogenetic trees, rooted on NC_045512, were calculated with 1000 bootstrap replicates. Trees with branch labels are in fig. S2. (E) Abundance of nucleotide deletions in each RDR. Positions are defined by reference sequence MN985325, by codon (top) and nucleotide (below).

To trace the origins of RDR variants, we produced phylogenies for each with 101 additional genomes that sample much of the genetic diversity within the pandemic (Fig. 2D). The RDR variants interleave with non-deletion sequences and occupy distinct branches, indicating their recurrent generation. This is most pronounced for RDRs 1, 2 and 4 but also true of RDR 3, with conservatively four independent instances. RDR variants form distinct lineages/branches, most prominently in RDR1 (lineage B.1.258) and suggest human-to-human transmission events. We verified, using sequences with sufficient metadata that explicitly differentiate individuals, the transmission of a variant within each RDR between people (fig. S2).

We defined the RDRs based upon peaks in the spectrum of S glycoprotein deletions. Deletion lengths and positions vary within RDRs 1, 2, and 4 (Fig. 2E). Variation is greatest in RDRs 2 and 4, with the loss of S glycoprotein residues 144/145 (adjacent tyrosine codons) in RDR2 and 243-244 in RDR4 appearing to be favored. In contrast, the loss of residues 69-70 accounts for the vast majority of RDR1 deletions. Based upon our phylogenetic analysis, and supported by accompanying lineage classifications, this two-amino-acid deletion has arisen independently at least thirteen times. RDR3 largely consists of three-nucleotide (nt) deletions in codon 210.
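One way to picture "peaks in the spectrum of deletions" is to count how many sequences lack each nucleotide and then merge nearby high-count positions into regions. The sketch below is an assumption-laden illustration, not the procedure used in the paper: the thresholds `min_count` and `max_gap`, the function names, and the toy intervals are all hypothetical, and positions are taken as 1-based offsets within the S gene, codon-aligned for simplicity.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def deletion_coverage(deletions: Iterable[Tuple[int, int]]) -> Counter:
    """Count, per nucleotide position, how many deletions span it.

    Each deletion is (start, length) in 1-based S-gene coordinates.
    """
    coverage: Counter = Counter()
    for start, length in deletions:
        for pos in range(start, start + length):
            coverage[pos] += 1
    return coverage


def call_regions(coverage: Counter, min_count: int = 2,
                 max_gap: int = 15) -> List[Tuple[int, int]]:
    """Merge positions with count >= min_count into (first, last) regions.

    Positions closer than or equal to max_gap nucleotides are joined.
    Both thresholds are arbitrary illustrative choices.
    """
    hot = sorted(p for p, c in coverage.items() if c >= min_count)
    regions: List[Tuple[int, int]] = []
    for pos in hot:
        if regions and pos - regions[-1][1] <= max_gap:
            regions[-1] = (regions[-1][0], pos)
        else:
            regions.append((pos, pos))
    return regions


def nt_to_codon(pos: int) -> int:
    """Convert a 1-based nucleotide offset in the gene to a codon number."""
    return (pos - 1) // 3 + 1


# Toy deletions: three copies spanning codons 69-70, two at codon 144,
# and one isolated single-nucleotide deletion (all made up).
toy = [(205, 6)] * 3 + [(430, 3)] * 2 + [(900, 1)]
regions = call_regions(deletion_coverage(toy))
codon_spans = [(nt_to_codon(a), nt_to_codon(b)) for a, b in regions]
```

The isolated single deletion falls below `min_count` and is not called as a region, which mirrors the idea that RDRs are defined by recurrence rather than by any single event.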

We evaluated the genetic, geographic, and temporal sampling of RDR variants (Fig. 3, A and B). This analysis is limited to sequences deposited in GISAID (16), where sequences from specific nations and regions are overrepresented, e.g., the United Kingdom and other European countries. We show the distribution of all sequences within the database for reference. For RDRs 2 and 4 the genetic and geographic distributions largely mirror those of reported sequences. Variants of RDRs 1 and 3 are strongly polarized to specific clades and geographies. This is likely the result of successful lineages circulating in regions with strong sequencing initiatives. Our temporal analysis indicates that RDR variants have been present throughout the pandemic (Fig. 3C). Specific variant lineages like B.1.258 (Fig. 2D) harboring Δ69-70 in RDR1 have rapidly risen to notable abundance (Fig. 3D). Circulation of B.1.36 with RDR3 Δ210 accounts for most of the RDR3 examples (Fig. 2D and Fig. 3, C and D). The abundance of RDR2 Δ144/145 is explained by independent deletion events followed by transmission (Fig. 2D and Fig. 3, C and D).
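The temporal frequencies described above amount to dividing, for each time bin, the number of deposited genomes carrying a deletion by the number of all deposited genomes. A minimal sketch follows, assuming each record is a collection date in ISO format paired with a flag for whether the genome carries the deletion of interest. The record layout, the monthly binning, and the function name `monthly_frequency` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def monthly_frequency(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of genomes per month that carry the deletion.

    Each record is (collection date as 'YYYY-MM-DD', has_deletion).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for date, has_deletion in records:
        month = date[:7]  # 'YYYY-MM'
        totals[month] += 1
        if has_deletion:
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy records (made up, not GISAID data)
toy_records = [
    ("2020-03-01", False),
    ("2020-03-15", True),
    ("2020-04-02", False),
    ("2020-04-10", False),
    ("2020-04-20", False),
    ("2020-04-25", True),
]
toy_freq = monthly_frequency(toy_records)
```

Because the denominator is the set of deposited sequences, such frequencies inherit the geographic sampling bias noted above.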

Fig. 3 Geographic, genetic, and temporal abundance of RDR variants.

(A and B) Geographic (A) and genetic (B) distributions of RDR variants compared to the GISAID database (sequences from 12-1-2019 to 10-24-2020). GISAID clade classifications are used in (B). (C) Frequency of RDR variants among all complete genomes deposited in GISAID. (D) Frequency of specific RDR deletion variants (numbered according to spike amino acids) among all GISAID variants. The plot of RDR3/Δ210 has been adjusted by 0.02 units on the Y-axis for visualization in (C) due to its overlap with RDR2 and this adjustment has been retained in (D) to make direct comparisons between panels.

The recurrence and convergence of RDR deletions, particularly during long-term infections, is indicative of adaptation in response to a common selective pressure. RDRs 2 and 4 and RDRs 1 and 3 occupy two distinct surfaces on the S glycoprotein NTD (Fig. 4A). Both sites contain antibody epitopes (17-19). The epitope for neutralizing antibody 4A8 is formed entirely by the beta sheets and extended connecting loops that harbor RDRs 2 and 4 (17). We generated a panel of S glycoprotein mutants representing the four RDRs to assess the impact deletions have on expression and antibody binding. We included an additional double mutant containing the deletions present in the B.1.1.7 variant of concern, flagged initially in the United Kingdom. Cells were transfected with plasmids expressing these mutant glycoproteins, and indirect immunofluorescence was used to determine if RDR deletions modulated 4A8 binding (Fig. 4B). Deletions at RDRs 1 and 3 had no impact on the binding of the monoclonal antibody, confirming that they alter independent sites. The three RDR2 deletions, the one RDR4 deletion, and the double RDR1/2 deletions completely abolished binding of 4A8 while still allowing recognition by a monoclonal antibody targeting the RBD (Fig. 4B). Thus, convergent evolution operates in individual RDRs and between RDRs, exemplified by the same phenotype produced by deletions in RDR2 or RDR4.

Fig. 4 Deletions in the spike NTD alter its antigenicity; RDRs map to defined antigenic sites.

(A) Top: A structure of antibody 4A8 (17) (PDB: 7C2L) (purples) bound to one protomer (green) of a SARS-CoV-2 spike trimer (grays). RDRs 1-4 are colored red, orange, blue, and yellow, respectively, and shown in spheres. The interaction site is shown at right. Bottom: The electron microscopy density of COV57 serum Fabs (18) (EMDB emd_22125) fit to SARS-CoV-2 S glycoprotein trimer (PDB: 7C2L). The same view of the interaction site is provided at right. (B) S glycoprotein distribution in Vero E6 cells at 24 h post-transfection with S protein deletion mutants, visualized by indirect immunofluorescence in permeabilized cells. A monoclonal antibody against SARS-CoV-2 S protein receptor-binding domain (RBD MAb; red) detects all mutant forms of the protein (Δ69-70, Δ69-70+Δ141-144, Δ141-144, Δ144/145, Δ146, Δ210, and Δ243-244) and the unmodified protein (wild-type). 4A8 monoclonal antibody (4A8 MAb; green) does not detect mutants containing deletions in RDR2 or RDR4 (Δ69-70+Δ141-144, Δ141-144, Δ144/145, Δ146, and Δ243-244). Overlay images (RBD/4A8/DAPI) depict co-localization of the antibodies; nuclei were counterstained with DAPI (blue). The scale bars represent 100 μm. (C) Virus isolated from PLTI1 resists neutralization by 4A8. A non-deletion variant (Munich) is neutralized by 4A8; both are neutralized by convalescent serum, and neither is neutralized by an influenza hemagglutinin-binding antibody, H2214 (29).

We assayed whether RDR variants escape the activity of a neutralizing antibody using the non-plaque-purified viral population from PLTI1. This viral stock was completely resistant to neutralization by 4A8, while an isolate with authentic RDRs (20) was neutralized (Fig. 4C). We used a high-titer neutralizing human convalescent polyclonal antiserum to demonstrate that both viral stocks could be neutralized efficiently. These data demonstrate that naturally arising and circulating variants of SARS-CoV-2 have altered antigenicity. We used a range of high-, medium-, and low-titer neutralizing human convalescent polyclonal antisera to assess whether there was an appreciable difference in neutralization between the S glycoprotein-deleted and undeleted viruses. No major difference was observed, suggesting that many more changes would be required to generate serologically distinct SARS-CoV-2 variants (table S1).

Coronaviruses, including SARS-CoV-2, have lower substitution rates than other RNA viruses due to an RdRp with proofreading activity (10, 11). However, proofreading cannot correct deletions. We find that adaptive evolution of S glycoprotein is augmented by a tolerance for deletions, particularly within RDRs. The RDRs occupy defined antibody epitopes within the NTD (17-19), and deletions at multiple sites confer resistance to a neutralizing antibody. Deletions represent a generalizable mechanism through which the SARS-CoV-2 S glycoprotein rapidly acquires genetic and antigenic novelty.

Fitness of RDR variants is evident from their representation in the consensus genomes from patients, transmission between individuals, and presence in emergent lineages. Initially documented in the context of long-term infections of immunosuppressed patients, specific variants transmit efficiently between immunocompetent individuals. Characterization of unique cases led to the very early identification of RDR variants that are escape mutants. Since deletions are a product of replication, they will occur at a certain rate, and variants are likely to emerge in otherwise healthy populations. Indeed, influenza explores variation that approximates future antigenic drift in immunosuppressed patients (21).

The RDRs occupy defined antibody epitopes within the S glycoprotein NTD. Selected in vivo, these deletion variants resist neutralization by monoclonal antibodies. Viruses cultured in vitro in the presence of immune serum have also acquired substitutions in RDR2 that confer neutralization resistance (22). Potent neutralizing responses and an array of monoclonal antibodies are directed to the RBD (18, 19, 23). A growing number of NTD-directed antibodies have been identified (24, 25). That antibody escape in nature is nonetheless most evident in the NTD is a discrepancy that requires further study.

Defining recurrent, convergent patterns of adaptation can provide predictive potential. From viral sequences, we identify a pattern of deletions, contextualize their outcomes in protein structure and antibody epitope(s), and characterize their functional impact on antigenicity. During evaluation of this manuscript, multiple lineages with altered antigenicity and perhaps increased transmissibility have emerged and spread. These variants of global concern are RDR variants and include Mink Cluster 5 Δ69-70 (26), B.1.1.7 Δ69-70 and Δ144/145 (27), and B.1.351 Δ242-244 (28). Our analysis preceded the description of these lineages. We had demonstrated that identical or similar recurrent deletions that alter positions 144/145 and 243-244 in the S glycoprotein disrupt binding of antibody 4A8, which defines an immunodominant epitope within the NTD. Our survey for deletion variants captured the first representative of what would become the B.1.1.7 lineage. These real-world outcomes demonstrate the predictive potential of this and similar approaches and show the need to monitor viral evolution carefully and continually.

Additional circulating RDR variants have gone virtually unnoticed. Are they intermediates on a pathway of immune evasion? That remains to be determined. However, deletions and substitutions within major NTD and RBD epitopes will likely continue to contribute to that process, as they already have in current variants of concern. The progression of adaptations in immunocompromised patients and in concerning SARS-CoV-2 variants alike remains to be resolved. Their evolution has thus far converged. Recurrence of adaptations in single patients and on global scales underscores the need to track and monitor deletion variants.

Supplementary Materials

Materials and Methods

Figs. S1 to S3

Tables S1 and S2

References (30-34)

MDAR Reproducibility Checklist

This is an open-access article distributed under the terms of the Creative Commons Attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References and Notes

Acknowledgments: We gratefully acknowledge the authors from the originating laboratories and the submitting laboratories, who generated and shared via GISAID genetic sequence data on which this research is based (table S2). We thank Stephen C. Harrison for his support. We thank Dr. Alison Morris, Dr. Bryan McVerry, Dr. Georgios Kitsios, Dr. Barbara Methe, Heather Michael, Michelle Busch, John Ries, and Caitlin Schaefer at the University of Pittsburgh, as well as the physicians, nurses, and respiratory therapists at the University of Pittsburgh Medical Center Shadyside-Presbyterian Hospital intensive care units for assistance with collection and processing of the endotracheal aspirate sample. This work was conducted under University of Pittsburgh Review Board approval CR19050099-009. Author contributions: K.R.M., L.J.R., S.N., L.R.R.M., and W.P.D. designed the experiments. K.R.M., L.J.R., S.N., and L.R.R.M. performed the experiments. K.R.M., L.J.R., S.N., L.R.R.M., and W.P.D. analyzed data. W.G.B. and G.H. provided reagents and samples. K.R.M., L.J.R., S.N., L.R.R.M., and W.P.D. wrote the manuscript. Competing interests: The authors declare no competing interests. Funding: This work was supported by The University of Pittsburgh, the Center for Vaccine Research, The Richard King Mellon Foundation, the Hillman Family Foundation (WPD), and UPMC Immune Transplant and Therapy Center (WGB, GH). Data availability: Sequences from PLTI1 were deposited in NCBI GenBank under accession numbers MW269404 and MW269555. All other sequences are available via the GISAID SARS-CoV-2 sequence database (see table S2 for full list of Acknowledgments). This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This license does not apply to figures/photos/artwork or other content included in the article that is credited to a third party; obtain authorization from the rights holder before using such material.
