## Abstract

Estimating an epidemic’s trajectory is crucial for developing public health responses to infectious diseases, but case data used for such estimation are confounded by variable testing practices. We show that the population distribution of viral loads observed under random or symptom-based surveillance, in the form of cycle threshold (Ct) values obtained from reverse-transcription quantitative polymerase chain reaction testing, changes during an epidemic. Thus, Ct values from even limited numbers of random samples can provide improved estimates of an epidemic’s trajectory. Combining data from multiple such samples improves the precision and robustness of such estimation. We apply our methods to Ct values from surveillance conducted during the SARS-CoV-2 pandemic in a variety of settings and offer alternative approaches for real-time estimates of epidemic trajectories for outbreak management and response.

Real-time tracking of the epidemic trajectory and infection incidence is fundamental for public health planning and intervention during a pandemic (*1*, *2*). In the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) pandemic, key epidemiological parameters such as the effective reproductive number, *R _{t}*, have typically been estimated using the time-series of observed case counts, hospitalizations, or deaths, usually based on reverse-transcription quantitative polymerase chain reaction (RT-qPCR) testing. However, limited testing capacities, changes in test availability over time, and reporting delays all influence the ability of routine testing to detect underlying changes in infection incidence (

*3*–

*5*). The question of whether changes in case counts at different times reflect epidemic dynamics or simply changes in testing have economic, health, and political ramifications.

RT-qPCR tests provide semiquantitative results in the form of cycle threshold (Ct) values, which are inversely correlated with log_{10} viral loads, but they are often reported only as binary “positives” or “negatives” (*6*, *7*). It is common when testing for other infectious diseases to use this quantification of sample viral load, for example, to identify individuals with higher clinical severity or transmissibility (*8*–*11*). For SARS-CoV-2, Ct values may be useful in clinical determinations about the need for isolation and quarantine (*7*, *12*), identifying the phase of an individual’s infection (*13*, *14*), and predicting disease severity (*14*, *15*). However, individual-level decision making based on Ct values has not been widely implemented owing to measurement variation across testing platforms and samples and limited understanding of SARS-CoV-2 viral kinetics in asymptomatic and presymptomatic infections. Although a single high Ct value may not guarantee a low viral load in one specimen, for example because of variable sample collection, measuring high Ct values in many samples will indicate a population with predominantly low viral loads. Cross-sectional distributions of Ct values should therefore represent viral loads in the underlying population over time, which may coincide with changes in the epidemic trajectory. For example, a systematic increase in the distribution of quantified Ct values has been noted alongside epidemic decline (*12*, *14*, *16*).

Here, we demonstrate that Ct values from single or successive cross-sectional samples of RT-qPCR data can be used to estimate the epidemic trajectory without requiring additional information from test positivity rates or serial case counts. We demonstrate that population-level changes in the distribution of observed Ct values can arise as an epidemiological phenomenon, with implications for interpreting RT-qPCR data over time in light of emerging SARS-CoV-2 variants. We additionally demonstrate how multiple cross-sectional samples can be combined to improve estimates of population incidence, a measure which is often elusive without serological surveillance studies. Collectively, we provide metrics for monitoring outbreaks in real-time—using Ct data that is collected but currently usually discarded—and our methods motivate the development of testing programs intended for outbreak surveillance.

## Relationship Between Observed Ct Values and Epidemic Dynamics

First, we show that the interaction of within-host viral kinetics and epidemic dynamics can drive changes in the distribution of Ct values over time—without a change in the underlying pathogen kinetics. That is, population-level changes in Ct value distributions can occur without systematic changes in underlying post-infection viral load trajectories at the individual level. To demonstrate the epidemiological link between transmission rate and measured viral loads or Ct values, we first simulated infections arising under a deterministic susceptible-exposed-infectious-recovered (SEIR) model (Fig. 1A and Materials and Methods: Epidemic Transmission Models). Parameters used are supplied in table S1. At selected testing days during the outbreak, simulated Ct values are observed from a random cross-sectional sample of the population using the Ct distribution model described in Materials and Methods: Ct Value Model and shown in figs. S1 and S2. By drawing simulated samples for testing from the population at specific time points, these simulations recreate realistic cross-sectional distributions of detectable viral loads across the course of an epidemic. Throughout, we assume everyone is infected at most once, ignoring re-infections as these appear to be a negligible portion of infections in the epidemic so far (*17*).

Early in the epidemic, infection incidence grows rapidly, and thus most infections arise from recent exposures. As the epidemic wanes, however, the average time elapsed since exposure among infected individuals increases as the rate of new infections decreases (Fig. 1, B and E) (*18*). This is analogous to the average age being lower in a growing versus declining population (*19*). Although infections are usually unobserved events, we can rely on an observable quantity, such as viral load, as a proxy for the time since infection. Since Ct values change asymmetrically over time within infected hosts (Fig. 1C) with peak viral load occurring early in the infection, random sampling of individuals during epidemic growth is more likely to sample recently infected individuals in the early phase of their infection and therefore with higher quantities of viral RNA. Conversely, randomly sampled infected individuals during epidemic decline are more likely to be in the later phase of infection, typically sampling lower quantities of viral RNA, although there is substantial sampling and viral load variability at all time points (Fig. 1D). The overall distribution of observed Ct values under randomized surveillance testing therefore changes over time, as measured by the median, quartiles, and skewness (Fig. 1G). While estimates for an individual’s time since infection based on a single Ct value will be highly uncertain, the population-level distribution of observed Ct values will vary with the growth rate, and therefore *R _{t}*, of new infections (Fig. 1, F and H).

To summarize this key observation in the context of classic results, we find that fast-growing epidemiologic populations (*R _{t}* > 1 and growth rate

*r*> 0) will have a predominance of new infections and thus of high viral loads, and shrinking epidemics (

*R*< 1,

_{t}*r*< 0) will have more older infections and thus low viral loads at a given cross section, where the relationship between

*R*and

_{t}*r*is modulated by the distribution of generation intervals (

*20*). Similar principles have been applied to serologic data to infer unobserved individual-level infection events (

*16*,

*21*–

*23*) and population-level parameters of infectious disease spread (

*21*,

*24*–

*28*).

We find that this phenomenon might also be present, though less pronounced, among Ct values obtained under symptom-based surveillance, where individuals are identified and tested following symptom onset. Similar to the case of random surveillance testing, Ct values obtained through testing recently symptomatic individuals are predicted to be lower (i.e., viral loads are higher) during epidemic growth than during epidemic decline (figs. S3 and S4). However, defining the exact nature and strength of this relationship will depend on a number of conditions being met (see fig. S4 caption).

By modeling the variation in observed Ct values arising from individual-level viral growth/clearance kinetics and sampling errors, the distribution of observed Ct values in a random sample becomes an estimable function of the times since infection, and the expected median and skewness of Ct values at a given point in time are then predictable from the epidemic growth rate. This function can then be used to estimate the epidemic growth rate from a set of observed Ct values. A relationship between Ct values and epidemic growth rate exists under most sampling strategies as described above, though calibrating the precise mapping is necessary to enable inference (e.g., using a different RT-qPCR; see fig. S5). This mapping can be confounded by testing biases arising, for example, from delays between infection and sample collection date when testing capacity is limited, or through systematic bias toward samples with higher viral loads such as those from severely ill individuals. Here, we focus on the case of random surveillance testing, where individuals are sampled at a random point in their infection course.

## Inferring the Epidemic Trajectory Using a Single Cross-Section

From these relationships, we derive a method to infer the epidemic growth rate given a single cross-section of randomly sampled RT-qPCR test results. The method combines two models: (1) the probability distribution of observed Ct values (and the probability of a negative result) conditional on the number of days between infection and sampling; and (2) the likelihood of being infected on a given day prior to the sample date. For (1), we use a Bayesian model and define priors for the mode and range of Ct values following infection based on the existing literature (Materials and Methods: Ct Value Model and Single Cross-Section Model). For (2), we initially develop two models to describe the probability of infection over time: (a) constant exponential growth of infection incidence; and (b) infections arising under an SEIR model. Both models provide estimates for the epidemic growth rate but make different assumptions regarding the possible shape of the outbreak trajectory: the exponential growth model assumes a constant growth rate over the preceding five weeks and requires few prior assumptions, whereas the SEIR model assumes that the growth rate changes daily depending on the remaining number of susceptible individuals, but requires more prior information.

To demonstrate the potential of this method with a single cross-section from a closed population, we first investigate how the distribution of Ct values and prevalence of PCR positivity changed over time in four well-observed Massachusetts long-term care facilities that underwent SARS-CoV-2 outbreaks in March and April 2020 (*29*). In each facility, we have the results of near-universal PCR testing of residents and staff from three time points after the outbreak began, including the number of positive samples, the Ct values of positive samples, and the number of negative samples (Materials and Methods: Long-Term Care Facilities Data). To benchmark our Ct value-based estimates of the epidemic trajectory, we first estimated the trajectory using a standard compartmental modeling approach fit to the measured point prevalences over time in each facility (Fig. 2A). Specifically, we fit a simple extended SEIR (SEEIRR) model, with additional exposed and recovered compartments describing the duration of PCR positivity (Materials and Methods: Epidemic Transmission Models), to the three observed point prevalence values from each facility. Because the testing was nearly universal, this approach provides a near ground truth of the epidemic trajectory against which we can evaluate the accuracy of the Ct value-based approaches. We call this the baseline estimate. Figure 2 shows results and data for one of the long-term care facilities, while figs. S6 and S7 show results for the other three.

As time passes, the distribution of observed Ct values at each time point in the long-term care facilities (Fig. 2B) shifts higher (lower viral loads) and becomes more left-skewed. We observed that these shifts tracked with the changing (i.e., declining) prevalence in the facilities. To assess if these changes in Ct value distributions indeed reflected underlying changes in the epidemic growth rate, we fit the exponential growth and simple SEIR models using the Ct likelihood to each individual cross-section of Ct values to get posterior distributions for the epidemic trajectory up to and at that point in time (Fig. 2C). Note that the only facility-specific data for each of these fits were the Ct values and number of negative tests from each single cross-sectional sample. Additional ancillary information included prior distributions for the epidemic seed time (after March 1) and the within-host virus kinetics. To assess the fit, we compare the predicted Ct distribution (Fig. 2B) and point prevalence (Fig. 2D) from each fit to the data and compare the growth rates from these fits to the baseline estimates. Posterior distributions of all Ct value model parameters are shown in fig. S8.

While both sets of results are fitted models and so neither can be considered the truth, we find that the Ct method fit to one cross-section of data provides a similar posterior median trajectory to the baseline estimate, which required three separate point prevalences with near-universal testing at each time point. In particular, the Ct-based models appear to accurately discern whether the samples were taken soon or long after peak infection incidence. Both methods were in agreement over the direction of the past average and recent daily growth rates (i.e., whether the epidemic is currently growing or declining, and whether the growth rate has dropped relative to the past average). The average growth rate estimates were very similar between the prevalence-only and Ct value models at most time points, though the daily growth rate appeared to decline earlier in the prevalence-only compartmental model. These estimates have a great deal of variability, however, and should be interpreted in that context. This is especially clear in fig. S7, where the other facilities exhibit more variability between estimates from the two methods. Overall, these results show that a single cross-section of Ct values can provide similar information to point prevalence estimates from three distinct sampling rounds, when the epidemic trajectory is constrained, as in a closed population.

To ensure that our method provides accurate estimates of the full epidemic curve, we performed extensive simulation-recovery experiments using a synthetic closed population undergoing a stochastic SEIR epidemic. Figure S9 shows the results of one such simulation, demonstrating the information gained from using a single cross-section of virological test data when attempting to estimate the true infection incidence curve at different points during an outbreak. We assess performance using various simulations, including a version of the method that uses only positive Ct values without information on the fraction positive, simulations of different population sizes, and estimation using prior distributions of decreasing strengths; details are in Supplementary Materials: Simulated Long-Term Care Facility Outbreaks and results in figs. S10 to S12.

Although no real long-term care facility data were available to assess the method’s accuracy during the early phase of the epidemic outbreak, the simulation experiments reveal that the method can be used at all stages of an epidemic. Furthermore, although there is a substantial uncertainty in the growth rate estimates, these analyses show that a single-cross section of data can be used to determine whether the epidemic has been recently increasing or decreasing. The posterior probability of growth versus decline can be used for this assessment, acting like a hypothesis test when the credible interval excludes zero or in a broader inferential way if it does not. Although this is a trivial result for SARS-CoV-2 incidence in many settings where cases, hospitalizations or deaths already provide a clear picture of epidemic growth or decline, for locations and future outbreaks where testing capacity is restricted, our results show that a single cross-sectional random sample of a few hundred tested individuals combined with reasonable priors (for example, constraining the epidemic seed time to within a one-to-two-month window) could be used to immediately estimate the stage of an outbreak. Moreover, this inferential method provides the basis for combining cross-sections for multiple testing days.

## Inferring the Epidemic Trajectory Using Multiple Cross-Sections

While a single cross-section of Ct values can reasonably estimate the trajectory of a simple outbreak represented by a compartmental model, more complex epidemic trajectories will require more cross-sections for proper estimation. Here, we extend our method to combine data from multiple cross-sections, allowing us to estimate the full epidemic trajectory more reliably (Materials and Methods: Multiple Cross-Sections Model and Markov Chain Monte Carlo Framework). In many settings, the epidemic trajectory is monitored using reported case counts. Limiting reported cases to those with positive test results, the daily number of new positives can be used to calculate *R _{t}* (

*3*). However, this approach can be obscured when the definition of a case changes during the course of an epidemic (

*30*). Furthermore, such data often represent the growth rate of positive tests, which can change dramatically based on changing test capacity rather than the incidence of infection, requiring careful monitoring and adjustments to account for changes in testing capacity, the delay between infection and test report date, and the conversion from prevalence to incidence. Death counts are also used to estimate the epidemic trajectory, but these are substantially delayed and the relationship between cases and deaths is not stable (

*31*). When, instead, Ct values from surveillance sampling are available, our methods can overcome these limitations by providing a direct mapping between the distribution of Ct values and infection incidence. While case count methods exhibit bias due to changing test rates (

*5*), our method provides a means to estimate

*R*using only one or a few surveillance samples, and this method can accommodate random sampling schemes that increase or decrease over time with test availability.

_{t}To demonstrate the performance of these Ct-based methods, we simulate outbreaks under a variety of testing schemes using SEIR-based simulations and sample Ct values from the outbreaks (Materials and Methods: Simulated Testing Schemes). We compare the performance of *R _{t}* estimation using reported case counts (based on the testing scheme) via the R package

*EpiNow2*(

*32*,

*33*), where reporting depends on testing capacity and the symptom status of infected individuals, to the performance of our methods when one, two, or three surveillance samples are available with observed Ct values, with a total of about 0.3% of the population sampled (3000 tests spread among the samples).

Figure 3 plots the posterior median *R _{t}* from each of the 100 simulations of each method when the epidemic is growing (day 60) and declining (day 88). Except when only one sample is used, the Ct-based methods fitting to an SEIR model exhibit minimal bias, even when the number of tests substantially changes across sample days. For the single sample estimates during the growth phase, the posterior median estimates are shifted above the true value because a range of

*R*values are consistent with the data—the prior density for

_{0}*R*is uniform between 1 and 10 with a median of 5.5, which weights the posterior median higher than the true value. Methods based on reported case counts, on the other hand, consistently exhibit noticeable upward bias when testing rates increase over the observed period and substantial downward bias when testing rates decrease. The Ct-based methods do exhibit higher variability, however. This is captured by the Bayesian inference model, as all of the Ct-based methods achieve at least nominal coverage of the 95% credible intervals among these 100 simulations (fig. S13).

_{0}An alternative approach to estimating *R _{t}* using case counts is to fit a standard compartmental model to the observed proportion of positive tests from a random sample. To demonstrate the value of incorporating Ct values rather than simply using positivity rates from a surveillance sample, we also compare the results to an SEIR model fit to point prevalence observed at the same sample times, assuming PCR positivity represents the infectious stage of the disease. In this alternate method, this misspecification of the SEIR model results in inaccurate

*R*estimates during the decline phase of the simulation (Fig. 3B). While a more accurate model might distinguish the infectious stage and duration of PCR positivity, as in the SEEIRR model, this simple model represents an approach that might be used to infer incidence changes from prevalence data in the absence of a quantified relationship between infection state and PCR positivity.

_{t}We also assessed the precision of our estimates using smaller sample sizes and different deployment of tests among testing days for a given sample size. These comparisons are shown in fig. S13, which also compares the Ct-based method to the positivity-based estimation. The Ct-based method performs well in many cases with sample sizes as low as 200–500 tests. When testing is stable, reported case counts provide a more precise estimate of the trajectory. However, a small number of tests (e.g., the same number of tests as used for one day of routine case detection) devoted to two or three surveillance samples can provide unbiased estimation when reported case counts may be biased.

## Reconstructing Complex Incidence Curves Using Ct Values

Simple epidemic models are useful to understand recent incidence trends when data are sparse or in relatively closed populations where the epidemic start time is approximately known (Supplementary Materials: Epidemic Seed Time Priors). In reality, however, the epidemic usually follows a more complex trajectory which is difficult to model parametrically. For example, the SEIR model does not account for the implementation/relaxation of non-pharmaceutical interventions and behavior changes that affect pathogen transmission unless explicitly specified in the model. For a more flexible approach to estimating the epidemic trajectory from multiple cross-sections, we developed a third model for infection incidence, using a Gaussian process (GP) prior for the underlying daily probabilities of infection (*34*). The GP method provides estimated daily infection probabilities without making strong assumptions about the epidemic trajectory, assuming only that infection probabilities on contemporaneous days are correlated, with decreasing correlation at increasing temporal distances (Supplementary Materials: Epidemic Transmission Models). Movie S1 demonstrates how estimates of the full epidemic trajectory, representing a simulation for the implementation and subsequent relaxation of non-pharmaceutical interventions, can be sequentially updated using this model as new samples become available over time. Movie S2 shows how the precision of the estimated epidemic curve decreases at smaller sample sizes, where 200 samples per week were sufficient to reliably track the epidemic curve. Movie S3 shows how estimation remains accurate if sampling is only initiated partway through the epidemic.

With the objective of reconstructing the entire incidence curve using routinely collected RT-qPCR data, we used anonymized Ct values from positive samples measured from near-universal testing of all hospital admissions and non-admitted ER patients in the Brigham and Women’s Hospital (BWH) in Boston, MA, between April 15 and November 10, 2020 (Materials and Methods: Brigham and Women’s Hospital Data). We aligned these with estimates for *R _{t}* based on case counts in Massachusetts (Fig. 4, A to C). The median and skewness of the detectable Ct distribution was correlated with

*R*(Fig. 4B), in line with our theoretical predictions (depicted in Fig. 1). The median Ct value rose (corresponding to a decline in median viral load) and skewness of the Ct distribution fell in the late spring and early summer, as shelter-in-place orders and other non-pharmaceutical interventions were rolled out (Fig. 4C), but the median declined and skewness rose in late summer and early fall as these measures were relaxed, coinciding with an increase in observed case counts for the state (Fig. 4A).

_{t}Using the observed Ct values, we estimated the daily growth rate of infections using the SEIR model on single cross-sections (Fig. 4D and figs. S14 and S15) and the full epidemic trajectory using the GP model (Fig. 4E and fig. S16). Similar temporal trends were inferred under both models (fig. S17), and the GP model provided growth rate estimates that followed those estimated using observed case counts (Fig. 4F). While these data are not strictly a random sample of the community, and the observed case counts do not necessarily provide a ground truth for the *R _{t}* value, these results demonstrate the ability of this method to re-create epidemic trajectories and estimate growth or decline of cases using only positive Ct values collected through routine testing. We assessed the robustness of the estimated GP trajectory to smaller sample sizes by refitting the model after subsampling different numbers of Ct values from the dataset (fig. S18). Interestingly, our estimated epidemic trajectory using only routinely generated Ct values from a single hospital was remarkably correlated with changes in community-level viral loads obtained from wastewater data (fig. S19) (

*35*).

## Discussion

The usefulness of Ct values for public health decision making is currently the subject of much discussion and debate. One unexplained observation which has been consistently observed in many locations is that the distribution of observed Ct values has varied over the course of the current SARS-CoV-2 pandemic, which has led to questions over whether the fitness of the virus has changed (*12*, *14*, *16*). Our results demonstrate that this can be explained as an epidemiologic phenomenon, without invoking any change in individual-level viral kinetics or testing practices. This method alone, however, cannot prove that is the case for any specific setting, as changing viral properties or changes in test availability may also lead to such shifts in Ct value distributions. We find that properties of the population-level Ct distribution strongly correlate with estimates for the effective reproductive number or growth rate in real-world settings, in line with our theoretical predictions.

Using quantitative diagnostic test results from multiple different tests conducted in a single cross-sectional survey, epidemic trends have previously been inferred from virological data (*18*). The methods we describe here use the phenomenon observed in the present pandemic and the relationship between incidence rate, time since infection, and virologic test results to estimate a community’s position in the epidemic curve, under various models of epidemic trajectories, based on data from one or more cross-sectional surveys using a single virologic test. Comparisons of simulated Ct values and observed Ct values with growth rates and *R _{t}* estimates validate this general approach. Despite the challenges of sampling variability, individual-level differences in viral kinetics, and the limitations of comparing results from different laboratories or instruments, our results demonstrate that RT-qPCR Ct values, with all of their variability for an individual, can be highly informative of population-level dynamics. This information is lost when measurements are reduced to binary positive or negative classifications, as has been the case through most of the SARS-CoV-2 pandemic.

Here, we focused on the case of randomly sampling individuals from the population. This method will therefore be most useful in settings where representative surveillance samples can be obtained independently of COVID-19 symptoms, such as the REACT study in England (*36*). Even relatively small cross-sectional surveys, for example in a given city, may be very useful for understanding the direction that an outbreak is heading. Standardized data collection and management across regions, along with wider use of random sampling, would further improve the usefulness of these methods, which demonstrate another use case for such surveillance (*37*, *38*). These methods allow municipalities to evaluate and monitor, in real-time, the role of various epidemic mitigation interventions, for example by conducting even a single or a small number of random virologic testing samples as part of surveillance rather than simply relying on routine testing results.

Extrapolation of these findings to Ct values obtained through strategies other than a population census or a mostly random sample requires additional considerations. When testing is based primarily on the presence of symptoms or contact tracing efforts, infected individuals are more likely to be sampled at specific times since infection, which will impact the distribution of measured Ct values. Further complications arise when the delay between infection or symptom onset and sample collection changes over the course of the epidemic, for example because of a strain on testing capacity. Nonetheless, our simulation results suggest that the epidemic trajectory can still influence Ct values measured under symptom-based surveillance, though the strength of this association will depend on a number of additional considerations as described in fig. S4. Additional work is needed to extend the inference methods presented here to use non-random surveillance samples.

The overall finding of a link between epidemic growth rates and measured Ct distributions is important for interpreting virologic data in light of emerging SARS-CoV-2 variants (*39*, *40*). When samples are obtained through population-wide testing, an association between lower Ct values and emerging variants can be partially explained by those variants having a higher growth rate with a preponderance of recent infections compared to pre-existing, declining variants. For example, a recent analysis of Ct values from P.1 and non-P.1 variant samples in Manaus, Brazil initially found that P.1 samples had significantly lower Ct values (*41*). However, after accounting for the time between symptom onset and sample collection date (where shorter delays should lead to lower Ct values), the significance of this difference was lost. We caution that this finding does not exclude the possibility of newer variants causing infections with higher viral loads; rather, it highlights the need for lines of evidence other than surveillance testing data.

These results are sensitive to the true distribution of observed viral loads each day after infection. Different swab types, sample types, instruments, or Ct thresholds may alter the variability in the Ct distribution (*15*, *16*, *42*, *43*), leading to different relationships between the specific Ct distribution and the epidemic trajectory. Where possible, setting-specific calibrations, for example based on a reference range of Ct values, will help to generate precise estimates. This method will be most useful in cases where the population-level viral load kinetics can be estimated, either through direct validation or by comparison with a reference standard, for the instruments and samples used in testing. Here, we generated a viral kinetics model based on observed properties of measured viral loads in the literature (proportion detectable over time following symptom onset, distribution of Ct values from positive specimens) and used these results to inform priors on key parameters when estimating growth rates. The growth rate estimates can therefore be improved by choosing more precise, accurate priors relevant to the observations used during model fitting. In cases where results come from multiple testing platforms, the model should either be adjusted to account for this by specifying a different distribution for each platform based on its properties or, if possible, the Ct values should be transformed to a common scale such as log viral copies. If these features of the tests change substantially over time, results incorporating multiple cross-sections might exhibit bias and will not be reliable.

Results could also be improved if individual-level features that may affect viral load, such as symptom status, age, and antiviral treatment, are available with the data and incorporated into the Ct value model (*14*–*16*, *44*, *45*). A similar approach may also be possible using serologic surveys, as an extension of work that relates time since infection to antibody titers for other infectious diseases (*27*, *28*). If multiple types of tests (e.g., antigen and PCR) are conducted at the same time, combining information could substantially reduce uncertainty in these estimates (*18*). If variant strains are associated with different viral load kinetics and become common (*40*, *46*), this should be incorporated into the model as well. Other features of the pathogen, such as the relationship between the viral loads of infector and infectee, might also affect population-level variability over time. Using virologic data as a source of surveillance information will require investment in better understanding Ct value distributions as new instruments and techniques come online and as variants emerge, and in rapidly characterizing these distributions for future emerging infectious diseases. Remaining uncertainty can be incorporated into the Bayesian prior distribution.

This method has several limitations. While the Bayesian framework incorporates the uncertainty in viral load distributions into inference on the growth rate, parametric assumptions and reasonably strong priors on these distributions aid in identifiability. If these parametric assumptions are violated, for example when SEIR models are used across time periods when interventions likely affected transmission rates, inference may not be reliable. In addition, the methods described here and the relationship between incidence and skewness of Ct distributions become less reliable when there are very few positive cases, so results should be interpreted with caution and sample sizes increased in periods with low incidence. In some cases, with one or a small number of cross-sections, the observed Ct distribution could plausibly result from all individuals very early in their infection at the start of fast epidemic growth, all during the recovery phase of their infection during epidemic decline, or a mixture of both (Fig. 4E and fig. S15). We therefore used a parallel tempering Markov chain Monte Carlo (MCMC) algorithm for the single cross-section estimates, which can accurately estimate these multimodal posterior distributions (*47*). Interpretation of the estimated median growth rate and credible intervals should be done with proper epidemiological context: estimated growth rates that are grossly incompatible with other data can be safely excluded.

This method may also overstate uncertainty in the viral load distributions if results from different machines or protocols are used simultaneously to inform the prior. A more precise understanding of the viral load kinetics—in particular, modeling these kinetics in a way that accounts for the epidemiologic and technical setting of the measurements—will help improve this approach and determine whether Ct distribution parameters from different settings are comparable. Because of this, semiquantitative measures from RT-qPCR should be reported regularly for SARS-CoV-2 cases and early assessment of pathogen load kinetics should be a priority for future emerging pathogens. The use of control measurements, like using the ratio of detected viral RNA to detected human RNA, could also improve the reliability and comparability of Ct measures.

The Ct value is a measurement with magnitude, which provides information on underlying viral dynamics. Although there are challenges to relying on single Ct values for individual-level decision making, the aggregation of many such measurements from a population contains substantial information. These results demonstrate how one or a small number of random virologic surveys can be best used for epidemic monitoring. Overall, population-level distributions of Ct values, and quantitative virologic data in general, can provide information on important epidemiologic questions of interest, even from a single cross-sectional survey. Better epidemic planning and more targeted epidemiological measures can then be implemented based on such a survey, or Ct values can be combined across repeated samples to maximize the use of available evidence.

## Materials and Methods

### Long-Term Care Facilities Data

Data from Massachusetts long-term care facilities were nasopharyngeal specimens collected from staff and residents processed at the Broad Institute of MIT and Harvard CRSP CLIA laboratory, with an FDA Emergency Use Authorized laboratory-developed assay. Ct values for N1 and N2 gene targets were provided along with sample collection date, a random tube ID, and a unique anonymized institute ID to reflect that specimens came from distinct institutions. The specimens used here originated in early 2020 when public health efforts in Massachusetts led to comprehensively serial testing senior nursing facilities as described previously (*29*). Swabs from those public health efforts were processed for clinical diagnostics. Sample collection dates ranged from April 6, 2020, to May 5, 2020, with each facility undergoing three sampling rounds. Each round took a median of 2 days (range 1–6 days) to complete. The anonymized Ct data was made available, and the N2 Ct values were used for these analyses. For all analyses presented here, sample collection dates were grouped into sampling rounds and analyzed based on the mean collection date for that round (i.e., the dates shown in Fig. 2 and figs. S6 and S7).

### Brigham and Women’s Hospital Data

Data from the Brigham and Women’s Hospital in Boston, Massachusetts, were nasopharyngeal specimens from patients processed on a Hologic Panther Fusion SARS-CoV-2 assay. Ct values for the ORF1ab gene were provided alongside sample collection date, with collection dates ranging from April 3, 2020, to November 10, 2020. For these analyses, we grouped samples by week of collection on the epidemiological calendar and used the mid-point of each week for the analyses shown in Fig. 4. Testing during the first two weeks in April 2020 was restricted to patients with symptoms consistent with COVID-19 and who needed hospital admission. Following April 15, testing criteria for this platform were expanded to include all asymptomatic hospital admissions, symptomatic patients in the emergency room who were not admitted to the hospital, and inpatients requiring testing who were not in labor. Symptomatic ER patients who were admitted to the hospital were tested on a different PCR platform and are not considered here. In the analyses presented here, we use only samples taken after April 15. While this is not a perfectly representative surveillance sample, the routine testing of hospital admissions who were not seeking COVID treatment creates a cohort that is less biased than symptom-based testing and represents the overall rise and fall of cases in the hospital’s catchment area. Daily data are aggregated by week. Daily confirmed case counts for Massachusetts were obtained from *The New York Times*, based on reports from state and local health agencies (*48*).

### Epidemic Transmission Models

Throughout these analyses, we used four mathematical models to describe daily SARS-CoV-2 transmission over the course of an epidemic. Full model descriptions are given in Supplementary Materials: Epidemic Transmission Models, and a brief overview is provided here in order of introduction in the main text. First, the *SEIR Model* is a compartmental model which assumes that the growth rate of new infections depends on the current prevalence of infectious and susceptible individuals by modeling the proportion of the population who are susceptible, exposed, infected, or recovered with respect to disease over time. Second, the *Exponential Growth Model* assumes that new infections arise under a constant exponential growth rate. Third, the *SEEIRR Model* is a modification of the SEIR model with additional compartments for individuals who are exposed but not yet detectable by PCR and individuals who are recovered but still detectable by PCR. Finally, the *Gaussian Process Model* describes the epidemic trajectory as a vector of daily infection probabilities, where a Gaussian process prior is used to ensure that daily infection probabilities are correlated in time; days that are chronologically close in time are more correlated than those that are chronologically distant.

### Ct Value Model

We developed a mathematical model describing the distribution of observed SARS-CoV-2 viral loads over time following infection. The model is described in full in Supplementary Materials: Ct Value Model. This model is similar to that used by Larremore *et al*. (*49*), but allows for more flexibility in the decline of viral load during recovery. We used a parametric model describing the modal Ct value, *C _{mode}*(

*a*), for an individual

*a*days after infection, represented by the solid black line in fig. S1B. The measured Ct value is a linear function of the log of the viral load in the sample, but we describe the model on the Ct scale to match the data. Because we are interested in the population-level distribution and not individual trajectories, we assumed that observed Ct values

*a*days after infection,

*C*(

*a*), followed a Gumbel distribution with location (mode) parameter

*C*(

_{mode}*a*) and scale parameter σ(

*a*) that also may depend on the number of days

*a*after infection. We chose a Gumbel distribution to capture overdispersion of high measured Ct values. This distribution captures the variation resulting from both swabbing variability and individual-level differences in viral kinetics. We note that at any point in the infection, there is a considerable amount of person-to-person and swab-to-swab variation in viral loads (

*50*–

*52*), including a possible difference by symptom status (

*15*,

*53*,

*54*). Tracking individual-level viral kinetics would require a hierarchical model capturing individual-level parameters, but is not necessary for this analysis.

The rationale behind this parameterization and the chosen parameter values is discussed in Supplementary Materials: Selecting Viral Kinetics and Compartmental Model Parameters. We note that in all analyses, we used informative priors for key features of viral load kinetics rather than fixing point estimates, incorporating uncertainty into our inference. The process for generating these priors is described in Supplementary Materials: Informing the Viral Kinetics Model. We performed this calibration step separately for the long-term care facility and BWH datasets, as the gene targets and testing platform were different, and thus Ct values are not directly comparable.

### Relationship Between Observed Ct Values and Daily Probability of Infection

#### Single Cross-Section Model

For a single testing day *t*, let *A _{max}* days to 1 day prior to

*t*, respectively, where

*t*−

*A*is the earliest day of infection that would result in detectable PCR values on the testing day. That is, π

_{max}

_{t}_{ − }

*is the probability that a randomly-selected individual in the population was infected on day*

_{a}*t*−

*a*. Let

*p*(

_{a}*x*) be the probability that the Ct value is

*x*for a test conducted

*a*days after infection given that the value is detectable (i.e., the Gumbel probability density function normalized to the observable values). Then

*p*(

_{a}*x*) =

*P*[

*C*(

*a*) =

*x*]/

*P*[0 ≤

*C*(

*a*) <

*C*], where

_{LOD}*P*[

*C*(

*a*) =

*x*] is the Gumbel probability density function with location parameter

*C*(

_{mode}*a*) and scale parameter σ(

*a*). Let

*a*days after infection, which depends on

*C*(

*a*) and any additional decline in detectability. Let the PCR test results from a sample of

*n*individuals be recorded as

*X*

_{1},…,

*X*. Then, for

_{n}*x*<

_{i}*C*(i.e., a detectable Ct value), the probability of individual

_{LOD}*i*having Ct value

*x*is given by:

_{i}*t*is:

*n*PCR values is given by:

If only detectable Ct values are recorded as *X*_{1},…,*X _{n}*, then the likelihood function is given by:

*R*is a parameter estimated by the model but does not vary over time (i.e., there are no interventions that reduce transmissibility). See Supplementary Materials: Parametric Models for Fitting Cross-Sectional Viral Load Data for details of the likelihoods used in these methods.

_{0}#### Multiple Cross-Sections Model

Now we consider settings where there are multiple days of testing, *t*_{1},…,*t _{T}*. We again denote by π

*the probability of infection on day*

_{t}*t*and now denote the sampled Ct value for the

*i*

^{th}individual sampled on test day

*t*by

_{j}*i*∈ 1,…,

*n*for test day

_{j}*j*and

*j*∈ 1,…,

*T*. Note that individual

*i*may refer to different individuals on different testing days. Let {π

*} be the daily probabilities of infection for any day*

_{t}*t*where an infection on day

*t*could be detectable using a PCR test on one of the testing days. By a straightforward extension of the likelihood for the single cross-section model, the nonparametric likelihood for the set of infection probabilities {π

*}, when samples with and without a detectable Ct value are included, is given by:*

_{t}*t*.

_{j}Only considering samples with a detectable Ct value gives the likelihood:

The multiple cross-section likelihood is primarily used to fit the Gaussian Process model, estimating the daily probability of infection, {π* _{t}*}, conditional on the set of observed Ct values. (See Supplementary Materials: Parametric Models for Fitting Cross-Sectional Viral Load Data). The SEIR model can be used with multiple testing days as well. It is fit as described for the single cross-section model, but with one of these likelihoods in place of the single cross-section model likelihood, with posterior distribution estimates obtained via MCMC fitting.

### Markov Chain Monte Carlo Framework

All models, including those using Ct values (*SEIR Model*, *Exponential Growth Model* and the *Gaussian Process Model*) and those using only prevalence (*SEIR Model*, *SEEIRR Model*) were fitted using a Markov chain Monte Carlo framework. We used a Metropolis-Hastings algorithm to generate either multivariate Gaussian or univariate uniform proposals. For all single-cross section analyses (Figs. 2 and 3), we used a modified version of this framework with parallel tempering: an extension of the algorithm that uses multiple parallel chains to improve sampling of multimodal posterior distributions (*47*). For the multiple cross section analyses including those in Fig. 4, we used the unmodified Metropolis-Hastings algorithm because the computational time of the parallel tempering algorithm is far longer, and these analyses were underpinned by more data and less affected by multimodality. In all analyses, three chains were run upward of 80,000 iterations (500,000 iterations for the Gaussian process models). Convergence was assessed based on all estimated parameters having an effective sample size greater than 200 and a potential scale reduction factor *coda* R package (*55*). All assumed prior distributions are described in table S1.

### Simulated Data

All simulated data were generated under the same framework but with different models and assumptions for the underlying epidemic trajectory. For each simulation, data were generated in four steps: 1) the daily probability of transmission, {π* _{t}*}, is calculated using either a deterministic SEIR, a stochastic SEIR model, or a Gaussian Process model; 2) On each day of the simulation, new infections are simulated under the model

*I*~

_{t}*Binomial*(

*N*, π

*), where*

_{t}*N*is the population size of the simulation and

*I*is the number of new infections on day

_{t}*t*(all other individuals are assumed to have escaped infection); 3) A subset of individuals are sampled on particular days of the simulation determined by the testing schemes described below and in Supplementary Materials: Comparison of Analysis Methods; 4) For each individual sampled on day

*u*, a Ct value was simulated under the model

*X*~

_{i}*Gumbel*(

*C*(

_{mode}*u*−

*t*), σ(

_{inf}*u*−

*t*)), where

_{inf}*t*is the time of infection for individual

_{inf}*i*.

*C*(

_{mode}*u*−

*t*) and σ(

*u*−

*t*) are described in Supplementary Materials: Ct Value Model.

### Simulated Testing Schemes

Standard approaches to estimating doubling time, growth rate, or *R _{t}* are subject to misestimation as a result of changes in testing policies (

*5*). To assess the effect of such changes on our methods, we simulate changes in testing rates and assess the effect on several methods for

*R*estimation: using

_{t}*EpiNow2*with reported case counts (

*33*), using Ct-based methods with random surveillance samples, and using PCR test positivity alone with surveillance samples. We test these methods at two periods of an outbreak: once when the epidemic is rising and once when it is falling. For the random samples for each of these analysis time points, we test from one to three days of sampling for virologic testing with varying sample sizes across the test days. Results are shown in Fig. 3 and fig. S13; more details are in the Supplementary Materials: Comparison of Analysis Methods.

## Supplementary Materials

science.sciencemag.org/cgi/content/full/science.abh0635/DC1

Materials and Methods

Figs. S1 to S19

Table S1

MDAR Reproducibility Checklist

Movies S1 to S3

Data S1 and S2

This is an open-access article distributed under the terms of the Creative Commons Attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## References and Notes

**Acknowledgments:**We thank Steven Riley for helpful discussions. IRB approval was obtained from Harvard School of Public Health for use of de-identified hospital-based viral load data (IRB 20-1703) and the Broad institute ORSP provided an exempt approval state for use of surveillance-based Ct values.

**Funding:**U.S. National Institutes of Health Director’s Early Independence Award DP5-OD028145 (MJM, JAH). Morris-Singer Fund (LKS, ML). U.S. Centers for Disease Control and Prevention Award U01IP001121 (LKS, ML). U.S. National Institute of General Medical Sciences award U54GM088558 (JAH, LKS, ML). U.S. National Cancer Institute of the National Institutes of Health Award U01CA261277 (ML). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

**Author contributions:**Conceptualization: JAH, LKS, ML, MJM. Methodology: JAH, LKS, ML, MJM. Visualization: JAH, LKS, ML, MJM. Investigation: JAH, LKS, SK, ML, MJM. Resources: SK, NJL, SBG, MJM. Data curation: JAH, SK, NJL, SBG, MJM. Software: JAH, LKS, SK, NJL, SBG. Funding acquisition: ML, MJM. Supervision: ML, MJM. Writing – original draft: JAH, LKS, SK, ML, MJM. Writing – review and editing: JAH, LKS, SK, NJL, SBG, ML, MJM.

**Competing interests:**ML discloses honoraria/consulting from Merck, Affinivax, Sanofi-Pasteur, and Antigen Discovery; research funding (institutional) from Pfizer, and unpaid scientific advice to Janssen, Astra-Zeneca, and Covaxx (United Biomedical). MJM is a medical advisor for Detect. All other authors declare no competing interests.

**Data and materials availability:**All code to perform the analyses and generate the figures presented in this article is available under the GNU General Public License Version 3 at https://github.com/jameshay218/virosolver_paper (

*56*) and https://github.com/jameshay218/virosolver (

*57*). Simulated data and real data used in the analyses are also available at https://github.com/jameshay218/virosolver_paper (

*56*). For the model fitting, code for the Markov chain Monte Carlo framework is available at https://github.com/jameshay218/lazymcmc (

*58*) and https://github.com/jameshay218/lazymcmc/tree/parallel_tempering ((

*58*), see v0.9). The authors used code developed by Abbott

*et al*. to estimate

*R*from reported case counts; this is available at https://github.com/epiforecasts/EpiNow2 (

_{t}*59*). We note that this package has since been updated; we reference the version of code used in these analyses. This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. This license does not apply to figures/photos/artwork or other content included in the article that is credited to a third party; obtain authorization from the rights holder before using such material.