Opportunities and challenges in modeling emerging infectious diseases


Science  14 Jul 2017:
Vol. 357, Issue 6347, pp. 149-152
DOI: 10.1126/science.aam8335


The term “pathogen emergence” encompasses everything from previously unidentified viruses entering the human population to established pathogens invading new populations and the evolution of drug resistance. Mathematical models of emergent pathogens allow forecasts of case numbers, investigation of transmission mechanisms, and evaluation of control options. Yet, there are numerous limitations and pitfalls to their use, often driven by data scarcity. Growing availability of data on pathogen genetics and human ecology, coupled with computational and methodological innovations, is amplifying the power of models to inform the public health response to emergence events. Tighter integration of infectious disease models with public health practice and development of resources at the ready have the potential to increase the timeliness and quality of responses.

Public health emergencies driven by emerging infectious diseases are at the forefront of global awareness. From HIV in the 1980s to Zika virus’s (ZIKV’s) recent invasion of the Americas, models that mathematically capture disease processes have played a role in assessing the risk and framing the response to emerging pathogens. The most prominent, and perhaps most fraught, role of such models is to forecast the course of epidemics (1, 2). Yet, explicit representation of mechanisms of spread and persistence can help us to do far more than forecast incidence. Models can elucidate the properties of emergent pathogens (3, 4), uncover general principles of emergence (5), and compare potential mechanisms of spread and persistence (6).

Models are only as good as the data on which they rely. Data scarcity is the norm when a previously unknown pathogen emerges, amplifying uncertainty and obscuring key drivers of the epidemic. Misrepresentation of core mechanisms can bias inferences and potentially misdirect intervention efforts. The strengths of models must be considered in the context of the limitations and pitfalls of their use.

Here, we focus on emergent viruses, both because of the speed with which they can spread death and disease and because the dynamics of viral epidemics exemplify key principles in the modeling of infectious diseases. This focus should not, however, detract from the importance of nonviral emergence events, nor from the particular issues involved in modeling nonviral pathogens.

The “classical” dynamic modeling toolkit

The past decade has seen several viral emergence events, including the 2009 pandemic of H1N1 influenza, the emergence of Middle East Respiratory Syndrome–associated coronavirus (MERS-CoV) in the Arabian peninsula, the West African Ebola outbreak, and ZIKV’s invasion of the Americas. These diseases are very different: Pandemic H1N1 is spread person-to-person and is closely related to seasonally circulating influenza viruses (3); since its emergence, MERS-CoV has failed to persist outside of the Middle East, and the epidemic appears to be largely driven by zoonotic infections from camels (although a rapidly contained human-driven outbreak occurred in South Korea) (6); Ebola is extremely virulent and spread mostly through direct contact with very sick or dead cases (7); and ZIKV is a mosquito-transmitted virus, known for decades but recently discovered to be a cause of severe pathogenic disease after emerging in the Americas (8).

Despite these differences, the response to each emerging virus has relied on models grounded in the same dynamic principles and key data that have informed the response to emerging infections since at least the 1980s (Fig. 1). Then, Anderson and May used mathematical models to elucidate the key variables required for forecasting the future trajectory and impact of the emerging HIV epidemic (Box 1) (9). Although there have been substantial advances in our ability to assess disease threats—ranging from increased statistical rigor enabled by powerful computers to entirely new methods of inference driven by analyses of pathogen genetics (3, 4)—the underlying core principles remain the same.

Fig. 1 Classic zoonotic emergence to human-to-human transmission.

(1) Typically, emergence occurs after a pathogen circulating in an animal reservoir enters the human population. (2) The key to determining whether the pathogen will pose a sustained threat to humans is the basic reproductive number R0. (3) Generally, models assume every human is susceptible, but there may be substantial unseen immunity. (4) The generation time combined with R0 determines the speed of epidemic growth. (5) Asymptomatic and undetected cases and (6) superspreading events can have important impacts on disease dynamics and control that are not obvious from observed aggregate case counts. Reducing or eliminating transmission in these contexts can have a disproportionate impact on reducing R and controlling the epidemic. (2) R0 and (4) the generation time combined with (7) the case fatality ratio and (8) severe outcomes determine the impact of the disease on the human population. Mechanistic models are both informed by these values and can be used to estimate them, and (9) the combination of such models with biological sampling of a subset of cases can allow for inferences when simple observational data do not.

Box 1

Key dynamic quantities estimated early after an emergence event. There are several key dynamic quantities that determine the course of the epidemic and indicate needs for the structure of the response that should be identified rapidly after emergence:

  • Basic reproductive number (R0): The number of cases expected to be directly infected by a single index case in an immunologically naive population. This provides an estimate of the transmissibility of an emergent pathogen. If R0 < 1, the emerging pathogen will die out, whereas if R0 > 1, the pathogen can spread widely and cause a major epidemic or pandemic. R0 further determines the final size of epidemics in the absence of control measures.

  • Reproductive number (R): The number of cases expected to be directly infected by a single infected individual in a population in which there is some underlying immunity.

  • Generation time: The time between a case becoming infected and that case causing other infections. Combined with R0, this determines the speed at which an epidemic spreads through the population.

  • Incubation period: The time from infection to symptom onset.

  • Latent period: The time from infection to becoming infectious.

  • Infectious period: The length of time that infected individuals can transmit.

  • Case fatality ratio: The proportion of cases that prove fatal.

  • Hospitalization rate/clinical attack rate: The proportion of cases in which disease is sufficiently severe as to result in hospitalization, potentially affecting detection via passive surveillance.

  • Asymptomatic proportion: The proportion of infected individuals who do not develop recognizable symptoms.
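
The threshold role of R0 and its link to final epidemic size can be illustrated with a minimal sketch. It assumes a homogeneously mixing population and a simple SIR-type epidemic with no control measures, in which case the fraction ultimately infected, z, satisfies z = 1 - exp(-R0 * z). It also assumes that prior immunity is fully protective and evenly spread, so that R = R0 * (1 - immune fraction). The function names are illustrative and are not taken from any of the studies cited here.

```python
import math


def final_size(r0, tol=1e-12, max_iter=10_000):
    """Fraction of the population ultimately infected.

    Assumes a simple SIR-type epidemic in a homogeneously mixing,
    fully susceptible population with no control measures.
    Solves z = 1 - exp(-r0 * z) by fixed-point iteration.
    """
    if r0 <= 1.0:
        # Below the threshold no major epidemic occurs in this model.
        return 0.0
    z = 0.5  # any starting value in (0, 1] converges when r0 > 1
    for _ in range(max_iter):
        z_new = 1.0 - math.exp(-r0 * z)
        if abs(z_new - z) < tol:
            return z_new
        z = z_new
    return z


def effective_r(r0, immune_fraction):
    """Reproductive number R when a fraction of the population is immune.

    Assumes immunity is fully protective and evenly distributed.
    """
    return r0 * (1.0 - immune_fraction)


if __name__ == "__main__":
    for r0 in (0.7, 1.5, 2.0):
        print(f"R0={r0}: final size {final_size(r0):.3f}")
    print(f"R with 40% immune, R0=1.5: {effective_r(1.5, 0.4):.2f}")
```

Real epidemics depart from these assumptions (heterogeneous contact, behavior change, interventions), so such numbers are best read as a rough upper-level guide rather than a forecast.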

When responding to an emerging virus, perhaps the first priority is measuring the distributions of R0 and generation time (Box 1). R0 is of particular interest because it determines whether the disease will die out after introduction. For example, early estimates of R0 for MERS-CoV were well below 1 (4, 10), whereas estimates for pandemic H1N1 were in the neighborhood of 1.5 (3, 11). The former remains confined to the Arabian peninsula, and to persist, it apparently requires continuous reseeding into the human population from camels, whereas the latter has established itself globally. Knowing the generation time allows us to estimate R0 from the growth in case numbers during the early (exponential growth) phase of an epidemic. Likewise, using these two values, relatively accurate short-term forecasts can be made early on with simple models (Fig. 2).
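
As a rough illustration of how growth in early case counts and the generation time together yield R0, the sketch below fits an exponential growth rate r to a case series and converts it to R0 under two textbook assumptions about the generation time: exponentially distributed (R0 = 1 + r * Tg) or fixed (R0 = exp(r * Tg)). The case counts, the 3-day generation time, and all function names are hypothetical and chosen only for illustration; they do not reproduce the methods of the cited studies.

```python
import math


def growth_rate(cases):
    """Least-squares slope of log(cases) against time step.

    `cases` holds counts at equally spaced times during the
    early exponential phase; every count must be positive.
    """
    if len(cases) < 2 or any(c <= 0 for c in cases):
        raise ValueError("need at least two positive counts")
    n = len(cases)
    xs = list(range(n))
    ys = [math.log(c) for c in cases]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def r0_exponential_gt(r, mean_gt):
    """R0 if the generation time is exponentially distributed."""
    return 1.0 + r * mean_gt


def r0_fixed_gt(r, mean_gt):
    """R0 if every generation time equals mean_gt exactly."""
    return math.exp(r * mean_gt)


def short_term_forecast(last_count, r, horizon):
    """Project counts `horizon` steps ahead assuming growth continues."""
    return last_count * math.exp(r * horizon)


if __name__ == "__main__":
    daily_cases = [12, 14, 15, 17, 19, 21, 23, 26, 28, 32]  # hypothetical
    r = growth_rate(daily_cases)
    tg = 3.0  # hypothetical mean generation time in days
    print(f"r = {r:.3f} per day")
    print(f"R0 between {r0_exponential_gt(r, tg):.2f} "
          f"and {r0_fixed_gt(r, tg):.2f}")
    print(f"7-day projection: {short_term_forecast(daily_cases[-1], r, 7):.0f}")
```

The two conversions bracket a range because the true generation time distribution is usually uncertain early on; reporting both is one simple way to expose that sensitivity.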

Fig. 2 Phases of the emergence process.

(A) Preemergence period, which can last years to decades and can feature occasional zoonotic transmission events. (B) Short-term postemergence period: the first several generations (lasting months to years, depending on the pathogen’s generation time), characterized by exponential growth. (C) Medium-term postemergence period: patterns driven by hard-to-predict aspects of pathogen ecology and human behavior. (D) Long-term postemergence period: general trends dictated by pathogen properties and basic epidemic theory.


Moving from forecasting cases to forecasting disease burden requires estimates of the risk of severe illness and mortality after infection. Dynamic aspects of both the disease and reporting processes, and potentially large numbers of unobserved infections (Fig. 1), mean that models of both are often necessary to estimate these quantities (12). Biases can go both ways; models of the time course of individual infections have been used to correct for underestimates of the case fatality ratio for both SARS-CoV and Ebola early in each epidemic (12, 13), whereas statistical models that account for the structure of reporting have been used to adjust for upward biases because of underreporting of less severe cases of MERS-CoV (14).
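
The downward bias in early case fatality estimates can be made concrete with a simplified sketch. Dividing deaths by all reported cases ignores the fact that many recent cases have not yet had time to die; one common remedy is to divide instead by the number of cases whose outcome would be expected to be known by now. The code below assumes the distribution of the delay from onset to death is known and supplied as a cumulative distribution function. The exponential delay, the numbers, and the function names are hypothetical, and this is a stand-in for, not a reproduction of, the models in the cited work.

```python
import math


def naive_cfr(deaths_to_date, onsets_by_day):
    """Deaths divided by all cases reported so far."""
    return deaths_to_date / sum(onsets_by_day)


def delay_adjusted_cfr(deaths_to_date, onsets_by_day, delay_cdf):
    """Deaths divided by cases expected to have a known outcome.

    onsets_by_day[i] is the number of cases with onset on day i;
    the last index is treated as today. delay_cdf(d) is the assumed
    probability that a fatal case has died within d days of onset.
    """
    today = len(onsets_by_day) - 1
    expected_resolved = sum(
        n * delay_cdf(today - day) for day, n in enumerate(onsets_by_day)
    )
    if expected_resolved <= 0:
        raise ValueError("no cases are expected to have resolved yet")
    return deaths_to_date / expected_resolved


def exponential_delay_cdf(mean_days):
    """Hypothetical onset-to-death delay: exponential with given mean."""
    return lambda d: 1.0 - math.exp(-d / mean_days)


if __name__ == "__main__":
    onsets = [2, 3, 5, 8, 12, 18, 27, 40]  # hypothetical growing epidemic
    deaths = 6
    cdf = exponential_delay_cdf(mean_days=7.0)
    print(f"naive CFR:    {naive_cfr(deaths, onsets):.2f}")
    print(f"adjusted CFR: {delay_adjusted_cfr(deaths, onsets, cdf):.2f}")
```

The adjustment addresses only the delay bias; it does nothing about the opposite bias from unreported mild cases, which needs a model of the reporting process.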

Subsequent to characterizing the growth and potential health impact of an emerging epidemic, modeling efforts need to move into evaluation of control measures. The design and effectiveness of interventions, such as isolation and quarantine, depend critically on the distribution of the disease’s incubation period and latent period, as well as the frequency of asymptomatic infection (5, 15). More detailed models require more data or assumptions but can be used for strategic evaluation of particular interventions. In response to the Ebola epidemic, models were used to compare the impact of case isolation, contact-tracing with quarantine, and sanitary funeral practices (16) and to evaluate the impact of travel bans and exit and entry screening at airports (17). Such models may not accurately forecast the exact number of cases prevented by each intervention; their value is in providing an assessment of relative impact. Importantly, such models rely not only on accurate characterization of the epidemic process but should also include logistics, health systems, and human behaviors.

Information on R0, human demographics, and the distribution and duration of immunity also allows us to make general predictions about an emergent pathogen’s long-term impact (Fig. 2). For instance, basic models of the postinvasion dynamics of an immunizing infection predict a lull in ZIKV transmission in the Americas, starting a few years after its introduction and potentially lasting decades (2). The presence of preexisting immunity can also profoundly affect both short- and long-term epidemic patterns and is a critical, usually missing, component when forecasting epidemics.

Limitations of the classical toolkit and pitfalls in model-based approaches

Basic infectious disease models can forecast short-term incidence and broadly characterize long-term trends (Fig. 2) and can do so with increasing accuracy as new data become available. However, medium-term forecasts may require unattainably detailed information about biological, ecological, and social systems. Variations in patterns of infectious contact, arising from local contact structure to regional variation in mobility, are driven by changing human behavior. For example, even in the absence of formal directives, people instinctively changed their behavior in the presence of Ebola cases to reduce transmission. This phenomenon has helped control previous outbreaks in the Democratic Republic of Congo and elsewhere and may have contributed to mismatches between pessimistic forecasts of the 2014 West African Ebola outbreak and its observed trajectory (18). Control efforts may also change the trajectory of an epidemic, invalidating forecasts that do not adequately account for the effects of control. Variation in environmental suitability for transmission at small and large spatial scales may also shape spread in unanticipated ways. This is particularly true for vector-borne diseases, even when nuances of biological mechanism (for example, the effect of temperature on mosquito survival) are well known (19, 20).

The lack of data that makes mechanistic models essential to the response to an emergent pathogen also handicaps these models. Numerous modeling exercises and hard experience show that appropriate control measures must be implemented early in an emergent epidemic in order to be successful (21, 22). This sets up a Catch-22 scenario because the time when good models are most needed is when they are the hardest to make. The potential to publish high-profile results and a desire to have a meaningful impact on the public health response can lead to an explosion of modeling studies (a PubMed search for “mathematical model Ebola” since 2014 yields more than 300 results). These may be based on data and methods of varying quality and are rarely followed by any attempt to synthesize results and reach consensus.

Efforts to “get ahead of the game” by collecting data and creating models before disease emergence may be doomed because emergent pathogens rarely conform to our expectations. For example, despite extensive work preparing for a flu pandemic (21, 23, 24), pandemic H1N1 violated model assumptions on nearly every point: It arose in a different part of the world than anticipated, was the same major subtype as an already circulating strain, and failed to displace all circulating influenza A subtypes. Likewise, before the start of the 2014 outbreak, Ebola was considered a well-characterized threat that was of great concern in limited parts of East and Central Africa and caused only small epidemics. Although models suggest that the size of the West African outbreak was not inconsistent with previous knowledge (25), its scale and location were completely unanticipated.

One solution to the lack of data is to increase reliance on our understanding of biological mechanisms. However, models, particularly complex models, that are heavily based on prior knowledge or mechanistic assumptions have the potential to be (sometimes spectacularly) wrong. Forecasts of the range of ZIKV provide a prime example. Several models driven by vector ecology (and, in some cases, human sexual behavior) suggested a risk of major ZIKV outbreaks in the United States outside of the southernmost counties (26–28), a prediction inconsistent both with experience and observations from the ZIKV epidemic thus far. Models based on an assumed mechanism may be useful, reveal potential deviations from past experience, and galvanize public health action but should be paired with scrutiny of the data and investigation of why predictions deviate from experience.

Appropriately capturing and communicating uncertainty is a constant challenge when modeling emergent pathogens. Classical approaches that rely heavily on systems of differential equations are powerful but can easily neglect to account for statistical uncertainties arising from limited data and uncertainties arising from the role of chance events in the epidemic process (process uncertainty). The best recent work takes advantage of methods built on growing computational resources to capture both and also attempts to account for uncertainties driven by knowledge gaps. For instance, early work on pandemic H1N1 and MERS-CoV estimated R0 by using multiple approaches and assumptions, calculating confidence intervals for each, and reporting the full spectrum of results (3, 4). Epidemiologists must carefully evaluate which types of uncertainty are critical to capture, given the goals of their analysis; for example, process uncertainty may be critical when forecasting incidence but less so when comparing hypothetical interventions.
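
Process uncertainty can be illustrated with a minimal stochastic sketch. A deterministic model with R0 above 1 always predicts an epidemic, whereas in a branching process each introduction may die out by chance. The code below assumes each case infects a Poisson-distributed number of others with mean R0, ignores depletion of susceptibles, and treats any outbreak reaching a hypothetical cap of 1000 cases as "major". The Poisson offspring assumption, the cap, and the function names are illustrative choices only; real transmission is often more variable (for example, because of superspreading).

```python
import math
import random


def poisson(rng, lam):
    """Draw from a Poisson distribution (Knuth's method; small lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def outbreak_size(rng, r0, cap=1000):
    """Total cases seeded by one index case, stopping once cap is reached."""
    active, total = 1, 1
    while active > 0 and total < cap:
        new_cases = sum(poisson(rng, r0) for _ in range(active))
        total += new_cases
        active = new_cases
    return min(total, cap)


def fraction_died_out(r0, n_sims=2000, cap=1000, seed=1):
    """Share of simulated introductions that fade out before reaching cap."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    sizes = [outbreak_size(rng, r0, cap) for _ in range(n_sims)]
    return sum(size < cap for size in sizes) / n_sims


if __name__ == "__main__":
    for r0 in (0.7, 1.5):
        share = fraction_died_out(r0)
        print(f"R0={r0}: {share:.0%} of introductions died out")
```

Running many such simulations and summarizing the spread of outcomes is one simple way to report process uncertainty alongside the statistical uncertainty in the estimated parameters.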


Communication of results from modeling exercises can be difficult. The media and the lay public tend to focus on a model’s most dire predictions. In the fall of 2014, researchers at the Centers for Disease Control and Prevention (CDC), Atlanta, projected that in the absence of control measures, more than half a million Ebola cases would be reported in Liberia and Sierra Leone by early 2015 (1). The actual numbers were an order of magnitude lower (7). Although these projections may have indeed been unduly dire, the fact is that control measures were implemented in the region, and the CDC forecasts accounting for the potential impact of interventions (for example, increasing use of specialized Ebola treatment units) were far closer to what actually occurred. However, the media almost exclusively focused on the extreme values, contributing to an impression that the forecast process was a failure (29). A focus on prediction exacerbates the problem; models may be successful in their aims of identifying planning scenarios or evaluating hypothetical interventions but still produce unrealistic or unrealized epidemic forecasts. Media headlines tend to miss the nuanced difference between forecasts and planning scenarios, as occurred with early pandemic H1N1 planning exercises, in which attention focused almost exclusively on the high end of the worst-case scenarios (30).

Mechanistic models play an essential role in the response to emergence events. However, given the unknowns inherent in the situation, accurately characterizing an emerging pathogen is hard and always will be. Even when researchers are sensitive to limitations of their models, these limitations may be very hard to communicate. A focus on forecasting, intended or not, can obscure the broader value of mechanistic approaches: their ability to synthesize multiple complex data streams in an informative way, leveraging our understanding of biological and epidemic processes to improve situational awareness and reveal properties of pathogen transmission.

Improving modeling of emergent pathogens: Trends and opportunities

A clear trend over the past decade is the increasing availability of data on pathogen genetics. Whole-genome sequencing is ever more affordable, and novel analytic techniques are developing apace. Phylodynamic methods can characterize the timing of pathogen introduction—for example, indicating earlier ZIKV introduction into the Americas than originally thought (31) and suggesting both the single spillover origin and subsequent determinants of Ebola spread (32). Phylodynamic methods can also be paired with disease models to estimate R0 (3, 4), providing an alternative to analysis of case data for estimating this important quantity. However, the benefits of obtaining an estimate that is not subject to the same biases as case data should not obscure the fact that phylogenetic approaches have their own biases and limitations and are not a panacea.

Modeling of emergent pathogens has also benefited from the “big data” revolution driven by the ever-expanding pool of large data sets created through automated data collection. High-resolution (subkilometer) satellite-based measures of environmental variables, such as land surface temperature, are one novel big data stream that can be combined with other data sources via machine-learning algorithms to determine, for example, the likely range and local transmission intensity of vector-borne infections such as ZIKV (19, 20). Similar approaches have been used to disaggregate census data, yielding high-resolution maps of population density and demographics (33), thus tackling the perennial problem in epidemiology of determining the population at risk. Another trend is the increasing availability of data streams that characterize mobility, such as air-travel flows (34). Mobile phone-call records are a further source, yielding unprecedented temporal and spatial resolution on human mobility and aggregation (35). As ever, potential biases and limitations must be considered carefully; for example, spatial locations can only be mapped when a call is made, and mobile phone ownership is not necessarily representative of populations of interest.

Both phylodynamic and “big data” techniques have been enabled by increasing availability of affordable, high-performance computing resources. These resources also allow for implementation of statistical and modeling techniques that were once prohibitively computationally expensive and have improved the rigor of models of emergent pathogens, particularly the quantification of uncertainty. Computationally intensive techniques that integrate across multiple predictive models (36) are leading to clear improvements in forecasting of established pathogens and may provide similar benefits for emergent pathogens. Likewise, with new techniques and enough computational power (though sometimes more than is currently available), essentially any probabilistic model construction can be fit to data. This allows researchers to combine often complex representations of the transmission process with techniques of statistical inference to estimate critical transmission properties while taking into account the large-scale uncertainty in the underlying transmission tree (37).

As models become more flexible and easier to fit, there is the promise of updating results in “real time” as the response to an emerging pathogen develops. Realizing this potential requires improvements in the way that data flow through health systems, as well as how data are combined and processed by models. Data must be updated in a timely manner, and forecasts and inferences must be sensibly adjusted as new data arrive and old data are modified. Rapid modeling exercises can be critical in making timely decisions and guiding interventions and field studies in a rapidly changing environment. For instance, models played a critical role in the design of vaccine trials during the Ebola outbreak (38). However, such real-time efforts remain sporadic and ad hoc.

The immunological landscape on which a pathogen emerges can have profound effects on its spread, with the immunologic imprint of related viruses potentially providing protection (39) or increasing disease severity in affected subpopulations (40). Immunological signatures also provide a marker of previous exposure and can reveal whether a pathogen is truly novel or has circulated undetected in human populations before. However, this immunological landscape has historically been part of the “dark matter” of epidemiologic information; serologic laboratory and analytic techniques have lagged behind developments in the molecular analysis of genomic data, and there are few sources of data on the preemergence immune status of populations. Establishment of a global serum bank, combined with improved methods for efficiently testing for a broad range of immunological markers, could provide an invaluable resource for responding to emergence events (41). However, a commensurate improvement to how such information is incorporated into disease models is also needed.


The ultimate achievement in modeling emergent pathogens would be to develop models of sufficient biological and ecological sophistication to identify emergent disease threats before they entered the human population. Identifying and sequencing previously unknown viruses is a necessary first step, but “virus hunting” activities are of limited utility without some way to assess which viruses pose a threat. We know generalities—for example, that viruses are more likely to jump between closely related species (42), or that RNA viruses’ rapid rate of mutation may make them more prone to emergence events than DNA viruses (43). And we can theoretically assess how the relationship between introduction frequency, R0, and the number of mutations needed to efficiently transmit among humans affects emergence rates (43). Yet, we lack the depth of understanding of the relation between genotype and phenotype to assess which viruses will spread and cause disease in the human population and which will not (44). However, our ability to observe “viral chatter” between human and animal populations is ever increasing and may soon lead to the breakthroughs needed to identify likely emerging threats.

Future emergence events

Global biosecurity depends on our ability to effectively confront emerging infectious disease threats. Mechanistic models, which capture our scientific understanding of disease processes, will continue to play an important role in assessing and responding to pathogen emergence. Although these methods have numerous limitations and pitfalls, and it may sometimes be difficult to tell good work from bad, they provide vital information to the global health response that is unavailable through other means. Continued methodological improvements that take advantage of new sources of data will increase the range and accuracy of inferences that can be made from leveraging infectious disease models. Tighter integration with public health practice and development of resources at the ready may increase the timeliness and quality of analyses to inform the public health response.

