## Predicting Disease Dissemination

In combating the global spread of an emerging infectious disease, answers must be obtained to three crucial questions: Where did the disease emerge? Where will it go next? When will it arrive? **Brockmann and Helbing** (p. 1337; see the Perspective by **McLean**) analyzed disease spread via the “effective distance” rather than geographical distance, wherein two locations that are connected by a strong link are effectively close. The approach was successfully applied to predict disease arrival times or disease source using data from the the 2003 SARS viral epidemic, 2009 H1N1 influenza pandemic, and the 2011 foodborne enterohaemorrhagic *Escherichia coli* outbreak in Germany.

## Abstract

The global spread of epidemics, rumors, opinions, and innovations are complex, network-driven dynamic processes. The combined multiscale nature and intrinsic heterogeneity of the underlying networks make it difficult to develop an intuitive understanding of these processes, to distinguish relevant from peripheral factors, to predict their time course, and to locate their origin. However, we show that complex spatiotemporal patterns can be reduced to surprisingly simple, homogeneous wave propagation patterns, if conventional geographic distance is replaced by a probabilistically motivated effective distance*.* In the context of global, air-traffic–mediated epidemics, we show that effective distance reliably predicts disease arrival times. Even if epidemiological parameters are unknown, the method can still deliver relative arrival times. The approach can also identify the spatial origin of spreading processes and successfully be applied to data of the worldwide 2009 H1N1 influenza pandemic and 2003 SARS epidemic.

The geographic spread of emergent infectious diseases affects the lives of tens of thousands or even millions of people (*1*, *2*). Recent examples of emergent diseases are the SARS epidemic of 2003, the 2009 H1N1 influenza pandemic, and most recently a new strain (H7N9) of avian influenza virus (*3*, *4*). Progressing worldwide urbanization, combined with growing connectivity among metropolitan centers, has increased the risk that highly virulent emergent pathogens will spread (*5*–*8*). The complexity of global human mobility, particularly air traffic (Fig. 1A), makes it increasingly difficult to develop effective containment and mitigation strategies on the time scale imposed by the speed at which modern diseases can spread (*9*–*11*). Because timely, accurate, and focused action can potentially save the lives of many people and reduce the socio-economic impact of infectious diseases (*12*, *13*), understanding global disease dynamics has become a major 21st-century challenge. Unraveling the core mechanisms that underlie these phenomena and being able to distinguish key factors from peripheral ones are required to develop quantitative, efficient, and predictive models that public health authorities can employ to assess situations quickly, make informed decisions, and optimize vaccination and drug delivery plans. After the initial outbreak of an epidemic, the key questions are as follows: (i) Where did the novel pathogen emerge? (ii) Where are new cases to be expected? (iii) When is an epidemic going to arrive at distant locations? (iv) How many cases are to be expected?

Historically, for cases like the spread of the Black Death in Europe, reaction-diffusion models have been quite useful in addressing these questions (*14*, *15*). Despite their high level of abstraction, these models provide a solid intuition and understanding of spreading processes. Their mathematical simplicity permits the assessment of key properties, e.g., spreading speed, arrival times, and how pattern geometry depends on system parameters (*16*). However, because of long-distance travel, simple reaction-diffusion models are inadequate for the description of today’s complex, spatially incoherent spreading patterns that generically bear no metric regularity, that depend sensitively on model parameters and initial conditions (*17*–*20*) (Fig. 1, B to E, and fig. S2).

Consequently, scientists have been developing powerful, large-scale computational models and sophisticated, parameter-rich epidemic simulators that tackle the above key questions in detailed ways. These consider demographics, mobility, and epidemiological data, as well as disease-specific mechanisms, all of which are believed to play a role (*21*–*23*). Models range from high-level stochastic metapopulation models (*5*, *20*, *24*) to agent-based computer simulations that account for the behavior and interactions of millions of individuals in large populations (*25*). These approaches have become remarkably successful in reproducing observed patterns and predicting the temporal evolution of ongoing epidemics (*26*). Many such models reproduce similar dynamic features despite major differences in their underlying assumptions and data (*27*). The abundance of different, often mutually incompatible, models suggests that we still lack a fundamental understanding of the key factors that determine the observed spatiotemporal dynamics. It is unclear how the multitude of factors shape the dynamics and how much detail is required to achieve a certain level of predictive fidelity. Moreover, detailed computational models that incorporate all potentially relevant factors ab initio do not inform which factors are actually relevant and which ones are not (*28*). They are also hard to calibrate and of limited use when the knowledge of epidemiological parameters is uncertain.

Here, we propose an intuitive and efficient approach that remedies the situation by connecting the conceptual power of simple reaction-diffusion systems with the predictive power of high-level, computational models. Our approach is based on the idea of replacing conventional geographic distance by a measure of effective distance* *derived from the underlying mobility network. Based on this novel notion of distance, patterns that exhibit complex spatiotemporal structure in the conventional geographic perspective turn into regular, wavelike solutions reminiscent of simple reaction-diffusion systems. This permits the definition of effective epidemic wavefronts, propagation speeds, and the reliable estimation of epidemic arrival times, based on the knowledge of the underlying mobility network. The method, however, goes beyond remapping data. It provides two key insights. First, epidemiological parameters enter the spreading dynamics separately from the transport parameters, and second, the dynamics is dominated by only a small percentage of transport connections. Furthermore, our approach can quickly identify the geographic origin of emergent diseases, using temporal snapshots of the spatial disease distribution. This detection of the origin of complex, multiscale dynamical spreading patterns is important for three reasons: (i) to determine what has caused the disease, (ii) to develop timely mitigation strategies, and (iii) to predict its further spread (the arrival times in remote locations and the expected prevalence).

## Modeling Network-Driven Contagion Phenomena

For illustration, we consider a complex network of coupled populations (a metapopulation) in which the local disease time course is described by a conventional susceptible-infected-recovered (SIR) dynamics (*1*):

where *N _{n}* is the population size of population

*n*,

*M*is the number of populations, and

*S*,

_{n}*I*, are absolute numbers of susceptible, infected, and recovered individuals, respectively. Parameter β is the mean recovery rate of individuals (for influenza-like diseases β

_{n}^{–1}= 3 to 5 days), and

*R*

_{0}= α/β is the basic reproduction ratio (for which we assume typical values in the range 1.4 to 2.9). (The focus on SIR kinetics is not essential, as the following results are also valid for other types of local dynamics.) Each local population represents a node

*n*in the global mobility network (GMN), depicted in Fig. 1A. In addition to the local dynamics, individuals travel between nodes according to the rate equation

where *U _{n}* is a placeholder for the classes

*S*,

_{n}*I*, and

_{n}*R*. The quantities

_{n}*w*=

_{nm}*F*/

_{nm}*N*represent the per-capita traffic flux from location

_{m}*m*to

*n*. Weighted links

*F*quantify direct air traffic (passengers per day) from node

_{nm}*m*to node

*n*. The GMN is constructed from the worldwide air traffic between 4069 airports with 25,453 direct connections. Details on the data and network construction are provided in the supplementary materials (e.g., fig. S1 and table S1) (

*5*,

*13*,

*20*,

*29*). The total network traffic is approximately passengers per day. Assuming that the total traffic in and out of a node is proportional to its population size, Eqs. 1 and 2 can be rewritten as

with *s _{n}* =

*S*/

_{n}*N*,

_{n}*j*=

_{n}*I*/

_{n}*N*, and

_{n}*r*= 1 –

_{n}*s*–

_{n}*j*. A detailed derivation is provided in the supplementary text. The mobility parameter γ is the average mobility rate, i.e., , where is the total population in the system. This yields numerical values in the range γ = 0.0013 – 0.0178 day

_{n}^{–1}. The matrix with 0 ≤

*P*≤ 1 quantifies the fraction of the passenger flux with destination

_{mn}*m*emanating from node

*n*, i.e.,

*P*=

_{mn}*/*

_{}F_{mn}*F*, where . The additional sigmoid function with gain parameter η >> 0 accounts for the local invasion threshold ε and fluctuation effects for

_{n}*j*< ε (

_{n}*30*–

*32*). Typical parameter choices for ε and η are and . Our results are robust with respect to changes in these parameters (e.g., figs. S5 and S13).

Figure 1B shows a temporal snapshot of the dynamical system defined by Eq. 3 for a hypothetical pandemic with initial outbreak location (OL) in Hong Kong (HKG) (see also Fig. 2B and fig. S2 for temporal sequences of the dynamical system for various other OLs). Generally, the metapopulation model above and related models used in the past generate solutions that are characterized by similar qualitative features. First, only during the early stage of the process does the prevalence *j _{n}*(

*t*) (i.e., the fraction of infected individuals) correlate significantly with geographic distance from the OL. Second, at intermediate and later stages, the multiscale structure of the GMN induces a spatial decoherence of the spreading pattern. Third, despite the global connectivity, the spatiotemporal patterns do not converge to the same pattern, i.e., spatiotemporal differences are not a transient effect (figs. S3 to S6 and movies S1 to S3). This type of complexity sharply contrasts the generic behavior of ordinary reaction-diffusion systems, which typically exhibit spatially coherent wavefronts.

## Most Probable Paths and Effective Distance

The key idea we pursue here is that, despite the structural complexity of the underlying network, the redundancy of connections, and the multiplicity of paths a contagion phenomenon can take, the dynamic process is dominated by a set of most probable trajectories that can be derived from the connectivity matrix **P**. This hypothesis is analogous to the dominance of the smallest resistor in a strongly heterogeneous electrical network with parallel conducting lines. Given the flux-fraction 0 ≤ *P _{mn}* ≤ 1, i.e., the fraction of travelers that leave node

*n*and arrive at node

*m*, we define the effective distance

*d*from a node

_{nm}*n*to a connected node

*m*as

This concept of effective distance reflects the idea that a small fraction of traffic *n*→*m* is effectively equivalent to a large distance, and vice versa. As explained in more detail in the supplementary text, the logarithm is a consequence of the requirement that effective lengths are additive, whereas probabilities along multistep paths are multiplicative. Eq. 4 defines a quasi-distance, which is generally asymmetric, i.e., . The lack of symmetry is analogous to a road network of one-way streets, where the shortest distance from A to B may differ from the one from B to A. This asymmetry captures the effect that a randomly seeded disease in a peripheral node of the network has a higher probability of being transmitted to a well-connected hub than vice versa (figs. S8 to S10). More properties of effective distance as defined by Eq. 4 are discussed in the supplementary text. On the basis of effective distance, we can define the directed length of an ordered path as the sum of effective lengths along the legs of the path. Moreover, we define the effective distance *D _{mn}* from an arbitrary reference node

*n*to another node

*m*in the network by the length of the shortest path from

*n*to

*m*:

Again, we typically have . From the perspective of a chosen origin node *n*, the set of shortest paths to all other nodes constitutes a shortest path tree Ψ* _{n}* (Fig. 2A), illustrating the most probable sequence of paths from the root node

*n*to the other nodes.

## Effective Distance Perspective Reveals Hidden Pattern Geometry

The key question is how, compared to the conventional geographic representation, the same spreading process evolves on the shortest path tree. Figure 2B portrays this comparison. We see that the effective distance representation has notable advantages: It reveals simple coherent wave fronts, whereas spatiotemporal patterns in geographical space are complex, incoherent, and hard to understand. This is a generic feature that is robust against variations in epidemic parameters and true for any choice of the OL (figs. S11 and S12). Using effective distance, one can thus calculate the spreading speed and arrival times of a disease, and determine functional relationships between epidemiological and mobility parameters. The dynamic simplicity in the new representation is much more than just a trivial visual rearrangement of the spatiotemporal pattern. Simple propagating waves in the new perspective imply that the contagion process is dominated by most probable paths, as this is the underlying assumption in the derivation of Eq. 5. Also, effective distance and the shortest path trees only depend on the static mobility matrix **P**. This implies that, on a spatial scale described by the metapopulation model (Eq. 3), the complexity of the spatiotemporal pattern is largely determined by the structure of the mobility component in Eq. 3 and not by the nonlinearities or the disease-specific, epidemiological rate parameters of the model.

Figure 2C presents the correlation of arrival times *T*_{a} with effective distances *D*_{eff} for the simulation shown in Fig. 2B. Compared to Fig. 1C, this demonstrates that effective distance generates a much higher correlation than geographic distance ( compared to ; see tables S2 and S3 and fig. S12 for more examples). Furthermore, the relationship of *T*_{a} and *D*_{eff} is linear, which means that the effective speed *v*_{eff} = *D*_{eff}/*T*_{a} of the wavefront is a well-defined constant. To compare the regression quality, we computed the distribution of relative residuals *r* = δ*T*_{a}/*T*_{a}, using effective or geographic distance as a regressor. The ratio of residual variances implies a more than 50-fold higher prediction quality (table S3 and fig. S13).

Although we have demonstrated the clear linear functional relationship for simulated, hypothetical scenarios of global disease spread, it is crucial to test the validity and usefulness of the effective distance approach on empirical data. Figure 2, D and E, depict arrival time versus effective distance on the basis of data for the 2009 H1N1 pandemic and the global 2003 SARS epidemic, respectively (figs. S14 to S16 and table S4). Arrival times are the same as in Fig. 1, D and E, but shown across effective rather than geographic distances. As the empirical data are available on a country resolution, we determined the traffic between countries by aggregation to specify a coarse-grained network (GMNc) (189 nodes, 5004 links) and effective distances from the origin location in each case (see supplementary text for details). Both the H1N1 and SARS data exhibit a clear linear relationship between arrival time and effective distance from the source, even though additional factors complicate the spreading of real diseases. Fluctuations, effects due to coarse graining, and errors in arrival-time measurements can add noise to the system, which increases the scatter in the linear relationship. To address the general validity of the observed effects, we also analyzed data generated by the global epidemic and mobility model (GLEAM) (www.gleamviz.org), a sophisticated epidemic simulation framework (*21*). GLEAM incorporates air transportation and local commuter traffic on a global scale, is fully stochastic, and permits the simulation of infectious state–dependent mobility behavior, clinical states, antiviral statement, and more. The results of this analysis are shown in figs. S17 to S19 and are consistent with our claims.

## Relative Arrival Times Are Independent of Epidemic Parameters

Our results reveal an important, approximate relationship between the system parameters, which can be summarized as follows:

(6)This equation states that arrival times can be computed with high fidelity based on the effective distances *D*_{eff} and effective spreading speed *v*_{eff}, and that each factor depends on different parameters of the dynamical system. The epidemiological parameters determine the effective speed, whereas effective distance depends only on the topological features of the static underlying network, i.e., the matrix **P**. When confronted with the outbreak of an emergent infectious disease, one of the key problems is that the disease-specific parameters are typically unknown in the beginning, and simulations based on plausible parameter ranges typically exhibit substantial variability in predicted outcomes. However, Eq. 6 allows us to compute relative arrival times without knowledge of these parameters. If, for example, the outbreak node is labeled *k*, while *n* and *m* are arbitrary nodes, then *T*_{a}(*n*|*k*)/*T*_{a}(*m*|*k*) = *D*_{eff}(*n*|*k*)/*D*_{eff}(*m*|*k*). Equation 6 states that the effective speed *v*_{eff} is a global property, independent of the mobility network and the outbreak location. Thus, irrespective of mobility and OL, one can investigate how the effective speed depends on rate parameters of the system.

## Origin of Outbreak Reconstruction Based on Effective Distance

The concept of effective distance is particularly valuable for solving the aforementioned inverse problem: Given a spatially distributed prevalence pattern that was generated by an underlying, and potentially hidden, spreading mechanism, how can we determine the most likely initial outbreak location (*33*, *34*)? A crucial property of the effective distance perspective is that spreading concentricity (Fig. 2B) is only observed from the perspective of the actual OL. Therefore, given a spatiotemporal snapshot of the spreading dynamics (Fig. 3A), we can represent the spreading process from the perspective of candidate OLs (Fig. 3C) and determine the degree of concentricity of the pattern in effective distance. Compared to alternative OL candidates, the representation for the actual OL exhibits the most concentric shape, identifying this node as the correct outbreak location. This type of qualitative analysis can be made more systematic by introducing a measure for concentricity. We investigated two conceptually different approaches. First, for every one of the 4069 potential outbreak locations *n*, we computed the shortest path tree Ψ* _{n}*, the effective distance to all other locations

*D*

_{eff}(

*m*|

*n*), and arrival times

*T*

_{a}(

*m*|

*n*). For each candidate node, we computed the correlation coefficient

*c*(

*T*

_{a},

*D*

_{eff}) of effective distance and arrival time for candidate location

*n*. This approach should yield the highest correlation when

*n*is the actual outbreak location. As expected, this is the case for the H1N1 and SARS data sets (Fig. 4, A and B). However, this approach requires knowledge of the entire time course of the epidemic, e.g., arrival times at all locations, which is typically not available in real situations. Therefore, we used an alternative approach, mathematically similar to surface roughness characterization (

*35*), that only requires dynamic information in a small time window, e.g., one snapshot of the spreading pattern. For each of the potential candidate outbreak locations, we computed the effective distance to the subset of nodes with prevalence above a certain threshold, e.g., the red symbols in the patterns of Fig. 3A or Fig. 1B. On the basis of this set of effective distances (denoted by

*F*), we compute the mean μ

_{F}(

*D*

_{eff}) and standard deviation σ

_{F}(

*D*

_{eff}). Concentricity increases with a combined minimization of mean and standard deviation (supplementary text). Figure 4C depicts the distribution of ensemble-normalized pairs [μ

_{F}(

*D*

_{eff}), σ

_{F}(

*D*

_{eff})] for a simulated scenario at four different times. For all time points, the actual outbreak location is well separated from the remaining point cloud and closest to the origin. This shows that the effective distance perspective is unique from the actual outbreak location and that knowledge of a temporal snapshot of the spreading state combined with knowledge of the underlying mobility network is a powerful tool for outbreak reconstruction.

Although our method works well for simulation-generated data (i.e., disease dynamics generated by Eq. 3), real data pose additional challenges: (i) Data are subject to inaccuracies and incompleteness in prevalence counts; (ii) fluctuations, not captured explicitly by our model, may play a particular role during the onset of an epidemic; and (iii) response and mitigation measures that can change the time course of disease dynamics are not accounted for by our model. Therefore, to assess the applicability of our approach in a realistic context, we validated the effective distance method using data on the 2009 H1N1 pandemic and the 2011 outbreak of food-borne enterohemorrhagic *Escherichia coli* (EHEC) O104:H4/HUS in Germany with ~4000 cases and 53 deaths. Although the application to the H1N1 pandemic is a proof-of-concept application, as finding the spatial origin on a country resolution was actually not the problem that we investigated here, reconstructing the spatial outbreak origin during the EHEC-HUS epidemic (district Uelzen in Northern Germany) was notoriously difficult because of the spatial incoherence of reported cases. For the application to H1N1, we used data of the worldwide prevalence count by country in weeks 14 to 30 of 2009 (*36*). For EHEC-HUS, we constructed a network of food distribution in Germany using a gravity model for transportation networks (*37*). For the spatial prevalence, we used data on case counts per district in Germany (*38*).

Figure 4D illustrates the results for H1N1. In analogy to Fig. 4C, we use four distinct time windows (weeks 24, 26, 27, and 29). The method successfully identifies Mexico as the source of this event, even though the time windows cover a 2-month period when the pandemic’s peak prevalence had already reached a broad geographical distribution (fig. S16). Only as late as week 29, another country (the United States) is incorrectly identified as the likely outbreak location.

Figure 4E depicts the analogous results for the 2011 EHEC-HUS epidemic, where disease spreading was promoted not by air transportation, but by food transport. The nodes in the network are 412 administrative districts in Germany, coupled by the food supply network of the country. As time windows, we chose weeks 3 to 6 after onset. For this epidemic, a local farm in Bienenbüttel, district Uelzen, was later identified as the source of contaminated sprouts (*38*). On the basis of prevalence distribution in the entire country, the effective distance method correctly identifies district Uelzen as the most likely geographic source. However, in this case, the separation in the mean/standard deviation diagram is not as pronounced as for disease spread by air passenger flows. Nevertheless, although the method cannot identify the OL with full reliability here, it dramatically reduces the set of potential origin locations.

In both real-world scenarios—the 2011 EHEC/HUS epidemic and the 2009 H1N1 pandemic—the OL reconstruction works surprisingly well, despite the intrinsic fluctuations and the low-incidence regime. The unexpected degree of predictability indicates that the set of links in the network contributing the shortest paths accumulates a substantial fraction of the overall transmission probability (and that this set is almost identical from the perspective of all nodes; fig. S20).

## Discussion

In summary, the analysis of global disease dynamics in the framework of effective distances enables researchers to understand complex contagion dynamics in multiscale networks with simple reaction-diffusion models. Given fixed values for epidemic parameters, our analysis shows that network and flux information are sufficient to predict the dynamics and arrival times. The method is a promising starting point for more detailed investigations, including the functional dependencies of key epidemic variables such as the spreading speed and related macroscopic quantities on epidemiological parameters. The successful application to real epidemic data suggests that our method is also of practical use. Finally, it seems promising to generalize the effective distance method to other contagion phenomena, such as human-mediated bioinvasion and the spread of rumors or violence, a subject of ever-more importance in an increasingly connected society.

## Supplementary Materials

www.sciencemag.org/content/342/6164/1337/suppl/DC1

Supplementary Text

Figs. S1 to S20

Tables S1 to S4

Movies S1 to S3

References (*40*–*45*)

## References and Notes

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
**Acknowledgments:**D.B. thanks W. Kath, D. Grady, W. H. Grond, O. Woolley-Meza, and A. Bentley for fruitful discussions and comments and W. Moers, B. May, and FuturICT for inspiration. We thank R. Brune for contributions during the early phase of the project and for work on the origin reconstruction and C. Thiemann for the development of the SPaTo network visualization tool (www.spato.net). Global mobility data was provided by OAG (www.oag.com), prevalence data of H1N1 and SARS by the WHO (www.who.int), and EHEC data by the RKI-Survstat (www3.rki.de/SurvStat/). This work was supported by the Volkswagen Foundation (project: “Bioinvasion and epidemic spread in complex transportation networks”) and partially supported by the ETH project “Systemic Risks, Systemic Solutions” (CHIRP II project ETH 48 12-1).