## Abstract

A range of applications, from predicting the spread of human and electronic viruses to city planning and resource management in mobile communications, depend on our ability to foresee the whereabouts and mobility of individuals, raising a fundamental question: To what degree is human behavior predictable? Here we explore the limits of predictability in human dynamics by studying the mobility patterns of anonymized mobile phone users. By measuring the entropy of each individual’s trajectory, we find a 93% potential predictability in user mobility across the whole user base. Despite the significant differences in the travel patterns, we find a remarkable lack of variability in predictability, which is largely independent of the distance users cover on a regular basis.

When it comes to the emerging field of human dynamics, there is a fundamental gap between our intuition and the current modeling paradigms. Indeed, although we rarely perceive any of our actions to be random, from the perspective of an outside observer who is unaware of our motivations and schedule, our activity pattern can easily appear random and unpredictable. Therefore, current models of human activity are fundamentally stochastic (*1*) from Erlang’s formula (*2*) used in telephony to Lévy-walk models describing human mobility (*3*–*7*) and their applications in viral dynamics (*8*–*10*), queuing models capturing human communication patterns (*11*–*13*), and models capturing body balancing (*14*) or panic (*15*). Yet the probabilistic nature of the existing modeling framework raises fundamental questions: What is the role of randomness in human behavior and to what degree are individual human actions predictable? Our goal here is to quantify the interplay between the regular and thus predictable and the random and thus unforeseeable, probing through human mobility the fundamental limits that characterize the predictability of human dynamics.

At present, the most detailed information on human mobility across a large segment of the population is collected by mobile phone carriers (*4*, *16*–*21*). Mobile carriers record the closest mobile tower each time the user uses his or her phone. Here we use a 3-month-long record, collected for billing purposes and anonymized by the data source, capturing the mobility patterns of 50,000 individuals chosen from ~10 million anonymous mobile phone users with the criteria that they visit more than two locations (tower vicinity) during the observational period and that their average call frequency *f* is ≥0.5 hour^{−1} [(*22*) sections S1 and S2].

The trajectories of two users with widely different mobility patterns are shown in Fig. 1A: The first user moves in the vicinity of *N* = 22 towers in a 30-km region, whereas the second visits as many as *N* = 76 towers spanning approximately a 90-km neighborhood. To understand the recurrent nature of individual mobility, we assigned to each user a mobility network (*23*) (Fig. 1B), in which nodes are the locations visited by the user (each location corresponding to a mobile phone tower, with about a 3-km^{2} reception area on average, representing the uncertainty in our ability to determine the user’s whereabouts), and links represent the observed movements between these. The uneven node sizes, corresponding to the percentage of time the user spent in the vicinity of the particular tower, indicate that individuals tend to spend most of their time in a few selected locations. Finally, each mobility network has an associated dynamical pattern (Fig. 1C), capturing the temporal sequence of towers visited by the user.

Entropy is probably the most fundamental quantity capturing the degree of predictability characterizing a time series (*24*). We assign three entropy measures to each individual’s mobility pattern: (i) The random entropy *N _{i}* is the number of distinct locations visited by user

*i*, capturing the degree of predictability of the user’s whereabouts if each location is visited with equal probability; (ii) the temporal-uncorrelated entropy

*p*(

_{i}*j*) is the historical probability that location

*j*was visited by the user

*i*, characterizing the heterogeneity of visitation patterns; (iii) the actual entropy,

*S*, which depends not only on the frequency of visitation, but also the order in which the nodes were visited and the time spent at each location, thus capturing the full spatiotemporal order present in a person’s mobility pattern. To be specific, if

_{i}*i*was observed at each consecutive hourly interval, the entropy

*S*is given by

_{i}*P*(

*T*′) is the probability of finding a particular time-ordered subsequence

_{i}*T*′ in the trajectory

_{i}*T*[(

_{i}*22*) section

*S4*]. Naturally, for each user,

To calculate the real entropy *S _{i}*, we need a continuous (e.g., hourly) record of a user’s momentary location. Mobile phone records provide location information only when a person uses his or her phone. The users tend to place most of their calls in short bursts (

*11*–

*13*,

*25*) (Fig. 1D), followed by long periods with no call activity, during which we have no information about the user’s location (Fig. 1C). This incompleteness of the collected data is captured by the parameter

*q*, representing the fraction of hour-long intervals when the user’s location is unknown to us. As Fig. 1E shows,

*P*(

*q*) across our user base peaked around

*q*= 0.7, which indicated that, for a typical user, we have no location update for about 70% of the hourly intervals, which masks the user’s real entropy

*S*. We therefore studied the dependence of the entropy

_{i}*S*(

*q*) on the incompleteness

*q*, which allowed us to extrapolate the entropy to

*q*= 0. We tested the method’s accuracy on the trajectory of 100 users whose whereabouts were recorded every hour [(

*22*) section

*S4*] and found that it performed well for

*q*< 0.8, which represented 92% of the users in our data set. We therefore removed 5000 users with the highest

*q*from our data set, which ensured that all remaining 45,000 users satisfied

*q*< 0.8.

To characterize the inherent predictability across the user population, we determined *S _{i}*,

*S*

_{i}^{unc}, and

*S*

_{i}^{rand}for each user

*i*; the obtained

*P*(

*S*),

*P*(

*S*

^{unc}), and

*P*(

*S*

^{rand}) distributions are shown in Fig. 2A. The most striking result is the prominent shift of

*P*(

*S*) compared with

*P*(

*S*

^{rand}). Indeed,

*P*(

*S*

^{rand}) peaks at

*S*

^{rand}≈ 6, which indicates that, on average, each update of the user’s location represents six bits per hour of new information; that is, a user who chooses randomly his or her next location could be found on average in any of

*P*(

*S*) peaks at

*S*= 0.8 indicates that the real uncertainty in a typical user’s whereabouts is not 64 but 2

^{0.8}= 1.74, i.e., fewer than two locations.

The typical distances covered by individuals during their daily mobility pattern, as captured by each user’s radius of gyration, *r*_{g}, follows a fat-tailed distribution (*4*), which indicates that, although most individuals’ daily activity is confined to a limited neighborhood of 1 to 10 km, a few users regularly cover hundreds of kilometers (fig. S2). These differences suggest that predictability should also follow a fat-tailed distribution. In other words, we expect that individuals who travel less should be easy to predict (small entropy), whereas those with large *r _{g}* should be much less predictable (high entropy).

An important measure of predictability is the probability Π that an appropriate predictive algorithm can predict correctly the user’s future whereabouts. This quantity is subject to Fano’s inequality (*24*, *26*). That is, if a user with entropy *S* moves between *N* locations, then her or his predictability Π ≤ Π^{max}(*S*,*N*), where Π^{max} is given by *S* = *H*(Π^{max}) + (1 – Π^{max}) log_{2}(*N* – 1) with the binary entropy function *H*(Π^{max}) = –Π^{max} log_{2}(Π^{max}) – (1 – Π^{max}) log_{2}(1 – Π^{max}). For a user with Π^{max} = 0.2, this means that at least 80% of the time the individual chooses his location in a manner that appears to be random, and only in the remaining 20% of the time can we hope to predict his or her whereabouts. In other terms, no matter how good our predictive algorithm, we cannot predict with better than 20% accuracy the future whereabouts of a user with Π^{max} = 0.2. Therefore, Π^{max} represents the fundamental limit for each individual’s predictability.

We determined Π^{max} separately for each user in the database. To our surprise, we found that *P*(Π^{max}) does not follow the fat-tailed distribution suggested by the travel distances, but it is narrowly peaked near Π^{max} ≈ 0.93 (Fig. 2B). This highly bounded distribution indicates that, despite the apparent randomness of the individuals’ trajectories, a historical record of the daily mobility pattern of the users hides an unexpectedly high degree of potential predictability. We have also determined the maximal predictability Π^{unc} and the random predictability Π^{rand} extracted from *S*^{unc} and *S*^{rand}. As Fig. 2B shows, the result is strikingly different—*P*(Π^{unc}) is extremely widely distributed and peaked at Π^{unc} ~ 0.3, which indicates that, if we rely only on the heterogeneous spatial distribution, the predictability across the whole population is insignificant and varies widely from person to person. Similarly, *P*(Π^{rand}) has a peak at Π^{rand} = 0, which suggests not only that Π^{rand} and Π^{unc} are ineffective as predictive tools, but also that a significant share of predictability is encoded in the temporal order of the visitation pattern.

How can we reconcile the wide variability in the observed travel distances, as captured by the fat-tailed *P*(*r _{g}*), with the highly bounded predictability observed across the user population? To answer this, we measured the dependency of Π

^{max}on

*r*, and found that, for

_{g}*r*≥ 10 km, predictability becomes largely independent of

_{g}*r*, saturating at Π

_{g}^{max}≈ 0.93 (Fig. 2C). Therefore, Fig. 2C explains the failure of our earlier hypothesis: Individuals with

*r*≥ 100 km, covering hundreds of kilometers on a regular basis, are just as predictable as those whose life is constrained to a

_{g}*r*≈ 10-km neighborhood, a saturation that lies behind the high predictability observed across the whole user base.

_{g}To determine how much of our predictability is really rooted in the visitation patterns of the top locations, we calculated the probability *n* most visited locations, where *n* = 2 typically captures home and work. Thus, ^{max}, as, even if our predictive algorithm is 100% accurate, it can foresee the future location only when the user is found in one of the top *n* locations monitored by the algorithm. As Fig. 2D shows, the top two locations (*n* = 2) offer only a 60% overall predictability. Gradually adding more locations increases

To understand the origin of the observed high potential predictability, we segmented each week into 24 × 7 = 168 hourly intervals, and within each hour, we identified for each user the most visited location (Fig. 3, A and B). For example, if, between 8 and 9 a.m. on Monday, a user was found 10 times at tower 1, twice at tower 2, and once at tower 3, we assumed that her most likely location during this hour will be tower 1. Next, we measured each user’s regularity, *R*, defined as the probability of finding the user in his most visited location during that hour. *R* represents a lower bound for predictability Π, as it ignores the temporal correlations in user mobility. We found that across the whole user base, *R* ≈ 0.7, which meant that, on average, 70% of the time the most visited location coincides with the user’s actual location. The pattern is time dependent: During the night, when most people tend to be reliably at home, *R* peaks at ≈ 0.9, but between noon and 1 p.m. and between 6 and 7 p.m., *R* has clear minima, corresponding to transition periods (travel to lunch or home). Indeed, if we measure the total number of distinct locations *N*(*t*) a user visited each hour (Fig. 3B), we find that moments of low regularity *R* correspond to significant increase in *N*(*t*), a signature of high mobility, and when *R* peaks there is a drop in *N*(*t*).

If the users were to move randomly between their *N* locations, then *R*^{rand} = 1/*N*, which is *R* ≈ 0.7. This gap once again indicates that the high regularity characterizing each user’s mobility represents a significant departure from the expectation that they will be random. In Fig. 3C, we plot the relative regularity *R*/*R*^{rand} as a function of *r _{g}*, observing a clearly increasing tendency. That is, counterintuitively, the relative regularity of users who travel the most (i.e., have high

*r*

_{g}) is higher than the relative regularity of the more homebound individuals.

To explore whether demographic factors influence the users’ regularity and predictability, we measured *R* and Π^{max} for different age and gender groups (fig. S10). It was surprising that we did not observe gender- or age-based differences in Π^{max}, but only a systematic, but statistically insignificant, gender-based difference emerged in regularity. We also explored the impact of home, language groups, population density, and rural versus urban environment on predictability and found only insignificant variations (figs. S8, S12, and S13). Finally, we did not find significant changes in user regularity over the weekends compared with their weekday mobility (fig. S8), which suggested that regularity is not imposed by the work schedule, but potentially is intrinsic to human activities.

In summary, the combination of the empirically determined user entropy and Fano’s inequality indicates that there is a potential 93% average predictability in user mobility, an exceptionally high value rooted in the inherent regularity of human behavior. Yet it is not the 93% predictability that we find the most surprising. Rather, it is the lack of variability in predictability across the population. Indeed, given the fat-tailed distribution of the distances over which users travel on a regular basis (see fig. S2), most individuals are well localized in a finite neighborhood, but a few travel widely. Furthermore, a number of demographic and external parameters, from age to population density and the number of towers visited, vary widely from user to user. It is not unreasonable to expect, therefore, that predictability should also vary widely: For people who travel little, it should be easier to foresee their location, whereas those who regularly cover hundreds of kilometers should have a low predictability. Despite this inherent population heterogeneity, the maximal predictability varies very little—indeed *P*(Π^{max}) is narrowly peaked at 93%, and we see no users whose predictability would be under 80%.

Although making explicit predictions on user whereabouts is beyond our goals here, appropriate data-mining algorithms (*19*, *20*, *27*) could turn the predictability identified in our study into actual mobility predictions. Most important, our results indicate that when it comes to processes driven by human mobility, from epidemic modeling to urban planning and traffic engineering, the development of accurate predictive models is a scientifically grounded possibility, with potential impact on our well-being and public health. At a more fundamental level, they also indicate that, despite our deep-rooted desire for change and spontaneity, our daily mobility is, in fact, characterized by a deep-rooted regularity.

## Supporting Online Material

www.sciencemag.org/cgi/content/full/327/5968/1018/DC1

Materials and Methods

SOM Text

Figs. S1 to S13

References

## References and Notes

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
Materials and methods are available as supporting material on
*Science*Online. - ↵
- ↵
- ↵
- ↵
- ↵
- We thank M. Gonzales, P. Wang, J. Bagrow, and J. A. Aslam for discussions and comments on the manuscript. This work was supported by the James S. McDonnell Foundation 21st Century Initiative in Studying Complex Systems, NSF within the Information Technology Research (DMR-0426737) and IIS-0513650 programs, the Defense Threat Reduction Agency award HDTRA1-08-1-0027, and the Network Science Collaborative Technology Alliance sponsored by the U.S. Army Research Laboratory under agreement number W911NF-09-2-0053. Z.Q. was supported by a fellowship from the China Scholarship Council.