## Abstract

Because the Internet is a public good and its numerous users are not charged in proportion to their use, it appears rational for individuals to consume bandwidth greedily while thinking that their actions have little effect on the overall performance of the Internet. Because every individual can reason this way, the whole Internet's performance can degrade considerably, which makes everyone worse off. An analysis of the congestion created by such dilemmas predicts that it is intermittent in nature, with definite statistical properties, taking the form of short-lived spikes. Internet latencies were measured over a wide range of conditions and locations and were found to confirm these predictions, thus providing a possible microscopic mechanism for the observed intermittent congestion of the Internet.

The explosive growth of the Internet has led to the emergence of a computational network that covers the globe, with a staggering number of users connected at any given time to each other, browsing World Wide Web (WWW) pages, sending messages, transferring files, and searching for information on a range of topics that defy classification. At the time this report was written, it was estimated that there are 80 million WWW pages in existence, with only half of them indexed at any given time by the best search engines (1, 2).

Although this phenomenal growth has created exciting opportunities for information access, economic transformation, and interactivity on a global scale, it has also revealed that the Internet is not immune to the problems intrinsic to any public good, as congestion sometimes threatens to render it useless. Like consumers of a natural resource or drivers during rush hour, Internet users and, in particular, “surfers” are faced with a social dilemma of the type exemplified by the well-known tragedy of the commons (3). Because Internet surfers are not charged in proportion to their use, it often appears rational for them to consume bandwidth greedily while thinking that their actions have little effect on the overall performance of the network. Because every individual can reason this way, the whole Internet's performance can degrade considerably, which makes everyone worse off. But as users experience a congested network, they reduce or even cease their use, freeing up bandwidth that can once again be consumed, and so on.

Given this scenario and the fact that Internet surfing is completely uncoordinated, one might then expect that over time, the resulting latencies would fluctuate around some mean value determined by the capacity and protocols of the network. A study of these latencies within the context of standard models of social dilemmas, however, led to surprising results. Although most of the time the available bandwidth is not completely taken up by greedy behavior, we show that uncertainties about the state of the network lead to sudden spikes of congestion, sometimes called Internet “storms” (4), which subside when slowdowns become intolerable and users reduce or end their surfing. Moreover, we demonstrate that these intermittent spikes in congestion possess non-Gaussian statistical properties. Experiments in which we measured Internet latencies over a wide range of conditions and locations yielded data that corroborate these predictions. Thus, the unfolding of a social dilemma provides a possible microscopic mechanism for the observed intermittent congestion of the Internet.

In the simplest case of a social dilemma (5-7), users try to maximize the utility of a resource, which in this case is proportional to the speed with which they can access remote information. They are considered cooperators if they restrain their use or defectors if they consume bandwidth greedily. Collective benefits increase linearly in the number of cooperators, at a rate b per cooperating member (8). In the context of the Internet, b is proportional to the speed (or inverse latency) with which remote sites can be accessed, a quantity that exhibits fluctuations over many time scales. Each contributing individual bears a personal cost, c, that reflects the penalty paid for delaying access to the information needed. Let k_{i} denote whether member i is cooperating (k_{i} = 1) or defecting (k_{i} = 0). Then the utility U at time t for member i is given by

U_{i}(t) = bf − ck_{i}    (1)

where the fraction of users cooperating, f = n/N, is the number of users cooperating divided by the size of the user community.

When all members contribute successfully, each receives net benefits b − c, which are independent of the group size. The production of the collective good represented by the bandwidth of the Internet becomes a social dilemma when b > c > b/N. In this regime, although the good of all is maximized when everyone refrains from overuse (b − c > 0), the dominant strategy in a one-time session on the Internet is to defect, because the marginal benefit of personal restraint is less than the personal cost (b/N < c).
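The payoff structure above can be sketched numerically. The values of b, c, and N below are illustrative choices satisfying b > c > b/N, not figures from the text.

```python
def utility(b, c, f, k):
    """Eq. 1: each member receives benefit b*f from the fraction f of
    cooperators and pays cost c only if cooperating (k = 1)."""
    return b * f - c * k

b, c, N = 1.0, 0.5, 100       # illustrative values with b > c > b/N
n = 60                         # current number of cooperators

# If everyone cooperates, each nets b - c, independent of group size:
assert utility(b, c, 1.0, 1) == b - c

# A single cooperator who defects saves the cost c but lowers the
# shared benefit by only b/N, so one-shot defection is dominant
# whenever b/N < c:
u_coop = utility(b, c, n / N, 1)
u_defect = utility(b, c, (n - 1) / N, 0)
assert u_defect > u_coop
```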

This logic changes when the interaction is ongoing, because expectations about the future will also influence an individual's decision to use up bandwidth. To study the dynamics of collective action, we used an equation that governs the transitions between cooperation and defection, in which the average fraction of users who cooperate, f, is determined by

df/dt = α[ρ(f) − f]    (2)

where α is the average Poisson rate at which users reevaluate their preferences. The function ρ(f) is the average probability that defecting users will switch to a cooperative strategy (9), and so it contains the effect of users' expectations on their immediate decisions. It has been shown that for collective action problems in which many agents participate, this averaged mean-field equation is very accurate (10) and is therefore appropriate for a large system such as the Internet.

Thus, in this conception, users monitor the level of utility and, if latencies are noticeable, decide to switch their behavior between effective cooperation and defection by considering the cost of postponing their current activity as well as their expectations of the future state of the network. Specifically, users who experience congestion tend to believe that others do as well and expect that if they try again later, the network is likely to be less congested. Similarly, users who experience little congestion will tend to take advantage of it and consume as much bandwidth as they require. Thus, because users believe that other users behave similarly to them, they expect that if the network is heavily congested now, latencies will be lower later. Cooperation will be followed by cooperation and defection will be followed by defection on the part of other users. In addition, there is always uncertainty in the users' estimate of the current state of the Internet, which is parameterized by the width σ of a Gaussian deviate. Taking account of these factors, ρ(f) = 1/2{1 + erf [(1 − f − f_{crit})/σ]}, where erf denotes the error function, f_{crit} = (1/Hα)[(Nc − b)/(b − c)], and H is the rate at which users discount future utility (11).
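The switching probability can be transcribed directly; the parameter values below (σ, H, α, and the rest) are illustrative, not taken from the text.

```python
from math import erf

def f_crit(N, b, c, H, alpha):
    """Critical fraction from the text: (1/(H*alpha)) * (N*c - b)/(b - c)."""
    return (1.0 / (H * alpha)) * ((N * c - b) / (b - c))

def rho(f, fc, sigma):
    """Average probability that a defecting user switches to cooperation."""
    return 0.5 * (1.0 + erf((1.0 - f - fc) / sigma))

fc = f_crit(N=100, b=1.0, c=0.5, H=1200.0, alpha=1.0)

# rho is a probability, and it falls as the cooperating fraction rises:
assert 0.0 <= rho(0.2, fc, 0.15) <= 1.0
assert rho(0.2, fc, 0.15) > rho(0.9, fc, 0.15)
```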

Equation 2 determines the average dynamics of the system when there is a diversity of user preferences (12). Its fixed points, obtained by setting the time derivative to zero, correspond to equilibrium situations in which most users are defecting or cooperating. Because the Internet does in fact work most of the time, we can consider its equilibrium state to be effectively one of cooperation, even though most individuals do not explicitly perceive their behavior in those terms. However, even in equilibrium, occasional fluctuations take place whereby individuals might switch over short periods of time from nonuse or restrained use to heavy use and vice versa. These switches are due to different levels of need and impatience in obtaining information, to the uncertainty that individuals have about the congestion of the network, and to random changes in the environment of the Internet. In particular, the random environment can be thought of as a Gaussian speed fluctuation, β, added to a mean performance benefit, so that b = b̄ + β(t), where 〈β(t)〉 = 0 and 〈β(t)β(t′)〉 = 2σ′δ(t − t′), with σ′ such that the changes are small compared to the mean.

Each of these uncertainties can cause an individual to mistakenly perceive the total amount of Internet use to be different from its actual value. Because of this misperception, the user might act against the equilibrium condition, causing the system to move away from the fixed point. The more uncertainty there is in the system, the more likely it is that there will be fluctuations away from the equilibrium state.

When fluctuations in the levels of restraint take place, the group recovers in time according to an equation that governs the dynamics of fluctuations away from the equilibrium state. With a small deviation from equilibrium denoted by δ, the fraction of individuals refraining from use at any given time will be given by f(t) = f_{o} − δ(t) with δ ≪ f_{o}, where, in this case, we consider a cooperative equilibrium for which f_{o} ≈ 1. Replacing this expression in Eq. 2 and using the stochastic variable b(t) in the expression for f_{crit}, we can perform a linear stability analysis around the fixed point to derive an equation for the dynamics of fluctuations around the cooperative equilibrium. One obtains

dδ(t)/dt = −Aδ(t) + B(t)δ(t)    (3)

where the constant A = α[1 + (1/σ)] and the white noise B(t) = −(1 − f_{o})β(t)/[Hσ^{3}(b − c)] has zero mean and standard deviation σ̄ = (1 − f_{o})σ′/[Hσ^{3}(b − c)].

With these properties of the random function B(t), Eq. 3 is a stochastic differential equation that can be solved analytically with the Stratonovich formalism (13). The solution is the stochastic process

δ(t) = δ(0)e^{−At + w_{t}}    (4)

where δ(0) is the initial value of the fluctuation and w_{t} is a Wiener process, with B(t) = dw_{t}/dt, 〈w_{t}〉 = 0, and 〈w_{t}^{2}〉 = σ̄t. This equation shows that typical fluctuations away from overall restraint relax back exponentially fast to the equilibrium point, with a time constant, 1/A, that is a complicated function of the users' evaluation rate and the network parameters.

On the other hand, the mth moments of δ, given by 〈[δ(t)]^{m}〉 = [δ(0)]^{m}〈e^{m(−At + w_{t})}〉, show anomalous behavior. Because 〈e^{mw_{t}}〉 = e^{m^{2}σ̄t}, we get

〈[δ(t)]^{m}〉 = [δ(0)]^{m}e^{m(mσ̄ − A)t}    (5)

As this clearly shows, the higher moments can actually grow in time, with increasing order, at a rate given by m(mσ̄ − A). Remarkably, the average number of defecting individuals, given by the first moment of δ(t), increases in time when the variance in the background speed makes the exponent in Eq. 5 positive (that is, σ̄ > A), whereas almost all realizations of the fluctuations relax exponentially fast to zero, as indicated by Eq. 4 (14).
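This contrast can be checked by directly sampling Eq. 4: draw w_T as a Gaussian with 〈w_T^{2}〉 = 2σ̄T (the convention under which 〈e^{mw_T}〉 = e^{m²σ̄T}, as used above) and compare the median path to the ensemble mean. All parameter values are illustrative.

```python
import math
import random

random.seed(0)
A, sigma_bar, T = 1.0, 1.5, 4.0   # sigma_bar > A: first moment grows
paths = 20000

# delta(T) = delta(0) * exp(-A*T + w_T) with delta(0) = 1; w_T is
# exactly Gaussian, so no time stepping is needed.
finals = sorted(
    math.exp(-A * T + random.gauss(0.0, math.sqrt(2.0 * sigma_bar * T)))
    for _ in range(paths)
)

median = finals[paths // 2]   # a typical realization, near exp(-A*T)
mean = sum(finals) / paths    # dominated by rare, very large bursts

assert median < 0.1           # typical path has relaxed toward zero
assert mean > median          # yet the average sits far above it
```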

In order to understand this paradox, it is useful to remember that the higher moments of δ(t) are related to the probability of very unlikely events. The fact that the moments grow in time indicates that the behavior of the system is dominated by occasional large bursts in which the number of individuals suddenly found not to be refraining is very large, leading to spikes or storms in Internet congestion.

This prediction is borne out in simulations of the full dynamics of this model of collective action with fluctuating benefits. The representative simulation, which was performed without any averaging or linearization, corroborates these analytical results and clearly shows the spikes in the number of defectors (Fig. 1).
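A minimal version of such a simulation can be sketched as a stochastic Euler integration of Eq. 2, with the fluctuating benefit b(t) = b̄ + β(t) entering f_crit at each step. The parameter values below are illustrative stand-ins, not those used for Fig. 1.

```python
import math
import random

def rho(f, fc, sigma):
    """Switching probability from the text."""
    return 0.5 * (1.0 + math.erf((1.0 - f - fc) / sigma))

random.seed(1)
alpha, sigma, N = 1.0, 0.15, 100       # illustrative parameters
b_bar, c, H = 1.0, 0.5, 1200.0
dt, steps = 0.01, 20000

f = 0.9                                 # start near cooperative equilibrium
defectors = []
for _ in range(steps):
    b = b_bar + random.gauss(0.0, 0.05)          # fluctuating benefit
    fc = (1.0 / (H * alpha)) * ((N * c - b) / (b - c))
    f += alpha * (rho(f, fc, sigma) - f) * dt    # Euler step of Eq. 2
    defectors.append(1.0 - f)                    # fraction defecting

# The trajectory fluctuates around the cooperative fixed point, with
# excursions in defection above the background level.
assert all(0.0 < d < 1.0 for d in defectors)
assert max(defectors) > sum(defectors) / steps
```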

To understand the nature of these sudden bursts, we need to compute their quasi-stationary distribution from their evolution over relatively long time periods. Dividing both sides of Eq. 3 by δ(t) and integrating over a time interval, T, gives, up to a constant, ln[δ(t)/δ(0)] = Σ_{0}^{T} B_{n}, where the integral of B(t) over time has been replaced by a sum over discrete time steps, at each of which the random integrand acquires the value B_{n}.

Because the values of B_{n} are normally distributed with zero mean and variance σ̄^{2}, we can evaluate the sum on the right-hand side by invoking the central limit theorem. Therefore, the logarithm of δ(t) is normally distributed with zero mean and variance σ̄^{2}. In other words, δ(t) is distributed lognormally, that is, according to a distribution whose density function is P(x) = [1/(xσ√2π)] exp[−(ln x)^{2}/2σ^{2}], with the mean value of x given by μ = e^{σ^{2}/2} and its variance given by μ^{2}(e^{σ^{2}} − 1).
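This central-limit argument is easy to verify numerically: exponentiating a sum of independent zero-mean Gaussian steps produces a lognormal variable whose sample mean approaches e^{σ²/2}. The parameters below are illustrative.

```python
import math
import random

random.seed(2)
sigma, n_terms, samples = 0.8, 30, 20000
step_sd = sigma / math.sqrt(n_terms)   # the sum then has variance sigma^2

xs = []
for _ in range(samples):
    # Sum of Gaussian steps -> Gaussian log, so x is lognormal
    log_x = sum(random.gauss(0.0, step_sd) for _ in range(n_terms))
    xs.append(math.exp(log_x))

sample_mean = sum(xs) / samples
predicted = math.exp(sigma ** 2 / 2.0)   # lognormal mean e^(sigma^2 / 2)
assert abs(sample_mean - predicted) / predicted < 0.05
```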

To verify the existence of the predicted congestion spikes in the Internet, we performed a series of experiments in which we measured round-trip times for Internet Control Message Protocol (ICMP) ping packets. The time it takes for the ping packet to make its round trip is a measure of the congestion of the network. Typically, these packets were forwarded through about 30 routers, which act as sources of speed fluctuations, thus justifying the assumption of delta-correlated noise in b. Over the course of the measurements, use of the “traceroute” command showed that the pings typically followed the same sequence of routers.
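A measurement of this kind can be sketched with the system `ping` utility. The `-c`/`-W` flags shown are the common Linux ones, and the target host is a hypothetical placeholder; this is an illustration of the procedure, not the instrumentation actually used.

```python
import re
import subprocess
import time

RTT_RE = re.compile(r"time=([\d.]+)")   # matches e.g. "time=42.3 ms"

def ping_once(host, timeout_s=2):
    """Send one ICMP echo request; return the round-trip time in ms,
    or None if the packet was lost or timed out."""
    out = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True, text=True,
    ).stdout
    m = RTT_RE.search(out)
    return float(m.group(1)) if m else None

def collect(host, n=100, interval_s=0.25):
    """Record up to n latencies, pinging at a fixed rate."""
    samples = []
    for _ in range(n):
        rtt = ping_once(host)
        if rtt is not None:            # dropped packets are skipped
            samples.append(rtt)
        time.sleep(interval_s)
    return samples
```

Calling `collect` against a remote server would then yield a latency time series of the kind analyzed below.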

Our measurements of round-trip times for ping packets sent every minute over the course of a day demonstrated that there are daily periodicities in the level of congestion around which higher-frequency fluctuations occur. Because these daily periodicities are an exogenous source of congestion, we studied instead the latencies of pings sent at a higher frequency over shorter periods of time. During these experiments, the stochastic dynamics of the congestion could then be considered as quasi-stationary.

A time series of nearly 10,000 pings was sent from a public workstation at Stanford University in California to a WWW server in the United Kingdom over the course of about 45 min (Fig. 2). Because the trans-Atlantic path taken by the packets is one of the most congested in the world and because the time difference between the United States and the United Kingdom is large, these data provide a test of the theory that is not complicated by too much variation in the background due to exogenous factors. The series clearly exhibits latency spikes, or storms, superimposed on a noisy background. The spikes are bounded in size because Internet routers drop packets that are not delivered within a long enough time. This kind of behavior was observed in many other such experiments, involving measured latencies between remote sites in the United States as well as within the San Francisco Bay area.

Because the time series of pings was measured at unequally spaced intervals, we calculated the autocorrelation function of the observed 9821 latencies by first computing the Lomb periodogram power spectrum and then taking its inverse Fourier transform (15). The resulting autocorrelation showed that the ping latencies are uncorrelated in time. This supports the assumption of stationarity, because periodicity in the congestion over the course of the measurements would be revealed as structure in this function.
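This estimate can be sketched with `scipy.signal.lombscargle`, using synthetic uncorrelated samples in place of the measured pings; the inverse transform of the power spectrum estimates the autocorrelation via the Wiener-Khinchin relation. The sample counts and frequency grid are illustrative.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.0, 2700.0, 2000))   # uneven sample times (s)
y = rng.lognormal(0.0, 0.5, size=t.size)      # uncorrelated "latencies"
y -= y.mean()                                  # periodogram needs zero mean

ang_freqs = np.linspace(1e-3, 2.0, 1024)       # angular frequencies (rad/s)
power = lombscargle(t, y, ang_freqs)           # Lomb periodogram

# Inverse Fourier transform of the power spectrum approximates the
# autocorrelation; normalize so the lag-zero value equals 1.
acf = np.fft.irfft(power)
acf /= acf[0]

# For uncorrelated data, no structure is expected beyond lag zero.
assert abs(acf[0] - 1.0) < 1e-9
assert np.isfinite(acf).all()
```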

The histogram of ping times, along with a superimposed lognormal distribution with the same mean and variance as the data, is shown (Fig. 3). As can be seen, a lognormal distribution appears to be a good fit, although the statistical significance is not high. This is to be expected, because the network's built-in bounds on extreme latencies prevent the realization of any long tail in the distribution, which makes a quantitative statistical comparison difficult.
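A comparison of this kind can be sketched with `scipy.stats.lognorm`, fitting synthetic latency-like samples that stand in for the measured ping times:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(5)
# Synthetic stand-in for the 9821 measured latencies (arbitrary units)
latencies = rng.lognormal(mean=5.0, sigma=0.4, size=9821)

# With the location pinned at zero, the fit recovers the log-space
# parameters: s is the sigma of the logs, log(scale) their mean.
s, loc, scale = lognorm.fit(latencies, floc=0)
assert abs(s - 0.4) < 0.05
assert abs(np.log(scale) - 5.0) < 0.05
```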

These results support the interpretation that large congestion peaks, or storms, are due to the aggregate behavior of individuals confronted with a social dilemma. Given their generality, they may also hold for urban and highway traffic. Concerning the statistics of congestion, they show that care must be exercised when speaking about the average performance of the Internet: because the distribution of latencies has a large skewness, the mode (which corresponds to typical behavior) does not equal the mean. Thus, data such as the Internet Weather Report provided by Matrix Information and Directory Services (MIDS) (4), which are averaged over thousands of sites, can be misleading in terms of the heavy congestion they imply. Rather, a typical Internet user would encounter latencies shorter than those exhibited in the MIDS data.

The existence of a well-defined statistical distribution of Internet latencies can also help in the design of restart algorithms for accessing remote sites of the WWW during times of high congestion. We showed earlier that the introduction of the notion of risk into the execution of computer programs leads to the construction of computational portfolios that can be used to increase the speed and reliability with which Web pages can be accessed (16). Thus, knowledge of the underlying distribution allows for speedy construction of such portfolios without having to sample data over very long times.
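As an illustration of how a known latency distribution feeds such a restart policy, the sketch below aborts any request slower than a cutoff τ and resends it; for a heavy-tailed lognormal this lowers the mean completion time. The distribution parameters and cutoff are arbitrary illustrations, not values from any measurement.

```python
import math
import random

random.seed(4)

def one_latency(mu=0.0, sigma=2.0):
    """Draw a single attempt's latency from an illustrative lognormal."""
    return math.exp(random.gauss(mu, sigma))

def completion_time(tau):
    """Total time until success when every attempt slower than tau
    is aborted (costing tau) and retried."""
    total = 0.0
    while True:
        t = one_latency()
        if t <= tau:
            return total + t
        total += tau

trials = 20000
no_restart = sum(one_latency() for _ in range(trials)) / trials
with_restart = sum(completion_time(tau=2.0) for _ in range(trials)) / trials

assert with_restart < no_restart   # restarts trim the heavy tail
```

The design choice is the cutoff τ: too small and most attempts are wasted, too large and the tail dominates; knowledge of the latency distribution lets it be chosen without long sampling runs.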

Another aspect of our experiments worth mentioning is that the measured autocorrelations do not exhibit the long time tails that have been found in studies of Ethernet packet flow rates (17). This difference can be attributed to the fact that those autocorrelations seem to correspond to background levels of traffic, which have different statistical properties than those of the higher-frequency congestion-related spikes that we studied.

Finally, we point out that congestion problems as presently experienced by users of the Internet would disappear if individuals were charged in proportion to their bandwidth consumption. Such schemes, which are well known in the context of highway and urban traffic (18, 19), are an active area of research in the Internet community (2, 20-22) and hold the promise of relieving congestion. In the meantime, the Internet in its present state offers a unique environment for studying the social dynamics underlying network externalities.