## Abstract

Social networks have the surprising property of being “searchable”: Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.

In the late 1960s, Travers and Milgram (1) conducted an experiment in which randomly selected individuals in Boston, Massachusetts, and Omaha, Nebraska, were asked to direct letters to a target person in Boston, each forwarding his or her letter to a single acquaintance whom they judged to be closer than themselves to the target. Subsequent recipients did the same. The average length of the resulting acquaintance chains for the letters that eventually reached the target (roughly 20%) was about six. This reveals not only that short paths exist (2, 3) between individuals in a large social network but that ordinary people can find these short paths (4). This is not a trivial statement, because people rarely have more than local knowledge about the network. People know who their friends are. They may also know who some of their friends' friends are. But no one knows the identities of the entire chain of individuals between themselves and an arbitrary target.

The property of being able to find a target quickly, which we call searchability, has been shown to exist in certain specific classes of networks that either possess a certain fraction of hubs [highly connected nodes which, once reached, can distribute messages to all parts of the network (5–7)] or are built upon an underlying geometric lattice that acts as a proxy for “social space” (4). Neither of these network types, however, is a satisfactory model of society.

Here, we present a model for a social network that is based upon plausible social structures and offers an explanation for the phenomenon of searchability. Our model follows naturally from six contentions about social networks.

1) Individuals in social networks are endowed not only with network ties, but identities (8): sets of characteristics attributed to them by themselves and others by virtue of their association with, and participation in, social groups (9, 10). The term “group” refers to any collection of individuals with which some well-defined set of social characteristics is associated.

2) Individuals break down, or partition, the world hierarchically into a series of layers, where the top layer accounts for the entire world and each successively deeper layer represents a cognitive division into a greater number of increasingly specific groups. In principle, this process of distinction by division can be pursued all the way down to the level of individuals, at which point each person is uniquely associated with his or her own group. For purposes of identification, however, people do not typically do this, instead terminating the process at the level where the corresponding group size *g*becomes cognitively manageable. Academic departments, for example, are sometimes small enough to function as a single group but tend to split into specialized subgroups as they grow larger. A reasonable upper bound on group size (9) is *g* ≅ 100, a number that we incorporate into our model (Fig. 1A). We define the similarity*x _{ij}
* between individuals

*i*and

*j*as the height of their lowest common ancestor level in the resulting hierarchy, setting

*x*= 1 if

_{ij}*i*and

*j*belong to the same group. The hierarchy is fully characterized by depth

*l*and constant branching ratio

*b*. The hierarchy is a purely cognitive construct for measuring social distance, and not an actual network. The real network of social connections is constructed as follows.

3) Group membership, in addition to defining individual identity, is a primary basis for social interaction (10, 11) and therefore acquaintanceship. As such, the probability of acquaintance between individuals *i* and *j* decreases with decreasing similarity of the groups to which they respectively belong. We model this by choosing an individual *i* at random and a link distance *x* with probability*p*(*x*) = *c*exp[−α*x*], where α is a tunable parameter and *c* is a normalizing constant. We then choose a second node *j* uniformly among all nodes that are a distance *x* from *i*, repeating this process until we have constructed a network in which individuals have an average number of friends *z*. The parameter α is therefore a measure of homophily—the tendency of like to associate with like. When *e*
^{−α} ≪ 1, all links will be as short as possible, and individuals will connect only to those most similar to themselves (i.e., members of their own bottom-level group), yielding a completely homophilous world of isolated cliques. By contrast, when *e*
^{−α} = *b*, any individual is equally likely to interact with any other, yielding a uniform random graph (12) in which the notion of individual similarity or dissimilarity has become irrelevant.

4) Individuals hierarchically partition the social world in more than one way (for example, by geography and by occupation). We assume that these categories are independent, in the sense that proximity in one does not imply proximity in another. For example, two people may live in the same town but not share the same profession. In our model, we represent each such social dimension by an independently partitioned hierarchy. A node's identity is then defined as an*H*-dimensional coordinate vector*v⃗*
_{i}, where*v*
_{i}
^{h} is the position of node*i* in the *h*th hierarchy, or dimension. Each node*i* is randomly assigned a coordinate in each of *H*dimensions and is then allocated neighbors (friends) as described above, where now it randomly chooses a dimension *h* (e.g., occupation) to use for each tie. When *H* = 1 and*e*
^{−α} ≪ 1, the density of network ties must obey the constraint *z* < *g*.

5) On the basis of their perceived similarity with other nodes, individuals construct a measure of “social distance”*y _{ij}
*, which we define as the minimum ultrametric distance over all dimensions between two nodes

*i*and

*j*; i.e.,

*y*= min

_{ij}_{h}

*x*

_{ij}

^{h}. This minimum metric captures the intuitive notion that closeness in only a single dimension is sufficient to connote affiliation (for example, geographically and ethnically distant researchers who collaborate on the same project). A consequence of this minimal metric, depicted in Fig. 1B, is that social distance violates the triangle inequality—hence it is not a true metric distance—because individuals

*i*and

*j*can be close in dimension

*h*

_{1}, and individuals

*j*and

*k*can be close in dimension

*h*

_{2}, yet

*i*and

*k*can be far apart in both dimensions.

6) Individuals forward a message to a single neighbor given only local information about the network. Here, we suppose that each node*i* knows only its own coordinate vector*v⃗*
_{i}, the coordinate vectors*v⃗*
_{j} of its immediate network neighbors, and the coordinate vector of a given target individual*v⃗*
_{t}, but is otherwise ignorant of the identities or network ties of nodes beyond its immediate circle of acquaintances.

Individuals therefore have two kinds of partial information: social distance, which can be measured globally but which is not a true distance (and hence can yield misleading estimates); and network paths, which generate true distances but which are known only locally. Although neither kind of information alone is sufficient to perform efficient searches, here we show that a simple algorithm that combines knowledge of network ties and social identity can succeed in directing messages efficiently. The algorithm we implement is the same greedy algorithm Milgram suggested: Each member *i* of a message chain forwards the message to its neighbor *j* who is closest to the target *t* in terms of social distance; that is,*y _{jt}
* is minimized over all

*j*in

*i*'s network neighborhood.

Our principal objective is to determine the conditions under which the average length 〈*L*〉 of a message chain connecting a randomly selected sender *s* to a random target *t*is small. Although “small” has recently been taken to mean that 〈*L*〉 grows slowly with the population size *N*(13, 14), Travers and Milgram found only that chain lengths were short. Furthermore, these message chains had to be short in an absolute sense because at each step, they were observed to terminate with probability *p* ≅ 0.25 (1,15). We therefore adopt a more realistic, functional notion of efficient search, defining for a given message failure probability*p*, a searchable network as any network for which*q*, the probability of an arbitrary message chain reaching its target, is at least a fixed value *r*. In terms of chain length, we formally require *q* = 〈(1 −*p*)^{L}〉 ≥ *r*, and from this we can obtain an estimate of the maximum required 〈*L*〉 using the approximated inequality 〈*L*〉 ≤ ln*r*/ln(1 − *p*). For the purposes of this study, we set *r* = 0.05 and *p* = 0.25, yielding the stringent requirement that 〈*L*〉 ≤ 10.4 independent of the population size *N*. Figure 2A presents a typical phase diagram in*H* and α, outlining the searchable network region for several choices of *N*, *g* = 100, and*z* = *g *− 1 = 99.

Our main result is that searchable networks occupy a broad region of parameter space (α,*H*) which, as we argue below, corresponds to choices of the model parameters that are the most sociologically plausible. Hence our model suggests that searchability is a generic property of real-world social networks. We support this claim with some further observations and demonstrate that our model can account for Milgram's experimental findings.

First, we observe that almost all searchable networks display α > 0 and *H* > 1, consistent with the notion that individuals are essentially homophilous (that is, they associate preferentially with like individuals) but judge similarity along more than one social dimension. Neither the precise degree to which they are homophilous, nor the exact number of dimensions they choose to use, appears to be important—almost any reasonable choice will do. The best performance, over the largest interval of α, is achieved for*H* = 2 or 3—an interesting result in light of empirical evidence (16) that individuals across different cultures in small-world experiments typically use two or three dimensions when forwarding a message.

Second, as Fig. 2B shows, although increasing the number of independent dimensions from *H* = 1 yields a dramatic reduction in delivery time for values of α > 0, this improvement is gradually lost as *H* is increased further. Hence the window of searchable networks in Fig. 2A exhibits an upper boundary in*H*. Because ties associated with any one dimension are allocated independently with respect to ties in any other dimension, and because for fixed average degree *z*, larger *H*necessarily implies fewer ties per dimension, the network ties become less correlated as *H* increases. In the limit of large*H*, the network becomes essentially a random graph (regardless of α) and the search algorithm becomes a random walk. An effective decentralized search therefore requires a balance (albeit a highly forgiving one) of categorical flexibility and constraint.

Finally, by introducing parameter choices that are consistent with Milgram's experiment (*N* = 10^{8},*p* = 0.25) (1), as well as with subsequent empirical findings (*z* = 300, *H* = 2) (17, 16), we can compare the distribution of chain lengths in our model with that of Travers and Milgram (1) for plausible values of α and *b*. As Fig. 3 shows, we obtain 〈*L*〉 ≅ 6.7 for α = 1 and *b* = 10, indicating that our model captures the essence of the real small-world problem. This agreement is robust with respect to variations in the branching ratio, showing little change over the range 5 < *b* < 50.

Although sociological in origin, our model is relevant to a broad class of decentralized search problems, such as peer-to-peer networking, in which centralized servers are excluded either by design or by necessity, and where broadcast-type searches (i.e., forwarding messages to all neighbors rather than just one) are ruled out because of congestion constraints (6). In essence, our model applies to any data structure in which data elements exhibit quantifiable characteristics analogous to our notion of identity, and similarity between two elements—whether people, music files, Web pages, or research reports—can be judged along more than one dimension. One of the principal difficulties with designing robust databases (18) is the absence of a unique classification scheme that all users of the database can apply consistently to place and locate files. Two musical songs, for example, can be similar because they belong to the same genre or because they were created in the same year. Our model transforms this difficulty into an asset, allowing all such classification schemes to exist simultaneously, and connecting data elements preferentially to similar elements in multiple dimensions. Efficient decentralized searches can then be conducted by means of simple, greedy algorithms providing only that the characteristics of the target element and the current element's immediate neighbors are known.

↵* To whom correspondence should be addressed. E-mail: djw24{at}columbia.edu