Report

Identity and Search in Social Networks

See allHide authors and affiliations

Science  17 May 2002:
Vol. 296, Issue 5571, pp. 1302-1305
DOI: 10.1126/science.1070120

Abstract

Social networks have the surprising property of being “searchable”: Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.

In the late 1960s, Travers and Milgram (1) conducted an experiment in which randomly selected individuals in Boston, Massachusetts, and Omaha, Nebraska, were asked to direct letters to a target person in Boston, each forwarding his or her letter to a single acquaintance whom they judged to be closer than themselves to the target. Subsequent recipients did the same. The average length of the resulting acquaintance chains for the letters that eventually reached the target (roughly 20%) was about six. This reveals not only that short paths exist (2, 3) between individuals in a large social network but that ordinary people can find these short paths (4). This is not a trivial statement, because people rarely have more than local knowledge about the network. People know who their friends are. They may also know who some of their friends' friends are. But no one knows the identities of the entire chain of individuals between themselves and an arbitrary target.

The property of being able to find a target quickly, which we call searchability, has been shown to exist in certain specific classes of networks that either possess a certain fraction of hubs [highly connected nodes which, once reached, can distribute messages to all parts of the network (5–7)] or are built upon an underlying geometric lattice that acts as a proxy for “social space” (4). Neither of these network types, however, is a satisfactory model of society.

Here, we present a model for a social network that is based upon plausible social structures and offers an explanation for the phenomenon of searchability. Our model follows naturally from six contentions about social networks.

1) Individuals in social networks are endowed not only with network ties, but identities (8): sets of characteristics attributed to them by themselves and others by virtue of their association with, and participation in, social groups (9, 10). The term “group” refers to any collection of individuals with which some well-defined set of social characteristics is associated.

2) Individuals break down, or partition, the world hierarchically into a series of layers, where the top layer accounts for the entire world and each successively deeper layer represents a cognitive division into a greater number of increasingly specific groups. In principle, this process of distinction by division can be pursued all the way down to the level of individuals, at which point each person is uniquely associated with his or her own group. For purposes of identification, however, people do not typically do this, instead terminating the process at the level where the corresponding group size gbecomes cognitively manageable. Academic departments, for example, are sometimes small enough to function as a single group but tend to split into specialized subgroups as they grow larger. A reasonable upper bound on group size (9) is g ≅ 100, a number that we incorporate into our model (Fig. 1A). We define the similarityxij between individuals i andj as the height of their lowest common ancestor level in the resulting hierarchy, setting xij = 1 ifi and j belong to the same group. The hierarchy is fully characterized by depth l and constant branching ratio b. The hierarchy is a purely cognitive construct for measuring social distance, and not an actual network. The real network of social connections is constructed as follows.

Figure 1

(A) Individuals (dots) belong to groups (ellipses) that in turn belong to groups of groups, and so on, giving rise to a hierarchical categorization scheme. In this example, groups are composed of g = 6 individuals and the hierarchy has l = 4 levels with a branching ratio of b = 2. Individuals in the same group are considered to be a distance x = 1 apart, and the maximum separation of two individuals is x =l. The individuals i and jbelong to a category two levels above that of their respective groups, and the distance between them is xij = 3. Individuals each have z friends in the model and are more likely to be connected with each other the closer their groups are. (B) The complete model has many hierarchies indexed byh = 1…H, and the combined social distance yij between nodes i andj is taken to be the minimum ultrametric distance over all hierarchies yij = minh x ij h. The simple example shown here forH = 2 demonstrates that social distance can violate the triangle inequality: yij = 1 becausei and j belong to the same group under the first hierarchy and similarly yjk = 1 but iand k remain distant in both hierarchies, givingyik = 4 >yij +yjk = 2.

3) Group membership, in addition to defining individual identity, is a primary basis for social interaction (10, 11) and therefore acquaintanceship. As such, the probability of acquaintance between individuals i and j decreases with decreasing similarity of the groups to which they respectively belong. We model this by choosing an individual i at random and a link distance x with probabilityp(x) = cexp[−αx], where α is a tunable parameter and c is a normalizing constant. We then choose a second node j uniformly among all nodes that are a distance x from i, repeating this process until we have constructed a network in which individuals have an average number of friends z. The parameter α is therefore a measure of homophily—the tendency of like to associate with like. When e −α ≪ 1, all links will be as short as possible, and individuals will connect only to those most similar to themselves (i.e., members of their own bottom-level group), yielding a completely homophilous world of isolated cliques. By contrast, when e −α = b, any individual is equally likely to interact with any other, yielding a uniform random graph (12) in which the notion of individual similarity or dissimilarity has become irrelevant.

4) Individuals hierarchically partition the social world in more than one way (for example, by geography and by occupation). We assume that these categories are independent, in the sense that proximity in one does not imply proximity in another. For example, two people may live in the same town but not share the same profession. In our model, we represent each such social dimension by an independently partitioned hierarchy. A node's identity is then defined as anH-dimensional coordinate vectorv⃗ i, wherev i h is the position of nodei in the hth hierarchy, or dimension. Each nodei is randomly assigned a coordinate in each of Hdimensions and is then allocated neighbors (friends) as described above, where now it randomly chooses a dimension h (e.g., occupation) to use for each tie. When H = 1 ande −α ≪ 1, the density of network ties must obey the constraint z < g.

5) On the basis of their perceived similarity with other nodes, individuals construct a measure of “social distance”yij , which we define as the minimum ultrametric distance over all dimensions between two nodes i andj; i.e., yij = minh x ij h. This minimum metric captures the intuitive notion that closeness in only a single dimension is sufficient to connote affiliation (for example, geographically and ethnically distant researchers who collaborate on the same project). A consequence of this minimal metric, depicted in Fig. 1B, is that social distance violates the triangle inequality—hence it is not a true metric distance—because individualsi and j can be close in dimensionh 1, and individuals j andk can be close in dimension h 2, yeti and k can be far apart in both dimensions.

6) Individuals forward a message to a single neighbor given only local information about the network. Here, we suppose that each nodei knows only its own coordinate vectorv⃗ i, the coordinate vectorsv⃗ j of its immediate network neighbors, and the coordinate vector of a given target individualv⃗ t, but is otherwise ignorant of the identities or network ties of nodes beyond its immediate circle of acquaintances.

Individuals therefore have two kinds of partial information: social distance, which can be measured globally but which is not a true distance (and hence can yield misleading estimates); and network paths, which generate true distances but which are known only locally. Although neither kind of information alone is sufficient to perform efficient searches, here we show that a simple algorithm that combines knowledge of network ties and social identity can succeed in directing messages efficiently. The algorithm we implement is the same greedy algorithm Milgram suggested: Each member i of a message chain forwards the message to its neighbor j who is closest to the target t in terms of social distance; that is,yjt is minimized over all j ini's network neighborhood.

Our principal objective is to determine the conditions under which the average length 〈L〉 of a message chain connecting a randomly selected sender s to a random target tis small. Although “small” has recently been taken to mean that 〈L〉 grows slowly with the population size N(13, 14), Travers and Milgram found only that chain lengths were short. Furthermore, these message chains had to be short in an absolute sense because at each step, they were observed to terminate with probability p ≅ 0.25 (1,15). We therefore adopt a more realistic, functional notion of efficient search, defining for a given message failure probabilityp, a searchable network as any network for whichq, the probability of an arbitrary message chain reaching its target, is at least a fixed value r. In terms of chain length, we formally require q = 〈(1 −p)L〉 ≥ r, and from this we can obtain an estimate of the maximum required 〈L〉 using the approximated inequality 〈L〉 ≤ lnr/ln(1 − p). For the purposes of this study, we set r = 0.05 and p = 0.25, yielding the stringent requirement that 〈L〉 ≤ 10.4 independent of the population size N. Figure 2A presents a typical phase diagram inH and α, outlining the searchable network region for several choices of N, g = 100, andz = g − 1 = 99.

Figure 2

(A) Regions in H-α space where searchable networks exist for varying numbers of individual nodes N (probability of message failurep = 0.25, branching ratio b = 2, group size g = 100, average degree z =g − 1 = 99, 105 chains sampled per network). The searchability criterion is that the probability of message completion q must be at least r = 0.05. The lines correspond to boundaries of the searchable network region for N = 102,400 (solid), N = 204,800 (dot-dash), and N = 409,600 (dash). The region of searchable networks shrinks with N, vanishing at a finite value of N that depends on the model parameters. Note thatz < g is required to exploreH-α space because for H = 1 and α sufficiently large, an individual's neighbors must all be contained within their sole local group. (B) Probability of message completion q(H) when α = 0 (squares) and α = 2 (circles) for the N = 102,400 data set used in (A). The horizontal line shows the position of the threshold r = 0.05. Open symbols indicate that the network is searchable (qr) and closed symbols mean otherwise. For α = 0, searchability degrades with each additional hierarchy. For the homophilous case of α = 2 with a single hierarchy, less than 1% of all searches find their target (q ≅ 0.004). Adding just one other hierarchy increases the success rate to q ≅ 0.144, and qslowly decreases with H thereafter.

Our main result is that searchable networks occupy a broad region of parameter space (α,H) which, as we argue below, corresponds to choices of the model parameters that are the most sociologically plausible. Hence our model suggests that searchability is a generic property of real-world social networks. We support this claim with some further observations and demonstrate that our model can account for Milgram's experimental findings.

First, we observe that almost all searchable networks display α > 0 and H > 1, consistent with the notion that individuals are essentially homophilous (that is, they associate preferentially with like individuals) but judge similarity along more than one social dimension. Neither the precise degree to which they are homophilous, nor the exact number of dimensions they choose to use, appears to be important—almost any reasonable choice will do. The best performance, over the largest interval of α, is achieved forH = 2 or 3—an interesting result in light of empirical evidence (16) that individuals across different cultures in small-world experiments typically use two or three dimensions when forwarding a message.

Second, as Fig. 2B shows, although increasing the number of independent dimensions from H = 1 yields a dramatic reduction in delivery time for values of α > 0, this improvement is gradually lost as H is increased further. Hence the window of searchable networks in Fig. 2A exhibits an upper boundary inH. Because ties associated with any one dimension are allocated independently with respect to ties in any other dimension, and because for fixed average degree z, larger Hnecessarily implies fewer ties per dimension, the network ties become less correlated as H increases. In the limit of largeH, the network becomes essentially a random graph (regardless of α) and the search algorithm becomes a random walk. An effective decentralized search therefore requires a balance (albeit a highly forgiving one) of categorical flexibility and constraint.

Finally, by introducing parameter choices that are consistent with Milgram's experiment (N = 108,p = 0.25) (1), as well as with subsequent empirical findings (z = 300, H = 2) (17, 16), we can compare the distribution of chain lengths in our model with that of Travers and Milgram (1) for plausible values of α and b. As Fig. 3 shows, we obtain 〈L〉 ≅ 6.7 for α = 1 and b = 10, indicating that our model captures the essence of the real small-world problem. This agreement is robust with respect to variations in the branching ratio, showing little change over the range 5 < b < 50.

Figure 3

Comparison betweenn(L), the number of completed chains of lengthL, taken from the original small-world experiment (1) (bar graph) and from an example of our model withN = 108 individuals (filled circles with the line being a guide for the eye). The experimental data shown are for the 42 completed chains that originated in Nebraska. (We have excluded the 24 completed chains that originated in Boston as this would correspond to N ≅ 106.) The model parameters are H = 2, α = 1, b = 10, g = 100, and z = 300; message attrition rate is set at 25%; n(L) for the model is compiled from 106 random chains and is normalized to match the 42 completed chains that started in Nebraska. The average chain length of Milgram's experiment is ∼6.5, whereas the model yields 〈L〉 ≅ 6.7. The distributions compare well: A two-sided Kolmogorov-Smirnov test yields a P-value ofP ≅ 0.57, whereas for a χ2 test, χ2 ≅ 5.46 and P ≅ 0.49 (seven bins). (A large value of P supports the hypothesis that the distributions are similar.) Even without attrition, the model's average search time is 〈L〉 ≅ 8.5 and the median chain length is 8. The model does not entirely match the experimental data because the former requires approximately 360 initial chains to achieve 42 completions as compared with 196.

Although sociological in origin, our model is relevant to a broad class of decentralized search problems, such as peer-to-peer networking, in which centralized servers are excluded either by design or by necessity, and where broadcast-type searches (i.e., forwarding messages to all neighbors rather than just one) are ruled out because of congestion constraints (6). In essence, our model applies to any data structure in which data elements exhibit quantifiable characteristics analogous to our notion of identity, and similarity between two elements—whether people, music files, Web pages, or research reports—can be judged along more than one dimension. One of the principal difficulties with designing robust databases (18) is the absence of a unique classification scheme that all users of the database can apply consistently to place and locate files. Two musical songs, for example, can be similar because they belong to the same genre or because they were created in the same year. Our model transforms this difficulty into an asset, allowing all such classification schemes to exist simultaneously, and connecting data elements preferentially to similar elements in multiple dimensions. Efficient decentralized searches can then be conducted by means of simple, greedy algorithms providing only that the characteristics of the target element and the current element's immediate neighbors are known.

  • * To whom correspondence should be addressed. E-mail: djw24{at}columbia.edu

REFERENCES AND NOTES

View Abstract

Stay Connected to Science

Navigate This Article