Report

Superfamilies of Evolved and Designed Networks

See allHide authors and affiliations

Science  05 Mar 2004:
Vol. 303, Issue 5663, pp. 1538-1542
DOI: 10.1126/science.1089167

Abstract

Complex biological, technological, and sociological networks can be of very different sizes and connectivities, making it difficult to compare their structures. Here we present an approach to systematically study similarity in the local structure of networks, based on the significance profile (SP) of small subgraphs in the network compared to randomized networks. We find several superfamilies of previously unrelated networks with very similar SPs. One superfamily, including transcription networks of microorganisms, represents “rate-limited” information-processing networks strongly constrained by the response time of their components. A distinct superfamily includes protein signaling, developmental genetic networks, and neuronal wiring. Additional superfamilies include power grids, protein-structure networks and geometric networks, World Wide Web links and social networks, and word-adjacency networks from different languages.

Many networks in nature share global properties (1, 2). Their degree sequences (the number of edges per node) often follow a long-tailed distribution, in which some nodes are much more connected than the average (3). In addition, natural networks often show the small-world property of short paths between nodes and highly clustered connections (1, 2, 4). Despite these global similarities, networks from different fields can have very different local structure (5). It was recently found that networks display certain patterns, termed “network motifs,” at much higher frequency than expected in randomized networks (6, 7). In biological networks, these motifs were suggested to be recurring circuit elements that carry out key information-processing tasks (6, 810).

To understand the design principles of complex networks, it is important to compare the local structure of networks from different fields. The main difficulty is that these networks can be of vastly different sizes [for example, World Wide Web (WWW) hyperlink networks with millions of nodes and social networks with tens of nodes] and degree sequences. Here, we present an approach for comparing network local structure, based on the significance profile (SP). To calculate the SP of a network, the network is compared to an ensemble of randomized networks with the same degree sequence. The comparison to randomized networks compensates for effects due to network size and degree sequence. For each subgraph i, the statistical significance is described by the Z score (11): Math where Nreali is the number of times the subgraph appears in the network, and <Nrandi> and std(Nrandi) are the mean and standard deviation of its appearances in the randomized network ensemble. The SP is the vector of Z scores normalized to length 1: Math The normalization emphasizes the relative significance of subgraphs, rather than the absolute significance. This is important for comparison of networks of different sizes, because motifs (subgraphs that occur much more often than expected at random) in large networks tend to display higher Z scores than motifs in small networks (7).

We present in Fig. 1 the SP of the 13 possible directed connected triads (triad significance profile, TSP) for networks from different fields (12). The TSP of these networks is almost always insensitive to removal of 30% of the edges or to addition of 50% new edges at random, demonstrating that it is robust to missing data or random data errors (SOM Text). Several superfamilies of networks with similar TSPs emerge from this analysis. One superfamily includes sensory transcription networks that control gene expression in bacteria and yeast in response to external stimuli. In these transcription networks, the nodes represent genes or operons and the edges represent direct transcriptional regulation (6, 1315). Networks from three microorganisms, the bacteria Escherichia coli (6) and Bacillus subtilis (14) and the yeast Saccharomyces cerevisiae (7, 15), were analyzed. The networks have very similar TSPs (correlation coefficient c > 0.99). They show one strong motif, triad 7, termed “feedforward loop.” The feedforward loop has been theoretically and experimentally shown to perform signal-processing tasks such as persistence detection (6, 10), pulse generation, and acceleration of transcription responses (9). Triad 3, the 3-chain, is an anti-motif (a significantly underrepresented subgraph) corresponding to the shallow architecture of these networks, which have few long cascades (16). These networks are “sensory networks,” which need to respond within minutes to transient signals such as stresses and nutrients. The minimal time required for response (for the first proteins to be expressed) is indeed on the order of minutes. If the information needs to pass additional steps (a regulator protein needs to be expressed and cross its activation threshold to turn on a gene), then the response time is much longer. The response time in each cascade step has been experimentally and theoretically shown to be on the order of the gene-product lifetime (8), often tens of minutes. Thus, these networks are “rate-limited networks,” where the desired response times are often as short as the response times of the network components.

Fig. 1.

The triad significance profile (TSP) of networks fromvarious disciplines. The TSP shows the normalized significance level (Z score) for each of the 13 triads. Networks with similar characteristic profiles are grouped into superfamilies. The lines connecting the significance values serve as guides to the eye. The networks are as follows (where N and E are the number of nodes and edges, respectively) (12): (i) Direct transcription interactions in the bacteria E. coli (6) (TRANSC-E.COLI N = 424, E = 519) and B. subtilis (14) (TRANSC-B.SUBTILIS N = 516, E = 577) and in the yeast S. cerevisiae [TRANC-YEAST N = 685, E = 1052 (7) and TRANSC-YEAST-2 N = 2341, E=3969 (15)]. (ii) Signal-transduction interactions in mammalian cells based on the signal transduction knowledge environment (STKE, http://stke.sciencemag.org/) (SIGNAL-TRANSDUCTION N = 491, E = 989), transcription networks that guide development in fruit fly (from the GeNet literature database, www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm) (TRANSC-DROSOPHILA N = 110, E = 307), endomesoderm development in sea urchin (20) (TRANSC-SEAURCHIN N = 45, E = 83), and synaptic connections between neurons in C. elegans (NEURONS N = 280, E = 2170). (iii) WWW hyperlinks between Web pages in the www.nd.edu site (3) (WWW-1 N = 325729, E = 1469678), pages related to literary studies of Shakespeare (21) (WWW-2 N = 277114, E = 927400), and pages related to tango, specifically the music of Piazzolla (21) (WWW-3 N = 47870, E = 235441); and social networks, including inmates in prison (SOCIAL-1 N = 67, E = 182), sociology freshmen (22) (SOCIAL-2 N = 28, E = 110), and college students in a course about leadership (SOCIAL-3 N = 32, E = 96). (iv) Word-adjacency networks of a text in English (ENGLISH N = 7724, E = 46281), French (FRENCH N = 9424, E = 24295), Spanish (SPANISH N = 12642, E = 45129), and Japanese (JAPANESE N = 3177, E = 8300) and a bipartite model with two groups of nodes of sizes N1 = 1000 and N2 = 10 with probability of a directed or mutual edge between nodes of different groups being p = 0.06 and q = 0.003, respectively, and no edges between nodes within the same group (BIPARTITE N = 1010, E = 1261).

In the rate-limited network superfamily, long cascades and feedback loops are rare (16). Feedbacks are usually closed by protein-protein interactions and not by transcription (17). Purely transcriptional feedback loops may be rare because they are unstable and noisy due to their delays (18) or because they can be locked into an irreversible state (19), both undesirable properties for sensory transcription networks.

We find a distinct superfamily that includes three types of biological networks (12): signal-transduction interactions in mammalian cells based on the Signal Transduction Knowledge Environment (STKE) (12), developmental transcription networks that guide development in fruit fly (12) and sea-urchin (20), and synaptic wiring between neurons in Caenorhabditis elegans (12). These networks show triads 7, 9, and 10 with positive TSPs, and triads 1, 2, 4, and 5 with negative TSPs (Fig. 1). In contrast to the sensory transcription networks of microorganisms, these networks display two-node feedbacks that regulate or are regulated by a third node (triads 10, 9) and are less biased against cascades (triad 3). The common feature to this superfamily of information-processing networks is that the response time of each step in the network is usually much shorter than the response time required for the biological function of the network. Protein signal-transduction networks often need to respond within an hour or longer, but each interaction can take minutes or less. Cascade steps in developmental networks can have response times of tens of minutes, but the processes they control are much slower, on the order of animal (19) cell-division times that can take several hours. Neuronal networks in C. elegans typically need to respond within a second, but neuron response times are shorter than 100 ms. Thus, it appears that this superfamily characterizes biological information-processing networks that are not rate-limited.

Next, we analyzed three WWW networks of hyperlinks between Web pages related to university, literature, or music (3, 21). The TSPs were quite similar (c = 0.7 to 0.9). Triads 9, 10, 12, and 13 had the highest TSP values, and 4, 5, and 6 the lowest. The overrepresented triads have many transitive triplet interactions, where if xy and yz then xz (table S2). For example, the overrepresented triad 13, termed “clique,” has six transitive interactions, the highest transitivity possible in a triad. The less represented triads such as 6, 8, and 11 are highly intransitive.

We also analyzed three social networks, in which nodes represent people in a group, and edges represent positive sentiment directed from one group member to another, based on questionnaires (12, 22). The TSPs of the three social networks were very similar (c = 0.92 to 0.96). Notably, their TSPs were quite close to the TSP of the WWW nets (c = 0.7 to 0.9). This suggests that WWW networks and social networks may be part of a superfamily. The tendency of social networks to display transitive interactions and transitive triads is well established (23). The similarity of WWW and social networks suggests that classical models of social structural organization (24) may also be used to understand WWW structure.

Texts can also be represented as networks (25). We studied word-adjacency networks in which each node represents a word and a directed connection occurs when one word directly follows the other in the text. The TSPs of texts from different languages and of different sizes were similar (Fig. 1 compares texts in English, French, Spanish, and Japanese). The main feature was the underrepresentation of triangle-shaped triads 7 to 13. This is related to the structure of languages in which words belong to categories and a word from one category tends to be followed by one from a different category (26). For example, among the most connected words are prepositions, which are usually followed by nouns or articles. Figure 1 also shows the TSP for a model bipartite graph (12) in which nodes belong to two groups and connections are formed between these groups and not within the groups. The bipartite model graph shows a TSP that resembles that of the word-adjacency networks.

The similarities between networks can also be visualized by looking at the correlation between the TSPs of different networks (Fig. 2). The correlations can be used to cluster the networks into distinct superfamilies (27).

Fig. 2.

The correlation coefficient matrix of the triad significance profiles for the directed networks in Fig. 1.

The TSPs display certain conservation relations between subgraph types. For example, networks with excess triangle-shaped triads tend to have a deficit of V-shaped triads. We find that there are several triad conservation rules in networks that conserve the degree sequences of single and mutual edges (SOM Text). As a result, the 13 values of the TSP are determined by only seven degrees of freedom. One can intuitively interpret these conservation laws in terms of reactions that convert V-shaped subgraphs into triangles, preserving the degrees of all participating nodes. Analysis of the reactions occurring in each network allows a compact description of the difference between networks and their randomized counterparts (SOM Text).

We now consider nondirected networks, in which edges have no specified directionality. Because nondirected networks have only two types of triads (V and triangle), we analyzed the profile of the six types of nondirected connected tetrads (four-node subgraphs). Unlike triads, the normalized Z scores of tetrads show a significant dependence on the network size. Therefore, instead of an SP based on Z scores, we use the abundance of each subgraph i relative to random networks: Math where ϵ ensures that |Δ| is not misleadingly large when the subgraph appears very few times in both the real and random networks (here ϵ = 4). The subgraph ratio profile (SRP) is the vector of Δi normalized to length 1: Math A nondirected network representing the electrical power grid of the western United States (4) showed an SRP with overrepresentation of tetrads 3 to 6 (Fig. 3). Nondirected networks of protein structure in which nodes are secondary-structure elements (α helices and β strands) and two nodes are connected if their distance is smaller than 10 Å have similar SRPs with overrepresented tetrads 3, 5, and 6. We compared these networks to model networks in which connections are determined on a lattice by geometrical proximity. In these geometrically constrained networks, the nodes are arrayed on a lattice (a line in one dimension, a plane in two dimensions, etc). Points that are closer than a distance R on the lattice are linked by an edge with probability p. Points at a distance greater than R are unlinked. The resulting subgraph distributions of these networks and their corresponding randomized versions can be analytically calculated. We find good agreement between the real-world protein structure and power-grid SRPs and the corresponding geometrical models with a similar number of nodes, edges, and clustering coefficient (Fig. 3).

Fig. 3.

The subgraph ratio profile (SRP) for various nondirected networks. The networks are as follows (12): (i) The electrical power grid of the western United States (4) (POWERGRID N = 4941, E = 6594) and a geometric model with similar clustering coefficient (GEO-MODEL-PG N = 5000, E = 7499). (ii) Networks of secondary-structure elements adjacency for several large proteins [structure based on the PDB database (www.rcsb.org/pdb/); the proteins (and their PDB ID) were 1A4J, an immunoglobulin (PROTEIN-STRUCTURE-1 N = 95, E = 213); 1EAW, a serine protease inhibitor (PROTEIN-STRUCTURE-2 N = 53, E = 123); and 1AOR, an oxidoreductase (PROTEIN-STRUCTURE-3 N = 99, E = 212)] and a geometric model with similar clustering coefficient (GEO-MODEL-PS N = 53, E = 136). (iii) The Internet at the autonomous system level (www.cosin.org) (AUTONOMOUS-SYSTEMS 1 to 6; N = 3015, 3522, 4517, 5357, 7956, 10515; E = 5156, 6324, 8376, 10328, 15943, 21455). (iv) Networks grown according to the preferential attachment BA model (3) with m = 1 or m = 10 edges per new node (BA m = 1, 10; N = 1000, 3000, 1000, 3000; E = 1000, 3000, 9901, 29901).

A distinct family of SRPs was found for the Internet at the level of nondirected connections between autonomous systems (AS, which are groups of computers within which networking is handled locally, but between which data flows over the public Internet). We studied the structures of the AS network sampled at different time points from 1997 to 2001 (12). The SRP of the AS networks was similar despite their different sizes. We find that the SRP of these networks is very different from that of the geometrically constrained superfamily, with tetrads 2 to 4 underrepresented and tetrad 5 overrepresented.

Finally, we studied the preferential attachment model of Barabasi and Albert (BA) (28), which is widely used to study network evolution. In the BA model, a nondirected network is grown node by node, connecting each new node to m existing ones. The probability of connecting to an existing node i increases with the number of edges it already has. We find that the SRP of these networks (Fig. 3) has different forms for m = 1, m = 2, and high m (29). This is because not all tetrad patterns can be created when m = 1 or 2. The present approach can thus be used to study model networks (28) and allow comparison of their local structure to that of real-world systems.

In the SOM Text, we also show the SRP of tetrads for the directed networks considered above. We find that generally tetrad profiles of related networks are similar. However, networks of different types in the same triad superfamily sometimes show distinct tetrad profiles, suggesting that higher order subgraph profiles can help refine network classification.

The present approach demonstrates that networks of the same type share not only network motifs, but also characteristic SPs with very similar proportions of motifs and antimotifs significance. In addition, we find several superfamilies of networks that have similar SPs even though they describe different systems of vastly different sizes. What do the superfamilies mean? One possibility is that the similarity in SP is accidental and that distinct evolutionary histories can lead to a similar local structure. Another possibility is that the different systems in a superfamily have similar key circuit elements because they evolved to perform similar tasks. The latter possibility leads to intriguing hypotheses that connect networks from different disciplines. This can allow for a better understanding of a given network based on results from other networks in the same superfamily. It would be interesting to use the present approach to map the relation between the function and the local structure of real-world networks.

Supporting Online Material

www.sciencemag.org.org/cgi/content/full/303/5663/1538/DC1

SOM Text

Figs. S1 to S5

Tables S1 and S2

References

References and Notes

View Abstract

Navigate This Article