## Resolving a network of hubs

Graphs are a pervasive tool for modeling and analyzing network data throughout the sciences. Benson *et al.* developed an algorithmic framework for studying how complex networks are organized by higher-order connectivity patterns (see the Perspective by Pržulj and Malod-Dognin). Motifs in transportation networks reveal hubs and geographical elements not readily achievable by other methods. A motif previously suggested as important for neuronal networks is part of a “rich club” of subnetworks.

## Abstract

Networks are a fundamental tool for understanding and modeling complex systems in physics, biology, neuroscience, engineering, and social science. Many networks are known to exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. However, higher-order organization of complex networks—at the level of small network subgraphs—remains largely unknown. Here, we develop a generalized framework for clustering networks on the basis of higher-order connectivity patterns. This framework provides mathematical guarantees on the optimality of obtained clusters and scales to networks with billions of edges. The framework reveals higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structure in transportation networks. Results show that networks exhibit rich higher-order organizational structures that are exposed by clustering based on higher-order connectivity patterns.

Networks are a standard representation of data throughout the sciences, and higher-order connectivity patterns are essential to understanding the fundamental structures that control and mediate the behavior of many complex systems (*1*–*7*). The most common higher-order structures are small network subgraphs, which we refer to as network motifs (Fig. 1A). Network motifs are considered building blocks for complex networks (*1*, *8*). For example, feed-forward loops (Fig. 1A, *M*_{5}) have proven fundamental to understanding transcriptional regulation networks (*9*); triangular motifs (Fig. 1A, *M*_{1}–*M*_{7}) are crucial for social networks (*4*); open bidirectional wedges (Fig. 1A, *M*_{13}) are key to structural hubs in the brain (*10*); and two-hop paths (Fig. 1A, *M*_{8}–*M*_{13}) are essential to understanding air traffic patterns (*5*). Although network motifs have been recognized as fundamental units of networks, the higher-order organization of networks at the level of network motifs largely remains an open question.

Here, we use higher-order network structures to gain new insights into the organization of complex systems. We develop a framework that identifies clusters of network motifs. For each network motif (Fig. 1A), a different higher-order clustering may be revealed (Fig. 1B), which means that different organizational patterns are exposed, depending on the chosen motif.

Conceptually, given a network motif *M*, our framework searches for a cluster of nodes *S* with two goals. First, the nodes in *S* should participate in many instances of *M*. Second, the set *S* should avoid cutting instances of *M*, which occurs when only a subset of the nodes from a motif are in the set *S* (Fig. 1B). More precisely, given a motif *M*, the higher-order clustering framework aims to find a cluster (defined by a set of nodes *S*) that minimizes the following ratio:(1)where denotes the remainder of the nodes (the complement of *S*), cut* _{M}*(

*S*,) is the number of instances of motif

*M*with at least one node in

*S*and one in , and vol

*(*

_{M}*S*) is the number of nodes in instances of

*M*that reside in

*S*. Equation 1 is a generalization of the conductance metric in spectral graph theory, one of the most useful graph partitioning scores (

*11*). We refer to φ

*(*

_{M}*S*) as the motif conductance of

*S*with respect to

*M*.

Finding the exact set of nodes *S* that minimizes the motif conductance is computationally infeasible (*12*). To approximately minimize Eq. 1 and, hence, to identify higher-order clusters, we developed an optimization framework that provably finds near-optimal clusters [supplementary materials (*13*)]. We extend the spectral graph clustering methodology, which is based on the eigenvalues and eigenvectors of matrices associated with the graph (*11*), to account for higher-order structures in networks. The resulting method maintains the properties of traditional spectral graph clustering: computational efficiency, ease of implementation, and mathematical guarantees on the near-optimality of obtained clusters. Specifically, the clusters identified by our higher-order clustering framework satisfy the motif Cheeger inequality (*14*), which means that our optimization framework finds clusters that are at most a quadratic factor away from optimal.

The algorithm (illustrated in Fig. 1C) efficiently identifies a cluster of nodes *S* as follows:*•* Step 1: Given a network and a motif *M* of interest, form the motif adjacency matrix *W _{M}* whose entries (

*i*,

*j*) are the co-occurrence counts of nodes

*i*and

*j*in the motif

*M*: (

*W*)

_{M}*= number of instances of*

_{ij}*M*that contain nodes

*i*and

*j.*(2)

*•*Step 2: Compute the spectral ordering σ of the nodes from the normalized motif Laplacian matrix constructed via

*W*(

_{M}*15*).

*•*Step 3: Find the prefix set of σ with the smallest motif conductance (the argument of the minimum), formally,

*S*:= arg min

*φ*

_{r}*(*

_{M}*S*), where

_{r}*S*= {σ

_{r}_{1}, …, σ

*}.*

_{r}For triangular motifs, the algorithm scales to networks with billions of edges and, typically, only takes several hours to process graphs of such size. On smaller networks with hundreds of thousands of edges, the algorithm can process motifs up to size 9 (*13*). Although the worst-case computational complexity of the algorithm for triangular motifs is Θ(*m*^{1.5}), where *m* is the number of edges in the network, in practice, the algorithm is much faster. By analyzing 16 real-world networks where the number of edges *m* ranges from 159,000 to 2 billion, we found the computational complexity to scale as Θ(*m*^{1.2}). Moreover, the algorithm can easily be parallelized, and sampling techniques can be used to further improve performance (*16*).

The framework can be applied to directed, undirected, and weighted networks, as well as motifs (*13*). Moreover, it can also be applied to networks with positive and negative signs on the edges, which are common in social networks (friend versus foe or trust versus distrust edges) and metabolic networks (edges signifying activation versus inhibition) (*13*). The framework can be used to identify higher-order structure in networks where domain knowledge suggests the motif of interest. In the supplementary materials, we also show that when a domain-specific higher-order pattern is not known in advance, the framework can also serve to identify which motifs are important for the modular organization of a given network (*13*). Such a general framework allows complex higher-order organizational structures in a number of different networks by using individual motifs and sets of motifs. The framework and mathematical theory immediately extend to other spectral methods, such as localized algorithms that find clusters around a seed node (*17*) and algorithms for finding overlapping clusters (*18*). To find several clusters, one can use embeddings from multiple eigenvectors and *k*-means clustering (*13*, *19*) or can apply recursive bipartitioning (*13*, *20*).

The framework can serve to identify a higher-order modular organization of networks. We apply the higher-order clustering framework to the *Caenorhabditis elegans* neuronal network, where the four-node “bi-fan” motif (Fig. 2A) is overexpressed (*1*). The higher-order clustering framework then reveals the organization of the motif within the *C. elegans* neuronal network. We find a cluster of 20 neurons in the frontal section with low bi-fan motif conductance (Fig. 2B). The cluster shows a way that nictation is controlled. Within the cluster, ring motor neurons (RMEL, -V, or -R), proposed pioneers of the nerve ring (*21*), propagate information to inner labial sensory neurons, regulators of nictation (*22*), through the neuron RIH (Fig. 2C). Our framework contextualizes the importance of the bi-fan motif in this control mechanism.

The framework also provides new insights into network organization beyond the clustering of nodes based only on edges. Results on a transportation reachability network (*23*) demonstrate how it finds the essential hub interconnection airports (Fig. 3). These appear as extrema on the primary spectral direction (Fig. 3C) when two-hop motifs (Fig. 3A) are used to capture highly connected nodes and nonhubs. [The first spectral coordinate of the normalized motif Laplacian embedding was positively correlated with the airport city’s metropolitan population with Pearson correlation 99% confidence interval (0.33, 0.53).] The secondary spectral direction identified the west-east geography in the North American flight network [it was negatively correlated with the airport city’s longitude with Pearson correlation 99% confidence interval (–0.66, –0.50)]. On the other hand, edge-based methods conflate geography and hub structure. For example, Atlanta, a large hub, is embedded next to Salina, a nonhub, with an edge-based method (Fig. 3D).

Our higher-order network clustering framework unifies motif analysis and network partitioning—two fundamental tools in network science—and reveals new organizational patterns and modules in complex systems. Prior efforts along these lines do not provide worst-case performance guarantees on the obtained clustering (*24*) and do not reveal which motifs organize the network (*25*) but rely on expanding the size of the network (*26*, *27*). Theoretical results in the supplementary materials (*13*) also explain why classes of hypergraph partitioning methods are more general than previously assumed and how motif-based clustering provides a rigorous framework for the special case of partitioning directed graphs. Finally, the higher-order network clustering framework is generally applicable to a wide range of network types, including directed, undirected, weighted, and signed networks.

## Supplementary Materials

www.sciencemag.org/content/353/6295/163/suppl/DC1

Materials and Methods

Supplementary Text

Figs. S1 to S13

Tables S1 to S12

## References and Notes

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵Minimizing ϕ
(_{M}*S*) is nondeterministic polynomial-time hard (NP-hard), which follows from the NP-hardness of the traditional definition of conductance (*28*). - ↵See the supplementary materials on
*Science*Online. - ↵Formally, when the motif has three nodes, the selected cluster
*S*satisfies , where is the smallest motif conductance of any possible node set*S*. This inequality is proved in the supplementary materials. - ↵The normalized motif Laplacian matrix is
*L*=_{M}*D*^{−1/2}(*D*− W)_{M}*D*^{−1/2}, where*D*is a diagonal matrix with the row-sums of*W*on the diagonal [_{M}*D*= Σ_{ii}(_{j}*W*)_{M}], and_{ij}*D*^{−1/2}is the same matrix with the inverse square roots on the diagonal . The spectral ordering σ is the by-value ordering of*D*^{−1/2}*z*, where*z*is the eigenvector corresponding to the second smallest eigenvalue of*L*, i.e., σ_{M}is the index of_{i}*D*^{−1/2}*z*with the*i*th smallest value. - ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

**Acknowledgments:**The authors thank R. Sosič for insightful comments. A.R.B. was supported by a Stanford Graduate Fellowship; D.F.G. was supported by NSF (CCF-1149756 and IIS-1422918), J.L. was supported by NSF (IIS-1149837 and CNS-1010921), trans-NIH initiative Big Data to Knowledge (BD2K), Defense Advanced Research Projects Agency [XDATA and Simplifying Complexity in Scientific Discovery (SIMPLEX)], Boeing, Lightspeed, and Volkswagen. Software implementations and the data sets used to obtain the results in this manuscript are available at http://snap.stanford.edu/higher-order/