Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling

See allHide authors and affiliations

Science  04 Jul 2003:
Vol. 301, Issue 5629, pp. 102-105
DOI: 10.1126/science.1081900


The complexity of cellular gene, protein, and metabolite networks can hinder attempts to elucidate their structure and function. To address this problem, we used systematic transcriptional perturbations to construct a first-order model of regulatory interactions in a nine-gene subnetwork of the SOS pathway in Escherichia coli. The model correctly identified the major regulatory genes and the transcriptional targets of mitomycin C activity in the subnetwork. This approach, which is experimentally and computationally scalable, provides a framework for elucidating the functional properties of genetic networks and identifying molecular targets of pharmacological compounds.

Efforts to systematically define the organization and function of gene, protein, and metabolite networks include experimental and computational methods for identifying molecular interactions (13), global structural properties (4, 5), metabolic limits (6), and regulatory modules and characteristics (79). These methods have provided valuable insights in many applications, but they often provide only structural information or require extensive quantitative information, which is not generally available, particularly for larger regulatory networks. In previous computational studies (1012), alternative methods have been proposed that would enable rapid deduction of network connectivity and functional properties solely from temporal gene-expression data. However, the acquisition of adequate temporal expression data remains difficult, and the practical utility of such approaches has not been determined.

Here, we present a rapid and scalable method that enables construction of a first-order predictive model of a gene and protein regulatory network using only steady-state expression measurements and no previous information on the network structure or function. We use multiple linear regression to determine the model from RNA expression changes resulting from a set of steady-state transcriptional perturbations. The model can be used to identify the regulatory role of individual genes in the network, useful control points in the network, and genes that directly mediate a pharmaceutical compound's bioactivity in the cell. The method, called network identification by multiple regression (NIR), is derived from a branch of engineering called system identification (13), in which a model of the connections and functional relations between elements in a network is inferred from measurements of system dynamics (e.g., the response of genes and proteins to external perturbations).

To apply a system-identification method, we assume that the behavior of a gene, protein, and metabolite regulatory network can be modeled by a system of nonlinear differential equations (14, 15). Near a steady-state point (e.g., when gene expression does not change substantially over time), such a nonlinear system may be approximated to the first order by a linear system of equations describing the rate of accumulation of each network species resulting from a transcriptional perturbation: Math(1) where x is a vector representing the concentrations of N RNAs, proteins, and metabolites in the network; dx/dt represents the rate of accumulation of the species in x; u is a vector representing an external perturbation to the rate of accumulation of the species in x; and A, the network model, is an N × N matrix of coefficients describing the regulatory interactions between the species in x. Next, we identify the coefficients of A using only RNA expression changes that result from steadystate transcriptional perturbations. Because we measure RNA but not protein or metabolite species in this study, variables representing proteins and metabolites are not explicitly represented in the network model. Thus, regulatory connections in the model are not, in general, physical connections; rather, they represent effective functional relations between transcripts.

Under the steady-state assumption (dx/dt = 0), Eq. 1 reduces to Ax = –u. To identify the network model, we could, in principle, make N distinct perturbations, u, to the RNAs in a particular network, recover N sets of RNA concentrations, x, and solve directly for A (16). However, in larger networks it may be impractical to perform a full set of N perturbation experiments, and thus our problem would remain underdetermined. Even with a full set of perturbation experiments, RNA expression data are prone to high levels of measurement noise, making the direct solution unreliable. To overcome this problem, we assume that most biochemical networks are not fully connected (17, 18, 19), that is, some of the coefficients of A are zero. Thus, by assuming a maximum of k non-zero regulatory inputs to each gene (where k < N), we can transform our underdetermined problem into an overdetermined problem, making it robust both to measurement noise and incomplete data sets.

We next apply multiple linear regression (20) to calculate the model coefficients for each possible combination of k regulatory inputs (k coefficients) per gene. The k coefficients for each gene that fit the expression data with the smallest error are chosen as the best approximation of A. Using the standard errors on the RNA measurement data, the algorithm also computes the statistical significance of each recovered coefficient of A and the overall fit of A. A complete description of the algorithm is provided in the supporting online text.

We applied the NIR method to a nine-transcript subnetwork of the SOS pathway in E. coli (the “test network”). The SOS pathway, which regulates cell survival and repair after DNA damage, involves the lexA and recA genes, more than 30 genes directly regulated by lexA and recA, and tens or possibly hundreds of indirectly regulated genes (2125). We chose the nine transcripts in our test network (Fig. 1) to include the principal mediators of the SOS response (lexA and recA), four other regulatory genes with known involvement in the SOS response (ssb, recF, dinI, and umuDC), and three sigma factor genes (rpoD, rpoH, and rpoS) whose regulatory role in the SOS response is not fully understood. Because much of the regulatory structure of our test network has been previously mapped, it serves as an excellent subject for the validation of our method. In addition, it serves as an entry point for further study of the SOS pathway, which regulates genes associated with important protective pathways relevant to antibiotic resistance (23, 26).

Fig. 1.

Diagram of interactions in the SOS network. DNA lesions caused by mitomycin C (MMC) (blue hexagon) are converted to single-stranded DNA during chromosomal replication. Upon binding to ssDNA, the RecA protein is activated (RecA*) and serves as a coprotease for the LexA protein. The LexA protein is cleaved, thereby diminishing the repression of genes that mediate multiple protective responses. Boxes denote genes, ellipses denote proteins, hexagons indicate metabolites, arrows denote positive regulation, filled circles denote negative regulation. Red emphasis denotes the primary pathway by which the network is activated after DNA damage.

We applied a set of nine transcriptional perturbations to the test network in E. coli cells (27). In each perturbation, we overexpressed a different one of the nine genes in the test network with an arabinose-controlled episomal expression plasmid (fig. S1). We grew the cells in batch cultures under constant physiological conditions to their steady state (∼5.5 hours after the addition of arabinose). Cells were maintained in the exponential growth phase throughout all experiments. For all nine transcripts, we used quantitative real-time polymerase chain reaction (qPCR) to measure the change in expression relative to that in unperturbed cells. For each transcript, two qPCR reactions from each of eight replicate cultures were obtained, and qPCR data were filtered to eliminate aberrant or inefficient reactions (27). The mean expression changes for each transcript in each experiment (x in Eq. 1) were calculated (27), and only those changes that were greater than their standard error were accepted as significant and used for further analysis (that is, xi = 0 if |xi| < Sxi, where xi is the mean expression change and Sxi is the standard error for transcript i).

Using the nine-perturbation expression data set (the training set, tables S6 to S8) and the NIR algorithm described above, we solved Eq. 1 for A, the model of the regulatory interactions in the test network (table S1). The number of input connections per gene (k) was chosen such that the solved model provided a statistically significant fit (as determined by an F test), was dynamically stable, and provided the best balance between coverage and false-positives (27). To evaluate the performance of the algorithm, we determined the number of connections in the test network that were correctly resolved in the model, A. A resolved connection was considered correct if there exists a known RNA, protein, or metabolite pathway between the two transcripts and if the sign of the net effect of regulatory interaction (that is, activating or inhibiting) is correct, as determined by the currently known network in Fig. 1.

The algorithm correctly identified the key regulatory connections in the network. For example, the model correctly shows that recA positively regulates lexA and its own transcription, whereas lexA negatively regulates recA and its own transcription. In addition, the model correctly identified recA and lexA as having the greatest regulatory influence on the other genes in the test network (table S5). Overall, the performance (coverage and false-positives) of the NIR algorithm was equivalent to that expected on the basis of simulations of 50 random nine-gene networks (Fig. 2). Moreover, for the subnetwork of six genes typically considered part of the SOS network (recA, lexA, ssb, recF, dinI, and umuDC), the performance of the algorithm improved substantially. This suggests that some of the false-positives identified for the three sigma factors in our model (rpoD, rpoH, and rpoS) may be true connections mediated by genes not included in our test network. Furthermore, our simulation results suggest that even small reductions in the measurement noise observed in our experiments [mean noise level = mean(Sxi)/mean(xi) = 68%] could lead to substantial improvements in coverage and errors in the network model (Fig. 2). Reductions in experimental noise could be achieved with improved RNA measurement technologies such as competitive PCR coupled with matrix-assisted laser desorption/ionization–time-of-flight (MALDI-TOF) mass spectrometry (28).

Fig. 2.

NIR algorithm performance. (A) Coverage (correctly identified connections/total true connections) and (B) false-positives (incorrectly identified connections/total identified connections) were calculated for SOS models solved with a nine-perturbation training set (main panels) and a seven-perturbation training set (insets). Error bars are not included in the insets for clarity. Experiment (open triangles): Coverage and false-positives were calculated by comparing the solved model (table S1) to connections described in the literature (table S4 and Fig. 1). Because a nonsignificant fit was obtained for recF, the weights for inputs to recF were set to zero in the model. The mean noise observed on the mRNA measurements in our experiments was 68% (noise = Sxx, where Sx is the standard deviation of the mean of x, μx). Simulations (filled squares): Simulated perturbations were applied to 50 randomly connected networks of nine genes with an average of five regulatory inputs per gene. For each perturbation to each random network, the mRNA expression changes at steady state were calculated. The noise on the perturbations was set to 20%, equivalent to that observed on perturbations in our experiments. The noise on the mRNA concentrations was varied from 10 to 70%.

We also tested the performance of the NIR algorithm with an incomplete training set consisting of perturbations to only seven of the nine genes. We solved for network models using all 36 combinations of seven perturbations and found that the algorithm also performed comparably to simulations, albeit with slightly reduced performance in comparison with the full nine-perturbation training set (Fig. 2).

Much of the value of the network model lies in its predictive power, that is, its ability to predict expression changes and network behaviors that fall outside the training data set used to solve the model. Here, we demonstrate its predictive power by using it to distinguish the transcripts that are directly targeted by a pharmacological compound (the compound's mode of action) from transcripts that exhibit secondary responses to the expression changes of the direct targets. Thus, the direct targets represent the minimal subset of transcripts in the model that will produce the observed expression pattern if externally perturbed. Because proteins and metabolites are not measured in this study, the compound may not physically interact with transcripts identified as direct targets but instead may interact with protein or metabolite intermediates that are not explicitly represented in the network model.

To identify direct transcriptional targets of a compound, we first measure RNA expression changes (xp) resulting from treatment with the compound. The activity of the compound is treated as a set of unknown transcriptional perturbations (up) that produce the measured expression changes. From Eq. 1, we calculate the unknown perturbations as up = –Axp (27). The direct transcriptional targets of a compound are those that exhibit statistically significant values in up. Calculation of the statistical significance of up is described in the supporting online text.

We first applied our scheme to RNA expression changes that result from the simultaneous controlled perturbation of the lexA and recA genes. This perturbation might represent the effects of a hypothetical compound and serves as a well-defined input for validating the predictive power of our model. Although five of the nine test-network genes responded with statistically significant transcriptional changes (Fig. 3A), application of our network model correctly identified only lexA and recA as the perturbed genes (2/2 = 100% coverage, 7/7 = 100% specificity) (Fig. 3B).

Fig. 3.

Cells were perturbed either with a lexA-recA double perturbation or with MMC. The mean relative expression changes (x), normalized by their standard deviations (Sx), are illustrated for the lexA-recA double perturbation (A) and the MMC perturbation (C). Arrows indicate the genes known to be targeted by the perturbation. Predicted perturbations in the lexA-recA experiment (B) and the MMC experiment (D) were calculated from the expression data in (A) and (C) using the SOS model solved with the nine-perturbation training set (27). The predicted perturbations to each gene (u) were normalized by their standard deviations (Su) to determine statistical significance. In all panels, black bars indicate statistically significant and gray bars indicate statistically nonsignificant. Horizontal lines denote significance levels: P = 0.3 (dashed), P = 0.1 (solid).

We next applied a mitomycin C (MMC) perturbation to determine whether our scheme could identify the transcriptional targets of MMC bioactivity in the SOS network. Perturbed cells were grown in 0.75 μg/ml MMC, and transcriptional changes were measured relative to those in control cells grown in the normal baseline condition (0.5 μg/ml MMC). All genes in the test network showed statistically significant transcriptional increases (Fig. 3C). When we applied the network model to the expression data, we correctly identified recA as the transcriptional target of MMC bioactivity, with only one false-positive, umuDC (1/1 = 100% coverage, 7/8 = 88% specificity) (Fig. 3D). Moreover, recA was identified at a higher significance level (P ≤ 0.09) than was umuDC (P ≤ 0.22), suggesting that it is the more likely, if not the only, true target. It is also possible, however, that umuDC interacts with gene, protein, or metabolite targets of the compound that are not represented in our model. Therefore, umuDC may have been correctly identified as a target in our model. We also found that a model recovered with a seven-perturbation training set that excludes the lexA and recA training perturbations performs nearly as well as the model recovered with a full training set (see supporting online text and fig. S3).

The NIR method, a form of system identification based on multiple linear regression analysis of steady-state transcription profiles, provides a framework for rapidly elucidating the structure and function of genetic networks with no prior information. The method is robust to high levels of measurement noise, scalable for larger biochemical networks (27), and equally applicable to transcript, protein, and metabolite activity data. With advances in high-throughput measurement methods, it may soon be feasible to include protein and metabolite measurements on a large scale. The model recovered with this method enables the identification of key properties of the network, such as the major regulatory genes, and it provides a mechanism for efficiently identifying the mode of action of uncharacterized pharmacological compounds. These capabilities may facilitate optimization of cellular processes for biotechnology applications and the development of novel classes of therapeutic drugs that account for and utilize the complex regulatory properties of genetic networks.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 to S5

Tables S1 to S8


References and Notes

Stay Connected to Science

Navigate This Article