Report

A Probabilistic Functional Network of Yeast Genes


Science  26 Nov 2004:
Vol. 306, Issue 5701, pp. 1555-1558
DOI: 10.1126/science.1099511

Abstract

A conceptual framework for integrating diverse functional genomics data was developed by reinterpreting experiments to provide numerical likelihoods that genes are functionally linked. This allows direct comparison and integration of different classes of data. The resulting probabilistic gene network estimates the functional coupling between genes. Within this framework, we reconstructed an extensive, high-quality functional gene network for Saccharomyces cerevisiae, consisting of 4681 (∼81%) of the known yeast genes linked by ∼34,000 probabilistic linkages comparable in accuracy to small-scale interaction assays. The integrated linkages distinguish true from false-positive interactions in earlier data sets; new interactions emerge from genes' network contexts, as shown for genes in chromatin modification and ribosome biogenesis.

Knowledge of the correct overall structures of gene networks will be invaluable for characterizing the complex roles of individual genes and the interplay between the many systems in a cell. Deriving gene networks from heterogeneous functional genomics data, however, is often difficult, because experiments such as microarray analyses of gene expression (1) or systematic protein interaction mapping measure different aspects of gene or protein associations. Affinity purification of proteins analyzed by mass spectrometry (2, 3), for instance, measures the tendency for proteins to be components of the same physical complex, although not necessarily to contact each other directly. By contrast, yeast two-hybrid assays may often indicate direct physical interactions (stable or transient) between proteins (4–6), whereas synthetic lethal screens (7) measure the tendency for genes to compensate for the loss of other genes. Further, these analyses range considerably in accuracy (8), and it is not clear a priori which measurements are correct. In spite of these differences, these data sets can, in principle, be computationally integrated, primarily by the reconstruction of network models of the relations between genes (9–12). Such network reconstructions have largely focused on physical protein interactions and so represent only a subset of biologically important relations.

We sought to construct a more accurate and extensive gene network by considering functional, rather than physical, associations, realizing that each experiment, whether genetic, biochemical, or computational, adds evidence linking pairs of genes, with associated error rates and degree of coverage. In this framework, gene-gene linkages are probabilistic summaries representing functional coupling between genes. Only some of the links represent direct protein-protein interactions; the rest are associations not mediated by physical contact, such as regulatory, genetic, or metabolic coupling, that, nonetheless, represent functional constraints satisfied by the cell during the course of the experiments. Working with probabilistic functional linkages allows many diverse classes of experiments to be integrated into a single, coherent network (Fig. 1), which enables the linkages themselves to be more reliably established.

Fig. 1.

The method for integrating functional genomics data. Functional genomics data sets are first benchmarked for their relative accuracies; these are used as weights in a probabilistic integration of the data. Several raw data sets already have intrinsic scoring schemes, indicated in parentheses (e.g., CC, correlation coefficients; P, probabilities, and MI, mutual information scores). These data are rescored with LLS, then integrated into an initial network (IntNet). Additional linkages from the genes' network contexts (ContextNet) are then integrated to create the final network (FinalNet), with ∼34,000 linkages between 4681 genes (ConfidentNet) scoring higher than the gold standard (small-scale assays of protein interactions). Hierarchical clustering of ConfidentNet defined 627 modules of functionally linked genes spanning 3285 genes (“ModularNet”), approximating the set of cellular systems in yeast.

We first developed a unified scoring scheme for linkages, based on a Bayesian statistics approach. Each experiment is evaluated for its ability to reconstruct known gene pathways and systems by measuring the likelihood that pairs of genes are functionally linked conditioned on the evidence, calculated as a log likelihood score (LLS):

$$\mathrm{LLS} = \ln\left(\frac{P(L|E)\,/\,{\sim}P(L|E)}{P(L)\,/\,{\sim}P(L)}\right)$$

where P(L|E) and ∼P(L|E) are the frequencies of linkages (L) observed in the given experiment (E) between annotated genes operating in the same pathway and in different pathways, respectively, whereas P(L) and ∼P(L) represent the prior expectations (i.e., the total frequency of linkages between all annotated yeast genes operating in the same pathway and operating in different pathways, respectively). Scores greater than zero indicate that the experiment tends to link genes in the same pathway, with higher scores indicating more confident linkages.
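The log likelihood score defined above can be illustrated with a minimal Python sketch. It assumes the score is the natural log of the ratio between the odds of same-pathway linkage given the experiment and the prior odds, and that the frequencies are estimated from raw pair counts. The function name, argument names, and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def log_likelihood_score(same_links, diff_links, prior_same, prior_diff):
    """Return ln( (P(L|E) / ~P(L|E)) / (P(L) / ~P(L)) ) from pair counts.

    same_links / diff_links: gene pairs linked by the experiment whose
        annotated genes fall in the same / in different pathways.
    prior_same / prior_diff: all annotated gene pairs in the same /
        in different pathways (the prior expectation).
    Frequencies are counts over a common total, so each ratio of
    frequencies reduces to a ratio of counts.
    """
    counts = (same_links, diff_links, prior_same, prior_diff)
    if any(c <= 0 for c in counts):
        raise ValueError("all counts must be positive to take the log ratio")
    evidence_odds = same_links / diff_links
    prior_odds = prior_same / prior_diff
    return math.log(evidence_odds / prior_odds)


# Hypothetical counts, for illustration only.
informative = log_likelihood_score(300, 100, 10_000, 1_000_000)
uninformative = log_likelihood_score(10, 1_000, 10_000, 1_000_000)
print(f"informative experiment:   LLS = {informative:.2f}")
print(f"uninformative experiment: LLS = {uninformative:.2f}")
```

In this sketch, an experiment that links same-pathway pairs no more often than the prior expectation scores zero, and an experiment enriched for same-pathway pairs scores above zero, in line with the interpretation given in the text.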

The log likelihood score can be interpreted as being proportional to the accuracy of the experiments and their ability to inform us about cellular pathways. Because each experiment is measured on a common benchmark, different experiments' scores are directly comparable, even when the natures of experiments are distinct (e.g., comparing genetic relations to physical interactions), and can be added to indicate confidence of combined evidence.

As scoring “benchmarks,” we tested the method against two primary annotation references: the Kyoto-based KEGG pathway database (13) and the experimentally observed yeast protein subcellular locations determined by genomewide green fluorescent protein (GFP)–tagging and microscopy (14). KEGG scores were used for integrating linkages, with the other benchmark withheld as an independent test of linkage accuracy. Cross-validated benchmarks and benchmarks based on the Gene Ontology (GO) (15) and KOG gene annotations (16) provided comparable results (17).

Seven large-scale yeast protein interaction experiments, including small-scale protein interaction assays collected from the Database of Interacting Proteins (DIP) (18), high-throughput mass spectrometry (2, 3), yeast two-hybrid (4–6), and synthetic lethal assays (7), showed similar rankings of accuracy across the four benchmark tests (Fig. 2; fig. S8, A and B). These tests indicate that small-scale experiments (our “gold standard” for high accuracy linkages) have been the most accurate of all, whereas the large-scale experiments vary considerably in quality. Even the least accurate experiments score better than random linkages (for which LLS = 0), highlighting the merit of this method: weak evidence from multiple sources can be combined to provide strong overall evidence for a linkage.

Fig. 2.

Benchmarked accuracy and extent of functional genomics data sets and the integrated networks. A critical point is the comparable performance of the networks on distinct benchmarks, which assess the tendencies for linked genes to share (A) KEGG pathway annotations (13) or (B) protein subcellular locations (14). Each x axis indicates the percentage of protein-encoding yeast genes provided with linkages by the plotted data; each y axis indicates relative accuracy, measured as the agreement of the linked genes' annotations on that benchmark. The gold standards of accuracy (red star) for calibrating the benchmarks are small-scale protein-protein interaction data from DIP (18). Colored markers indicate experimental linkages; gray markers, computational. The initial integrated network (lower black line), trained using only the KEGG benchmark, has measurably higher accuracy than any individual data set on the subcellular localization benchmark; adding context-inferred linkages in the final network (upper black line) further improves the size and accuracy of the network [see (17) for additional benchmarks].

Functional linkages were first inferred on the basis of genes' mRNA coexpression across each of 12 sets of DNA microarray experiments (497 microarray experiments in total), then integrated via a rank-weighted sum of log likelihood scores (17) to create the combined set of coexpression-derived linkages. To construct the initial integrated network (“IntNet,” Fig. 1), we combined eight categories of data, including the physical and genetic interaction data sets, mRNA coexpression linkages, functional linkages from literature mining (17), and computational linkages from two comparative genomics methods, Rosetta stone (gene-fusion) linkages (19, 20) and phylogenetic profiles (21). Integrating functional genomics data also allowed discovery of additional relations between genes linked, in turn, to a common set of genes [“ContextNet” (17, 22–25)]; these linkages were scored and integrated as above to construct the final gene network (“FinalNet,” Fig. 3A). The final network has ∼34,000 linkages at an accuracy comparable to the gold standard small-scale interaction assays (Fig. 2), which provides linkages (“ConfidentNet”) for more than 4681 yeast genes (∼81% of the yeast proteome). The network is reasonably distinct from networks of physical interacting proteins [e.g., sharing only ∼16% of linkages with (11); see (17)].
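A rank-weighted sum of log likelihood scores can be sketched as follows. The exact weighting is given in the supporting material (17) and is not reproduced in this text, so the scheme below is an assumption chosen for illustration: scores for one gene pair are sorted from highest to lowest, and the score at rank i (starting from 0) is divided by decay**i. The function name and the `decay` parameter are hypothetical.

```python
def rank_weighted_sum(scores, decay=2.0):
    """Combine per-data-set LLS values for one gene pair into one score.

    Assumed scheme (not taken from the paper): sort scores in
    descending order and down-weight the i-th ranked score by decay**i,
    so the strongest evidence counts in full and weaker evidence adds
    progressively less. decay=1.0 reduces to a plain sum.
    """
    if decay < 1.0:
        raise ValueError("decay must be >= 1.0")
    ranked = sorted(scores, reverse=True)
    return sum(score / decay ** rank for rank, score in enumerate(ranked))


# Hypothetical LLS values for one gene pair seen in three data sets.
pair_scores = [1.0, 4.0, 2.0]
print(rank_weighted_sum(pair_scores))             # weighted combination
print(rank_weighted_sum(pair_scores, decay=1.0))  # plain sum of scores
```

Down-weighting lower-ranked scores is one way to limit double counting when data sets are not independent, which is a plausible reason to prefer a weighted sum over a plain sum for correlated evidence such as overlapping microarray sets.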

Fig. 3.

Features of integrated networks. The final network shows extensive clustering of genes into modules, evident in the “clumping” (A). At an intermediate degree of clustering that maximizes cluster size and functional coherence (B), 564 (of 627) modules are shown connected by the 950 strongest intermodule linkages. Module colors and shapes indicate associated functions, as defined by Munich Information Center for Protein Sequencing (MIPS) (34), with sizes proportional to the number of genes, and connections inversely proportional to the fraction of genes linking the clusters. Portions of the final, confident gene network are shown for (C) DNA damage response and/or repair, where modularity gives rise to gene clusters, indicated by similar colors (see also fig. S13), and (D) chromatin remodeling, with several uncharacterized genes (red labels). Networks are visualized with Large Graph Layout (LGL) (35).

Adding context-inferred linkages increased clustering of genes (fig. S7, C and D), which produced a highly modular gene network with well-defined subnetworks. We expected these gene clusters to reflect gene systems and modules (26–30). We could therefore generate a simplified view of the major trends in the network (Fig. 3B) by clustering genes of ConfidentNet according to their connectivities (17). Of the 4681 genes, 3285 (∼70.2%) were grouped into 627 clusters, reflecting the high degree of modularity. Genes' functions within each cluster are highly coherent (fig. S12), and with 2 to 154 genes per cluster (∼5 genes per cluster on average), the clusters effectively capture typical gene pathways and/or systems. A region of the modular network centered on the DNA damage response and repair systems is shown in Fig. 3C. The network is clearly hierarchical: Individual clusters represent distinct systems related to DNA damage response and/or repair; these clusters are in turn connected to modules of cell cycle regulatory genes and chromatin silencing (fig. S13), functionally linked to the DNA damage response and/or repair system. [For cluster descriptions and interactive three-dimensional visualizations, see (17).]

One can infer individual genes' functions on the basis of linked neighbors. For example, seven uncharacterized genes are implicated in chromatin remodeling (Fig. 3D). All but 1 of the 18 linkages made by these genes arise from the comparative genomics analysis or from the network context methods, which represent examples of the insights that arise only after data integration. Three of the uncharacterized proteins are predicted by sequence homology to have helicase activity, which is reasonable for a relation to chromatin remodeling; four of these proteins localize to the nucleus, further supporting their association. After this network's construction, one gene, VID21, was implicated in chromatin modification as a component of the NuA4 histone acetyl transferase.
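Inferring a gene's function from its linked neighbors can be sketched as a weighted vote. This is a generic guilt-by-association illustration, not the procedure used for the analysis above: each annotated neighbor votes for its own functional terms, weighted by the linkage score. The gene identifiers, scores, and annotations are placeholders, and the function name is hypothetical.

```python
from collections import defaultdict


def predict_functions(gene, links, annotations):
    """Rank candidate functions for `gene` by summed linkage score.

    links: dict mapping an (gene_a, gene_b) tuple to a linkage score.
    annotations: dict mapping a gene to an iterable of function terms.
    Returns (term, total_score) pairs, highest total first.
    """
    votes = defaultdict(float)
    for (gene_a, gene_b), score in links.items():
        if gene not in (gene_a, gene_b):
            continue  # linkage does not involve the query gene
        neighbor = gene_b if gene_a == gene else gene_a
        for term in annotations.get(neighbor, ()):
            votes[term] += score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Placeholder network: geneX is uncharacterized.
links = {
    ("geneX", "geneA"): 3.0,
    ("geneX", "geneB"): 2.0,
    ("geneC", "geneX"): 1.5,
    ("geneA", "geneB"): 4.0,
}
annotations = {
    "geneA": ["chromatin remodeling"],
    "geneB": ["chromatin remodeling"],
    "geneC": ["ribosome biogenesis"],
}
for term, total in predict_functions("geneX", links, annotations):
    print(f"{term}: {total}")
```

Because the linkage scores are on a common scale, summing them across neighbors gives a simple ranking of candidate functions; supporting evidence such as sequence homology or subcellular localization, as used in the text, would still be needed before trusting any single prediction.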

The function of the RNA helicase PRP43, previously thought to be involved only in pre-mRNA splicing and implicated in lariat-intron release from the spliceosome (31), is also clarified in the network. PRP43 is linked most strongly to genes of ribosome biogenesis and rRNA processing. The tightest links are to ERB1, RRB1, NUG1, LHP1, and PWP1, the first three of which are confirmed ribosome biogenesis factors. These links derive only from the coexpression and context methods [with a single exception from (3)]; data integration is therefore critical. The association of PRP43 with ribosome biogenesis has now been experimentally validated (32): the growth defect conferred by a PRP43 conditional lethal mutation corresponds to a rapid and major defect in rRNA processing. These data indicate that rRNA processing is the essential function of PRP43, and it joins a growing group of RNA helicases with two or more distinct functions.

The probabilistic gene network we describe integrates evidence from diverse sources to reconstruct an accurate network, by estimating the functional coupling among yeast genes, and provides a view of the relations between yeast proteins distinct from their physical interactions. The application of this strategy to other organisms, such as to the human genome, is conceptually straightforward: (i) assemble benchmarks for measuring the accuracy of linkages between human genes based on properties shared among genes in the same systems, (ii) assemble gold standard sets of highly accurate interactions for calibrating the benchmarks, and (iii) benchmark functional genomics data for their ability to correctly link human genes, then integrate the data as described. New data can be incorporated in a simple manner [e.g., see (33)], serving to reinforce the correct linkages. Thus, the gene network will ultimately converge by successive approximation to the correct structure simply by continued addition of functional genomics data in this framework.

Supporting Online Material

www.sciencemag.org/cgi/content/full/306/5701/1555/DC1

Materials and Methods

SOM Text

Figs. S1 to S14

Tables S1 to S4

Supporting Data S1 to S5

References and Notes
