Research Article

Co-regulatory networks of human serum proteins link genetics to disease

See allHide authors and affiliations

Science  24 Aug 2018:
Vol. 361, Issue 6404, pp. 769-773
DOI: 10.1126/science.aaq1327

The blood proteome in disease

Understanding the function of human blood serum proteins in disease has been limited by difficulties in monitoring their production, accumulation, and distribution. Emilsson et al. investigated human serum proteins of more than 5000 Icelanders over the age of 65. The composition of blood serum includes a complex regulatory network of proteins that are globally coordinated across most or all tissues. The authors identified modules and functional groups associated with disease and health outcomes and were able to link genetic variants to complex diseases.

Science, this issue p. 769

Abstract

Proteins circulating in the blood are critical for age-related disease processes; however, the serum proteome has remained largely unexplored. To this end, 4137 proteins covering most predicted extracellular proteins were measured in the serum of 5457 Icelanders over 65 years of age. Pairwise correlation between proteins as they varied across individuals revealed 27 different network modules of serum proteins, many of which were associated with cardiovascular and metabolic disease states, as well as overall survival. The protein modules were controlled by cis- and trans-acting genetic variants, which in many cases were also associated with complex disease. This revealed co-regulated groups of circulating proteins that incorporated regulatory control between tissues and demonstrated close relationships to past, current, and future disease states.

Human serum contains a dynamic flux of proteins synthesized by tissues and cells of the body (1). The secretome is complex and likely involves 15% or more of all proteins (2). Secreted proteins and circulating blood cells mediate global homeostasis via intercellular communication, immune responses, vascular and endothelial cell function, tissue remodeling, fluid exchange, and nutrient assimilation (3). Defined functional roles for many individual proteins in circulation remains to be ascribed owing to our limited ability to monitor their production, accumulation, and distribution in both model systems and humans.

Heterochronic parabiosis experiments that surgically joined the circulation of young and old mice showed a system-wide effect on the regenerative capacity of organs (4, 5). Thus, serum proteins and other circulating factors may directly regulate complex processes such as aging and the development of common chronic diseases. In contrast to monogenic diseases, complex diseases are caused not by proteins acting alone but instead by highly interacting protein networks that may result from genetic and environmental perturbations and ultimately drive physiological states toward disease (69). Because blood mediates coordination between nonadjacent tissues, it is of the highest interest to understand if and how this regulation occurs via serum proteins and their networks.

A custom-designed aptamer-based multiplex proteomic platform

To date, high-throughput detection and quantification of serum proteins in a large human population have been hampered by the limitations of proteomic profiling technologies. The Slow-Off rate Modified Aptamer (SOMAmer)–based technology has emerged as a proteomic profiling platform (SOMAscan) with high sample throughput and sensitivity of detection (10, 11). We designed an expanded custom version of this platform to include proteins known or predicted to be found in the extracellular milieu, including the predicted extracellular domains of single- and certain multipass transmembrane proteins. This resulted in an updated array of 5034 SOMAmers, 4783 of which recognize 4137 individual human proteins (table S1)—i.e., some proteins are targeted by more than one aptamer, whereas the rest recognize nonhuman targets. We applied the platform to a large population-based sampling of 5457 participants in the AGES Reykjavik study (12), a prospective study of deeply phenotyped and genotyped individuals older than 65 years of age. Table S2 reports baseline characteristics of the study population, while fig. S1 shows the workflow of the present study. For direct and inferential measures of aptamer specificity, see tables S3 to S6 and figs. S2 and S3.

Identification and characterization of serum protein networks

Biological networks are characterized by a nonrandom distribution of links between objects that are scale-free in nature (13). We reconstructed the protein co-regulation network using weighted gene-to-gene coexpression analysis (14). Co-regulation networks describe functional relationships that can reflect both physical and nonphysical interactions between objects, including proteins. The parameters used for constructing the serum protein network were chosen on the basis of the topological scale-free criterion (14) and tuned to enhance signal-to-noise ratio in the protein adjacency matrix (Fig. 1A and fig. S4, A and B). This analysis established that serum proteins cluster into 27 highly structured co-regulatory modules ranging in size between 20 and 921 proteins (table S7), whereas 15% fell outside of these modules. We examined the preservation of the network architecture by randomly splitting the AGES cohort into training and test sets, and applying a suite of statistics as previously described (15). We observed strong preservation of the overall structure of the network (Fig. 1B). Permutation testing of the data indicated that these modules were unlikely to have occurred by chance (fig. S4C). Pairwise correlation of proteins could be controlled at any of the many steps from transcription to secretion and clearance, and here we use the term “co-regulation” to encompass all such regulation between proteins.

Fig. 1 The serum protein network structure.

(A) Hierarchical clustering dendrogram using dynamic tree cut (29), revealing 27 serum protein modules. Each branch of the dendrogram represents a single protein, and the colored bar below denotes its corresponding protein module, as annotated in the legend to the right. The dendrogram height is the distance between proteins (14). (B) The cohort was randomly split into two equal parts, one for a training set and another for the test set, and a summary Z score statistics (15), plotted for each module presented as colored data points. The summary Z score <2 (blue dotted line) indicates no preservation; 2< summary Z score <10 (between the blue and green dotted lines) indicates moderate evidence of preservation; and a summary Z score >10 (green dotted line) indicates strong evidence of preservation. See fig. S10 for the ⅔ versus ⅓ split of the cohort.

We applied various annotation tools and found that the modules were enriched for distinct functional and tissue-specific signatures (table S8). This suggests that, to some degree, peripheral proteins cluster according to their function and/or tissue of origin. Individual modules frequently contained proteins from many distinct tissues, indicating that individual tissues were not the sole contributor to any module (tables S9 and S10). Indeed, comparison of the serum protein modules to 2672 coexpression mRNA modules constructed from solid tissues (16) revealed some, but largely an insignificant, agreement between the two (fig. S5), suggesting that serum protein co-regulation is distinct from that of most tissues. Thus, the human serum proteome appeared as functionally distinct modules of proteins produced by many tissues of the body.

We characterized each module’s eigenprotein [E(q), where q denotes a module] through a singular value decomposition and transformation of the variable protein levels for any given module. Each E(q) is a unique representation that most closely reflects the collective behavior of that module, explaining on average 40% of the protein covariance. Some modules’ E(q)s were more correlated to others and formed several superclusters of modules with shared functional categories (table S11 and fig. S6). We assessed if the modules were related to disease status of the AGES donors. Association of module E(q)s to various outcomes—including coronary heart disease (CHD), heart failure (HF), type 2 diabetes (T2D), visceral and subcutaneous adipose tissue (VAT and SAT, respectively), and metabolic syndrome (MetS)—were found (table S12). Whereas some modules showed no association to any disease measure, others were associated to some or all of them (table S12; Fig. 2, A to K; and fig. S7, A to I). Notably, superclusters, although internally consistent, often showed opposing relationships between superclusters to CHD-related outcomes (table S12; Fig. 2, E, H, and K; and fig. S7, B, E, and H), which may reflect differential roles of the protein modules in disease. The two modules PM1 and PM23 associated with T2D were also associated with MetS-, VAT-, SAT-, and CHD-related outcomes in a directionally consistent manner (table S12 and Fig. 2, B and K). Finally, modules associated with incident disease and to all-cause or post-CHD mortality (table S12; Fig. 2, C, F, and I; and fig. S7, C, F, and I) indicate that the protein network predicted future events and disease outcome. Time between diagnosis and sample collection had no effect on the association of individual proteins to prevalent disease (fig. S8). Thus, individual disease associations of the protein modules link the shared functions of many proteins to common diseases.

Fig. 2 The relationship between module’s E(q) to disease-related measures.

(A) The module PM1 is a single cluster of 31 proteins. (B) Positive associations of E(PM1) quintiles to variation (cm2) in visceral adipose tissue (VAT), incident coronary heart disease (inc CHD), type 2 diabetes (T2D), and the metabolic syndrome (MetS), ***P < 1 × 10−10. (C) Overall survival, i.e., with respect to all-cause mortality, was reduced for high E(PM1) levels (red curve) compared to low E(PM1) levels (cyan curve). (D) The modules PM6, PM9, and PM10 are members of supercluster II. (E) Inverse association of the E(PM6) and E(PM9) to prevalent CHD (prev CHD) and prevalent heart failure (prev HF), ***P ≤ 1 × 10−9. (F) Reduced overall survival for low E(PM10) levels (cyan curve) compared to high E(PM10) levels (red curve). (G) The PM17 is in supercluster IV. (H) Positive association of module’s E(PM17) to incident CHD and HF as well as prevalent CHD and HF,***P < 1 × 10−17 . (I) Reduced postincident CHD survival as well as overall survival for high E(PM17) levels (red curve) compared to low E(PM17) levels (cyan curve). (J) The module PM23 is a member of supercluster V. (K) Positive associations of the module E(PM23) quintiles to VAT and subcutaneous adipose tissue (SAT), and prevalent CHD, T2D, and MetS, ***P < 1 × 10−13. Data were analyzed using forward linear or logistic regression or Cox proportional hazards regression, depending on the outcome being continuous, binary, or a time to an event. Kaplan-Meier plots were used to display survival probabilities. The number of proteins per module is denoted at the branches of the dendrogram. Controls are individuals free of the disease in question.

The networks were scale-free in nature where a few protein nodes were highly connected. These are often referred to as network hubs, which organize network connectivity and information flow (17). Studies of gene networks from solid tissues suggest that hub nodes tend to be essential and evolutionarily conserved (18). The scaled intramodule connectivity Embedded Image of proteins was compared to various outcomes, where protein hubs showed stronger association to phenotypic measures than less well connected proteins (Fig. 3, A to L, and fig. S9, A to I). We observed strong reproducibility of the hub status of proteins through the network preservation analysis (fig. S10). In summary, the hub proteins consistently showed stronger association to various disease-related outcomes compared to proteins located in the periphery of the network. Thus, the structure of the protein network derived solely from measured protein variation was aligned with disease and phenotype variation across donors.

Fig. 3 The relationship between connectivity of proteins and disease-related measures.

(A) Circle graph of PM1 highlighting the hub protein CMPK1. (B) Positive correlation between within module connectivity (Ki) (x axis) and the absolute value of the effect size of the association of proteins to type 2 diabetes (T2D) (y axis). (C) Positive association of CMPK1 to T2D, and reduced overall survival associated with high serum CMPK1 levels (red curve). (D) Spring graph of PM9, highlighting the hub SYTL4. (E) Positive correlation between Ki and the association of proteins to prevalent heart failure (prev HF). (F) Inverse association of SYTL4 to HF, P < 1 × 10−30, and reduced overall survival associated with low serum SYTL4 levels (cyan curve). (G) A circle graph of PM17 highlighting the hub protein SUMO3. (H) Positive correlation between the Ki of proteins and their association to prevalent HF. (I) SUMO3 is positively associated with prevalent HF, and high levels of SUMO3 (red curve) predict reduced survival postincident CHD. (J) A circle graph of the PM23 highlighting the hub ACY1. (K) Positive correlation between the Ki of proteins and their association to both MetS and T2D. (L) Strong positive association of ACY1 to MetS. Network visualization was performed with the igraph package in R (30). Pearson’s r was estimated for correlation between Ki of proteins and their strength of association to disease measures. See Fig. 2 for other relevant statistics.

Genetic variants influence serum protein levels

Genome-wide association studies (GWAS) have identified thousands of common DNA sequence variants affecting human diseases (19). By integrating GWAS signals with genetic variants that affect intermediate traits like mRNA and/or protein, the identification of the causal candidates and pathways can be enhanced (69). Protein single-nucleotide polymorphisms (pSNPs) are DNA sequence variants associated with allelic imbalance in protein levels. Using Bonferroni correction, we identified 1046 significant cis pSNP–protein associations within a 300-kb window across the corresponding protein-coding sequences (table S13 and fig. S11A). Cross-referencing cis pSNP-protein pairs to cis eSNPs (expression SNPs)–transcript data from more than 30 tissues and cells (20) revealed an overlap of 37.3% (table S14). This suggests that ∼60% of the genetic effects on serum protein levels are mediated either by an as yet unknown transcriptional effect and/or by posttranscriptional mechanisms. The cis pSNP-proteins were underrepresented among highly connected protein nodes (fig. S11, B and C), consistent with the observation that there was reduced selective pressure on non–hub proteins (21). We observed examples of cis protein effects likely underlying reported GWAS risk loci for T2D, adiposity, and/or CHD (fig. S12A and tables S15 and S16).

We also found that various lead GWAS SNPs mediate distal trans effects on serum proteins (figs. S12, B to D, and S13A to E, and table S16). Many of these risk loci influenced proteins in both cis and trans, including the risk locus rs579459 for CHD upstream of ABO (fig. S12D and table S16), of which many were overrepresented in the PM27 module (fig. S12E). Genetic pleiotropy is a well-established phenomenon at the ABO locus (fig. S13G) (22). Indeed, the ABO blood groups have been linked to many diseases, which is in part mediated through their effect on the prothrombotic risk factor vWF (von Willebrand factor) (23, 24), a protein also affected by rs579459 in trans (table S16). In light of these results, all cis-acting pSNPs were tested directly for trans effects on proteins, revealing that 16% of all cis pSNPs affected in total 911 proteins in trans (table S17). Notably, twice as many, or 40.7% of cis pSNPs affecting proteins in trans, matched reported GWAS loci, compared to 20.7% among all cis pSNPs (tables S15 and S17), suggesting a link between trans protein regulation and disease variation. Finally, we confirmed on average 80% of previously reported cis and 74% of trans effects on plasma proteins in our dataset (table S18 and fig. S14). The finding that GWAS risk loci tend to regulate many proteins in trans and that these proteins cluster in the same module raised the possibility that common DNA sequence variants determine the architecture of the serum protein network.

Linking genetic variants to serum protein networks

Concerns have been expressed regarding the adequacy of GWAS to identify useful disease links, resulting in a call for greater efforts in identifying the gene regulatory networks that integrate the numerous GWAS signals (25). Given the hierarchical organization of the serum protein network and our observation that trans co-regulated proteins cluster in modules, it is possible that common cis- and trans-acting pSNPs underlie the structure of the serum protein network. We adopted a standard single-point GWAS test of linear regression against all module E(q)s using a threshold of P ≤ 5 × 10−8 for detection of genome-wide significant effects. Even though the E(q) represented many proteins as a complex trait, we often observed strong single SNP associations to the modules E(q)s (table S19), which we have termed network-associated protein SNPs (npSNPs), of which many were associated with more than one module consistent with their relationship. Given that npSNPs were associated with E(q)s of distinct protein modules, npSNPs may act via individual proteins in circulation (tables S19 and S20). For instance, the npSNP rs704 (NP_000629.3: p.Thr400Met) in VTN affected VTN in cis (Fig. 4, A to C, and fig. S13H), and 698 proteins in trans that were significant components of the modules enriched for immune-related pathways (Fig. 4, D and E, and table S20). Furthermore, distinct loci at APOE or BCHE were associated with the lipoprotein-enriched module PM11 (Fig. 5A, fig. S12I, and tables S8, S19, and S20), exerting cis effects on APOE and BCHE (Fig. 5B) and affecting 64 proteins in trans or 89% of all proteins in PM11 (Fig. 5, C and D, and tables S19 and S20). These results highlight the genetic architecture of the serum protein network and show that the modules and disease variation are intimately connected.

Fig. 4 The network-associated pSNP rs704 regulates modules related to immune functions.

(A) A circular Manhattan plot highlights the GWAS results for E(PM7), revealing a single highly significant association at rs704, a missense variant (NP_000629.3: p.Thr400Met) in VTN. Four other modules within supercluster II were affected by rs704 (table S20). (B) The rs704 variant, and many other linked SNPs in the region, exert a strong cis-acting effect on VTN within the 300-kb genomic region across the VTN gene. The black triangle demonstrates linkage disequilibrium (r2) patterns in the region derived from the AGES cohort data. (C) The bimodal population distribution of the VTN protein is explained by the drastic reduction in VTN levels in individuals homozygous for the rs704 C minor allele. (D) A schematic presentation of the single cis and many trans effects mediated by the rs704 in VTN on serum proteins. (E) Proteins affected by rs704 in trans (and VTN in cis) cluster in modules of immune-related functions (table S20). The percentage denotes the fraction of proteins within a given module regulated by the rs704 locus.

Fig. 5 Network-associated pSNPs at the APOE and BCHE loci regulate proteins of module PM11.

(A) A circular Manhattan plot for the E(PM11), revealing two distinct genomic loci at chromosomes 3 (BCHE) and 19 (APOE / TOMM40). The three npSNPs at the APOE/TOMM40 locus are not correlated (r2 = 0). (B) The npSNPs at the two genomic regions exert a strong cis-acting effect on the serum levels of APOE (left) and BCHE (right). (C) The npSNPs at the APOE locus affected APOE in cis and mediated trans effects on 38 proteins, whereas the npSNP rs1803274, a missense variant (NP_000046.1: pAla567Thr) in BCHE, affected BCHE in cis and 20 other proteins in trans (table S20). (D) The distinct npSNPs at the APOE and BCHE loci regulate 88.9% of all proteins that constitute PM11, Fisher’s exact test P = 3 × 10−34, as demonstrated in the Venn diagram to the right. The number 3 refers to proteins in PM11 that are not regulated by the npSNPs.

The discoveries of npSNPs and cis-trans protein pairs allowed us to assess whether the serum protein networks resulted from cross-tissue regulatory control by comparing them to gene expression data from 53 different human tissues and involving only the top tissue-specific proteins (table S21 and S22). Many cis-acting pSNPs affected serum levels of tissue-specific proteins and subsequently affected variable serum levels of other proteins synthesized in distinct tissues (table S21 and S22). These data suggest that the serum protein network arose at least in part via systemic cross-tissue regulation.

Discussion

Deep protein profiling in a large well-characterized population of the elderly revealed the higher-order topology and modularity of the human serum proteome. The study cohort used was population-based and thus contained nondiseased donors and donors suffering from various diseases and complications (12). As such, the relationship between proteins is the consensus of the whole population. Recent studies have revealed a pronounced effect of aging on transcriptional changes that are both tissue and organismal specific (2628). It is possible that the observed network structure of the serum proteins reflects to some extent those changes associated with the old age of the population and may indeed reflect a certain set-point of homeostasis or even a decline in adaptive homeostasis in comparison to younger populations. Note that, for all the analyses of the present study, we adjusted for the confounder age.

Structural features of the serum protein network resembled those of regulatory networks constructed in solid tissues (7, 8), including association of hub proteins to diseases. However, comparison of cis-trans protein pairs and cognate protein modules to gene expression data and coexpression networks derived from solid tissues suggests that the serum protein network arose in part via systemic cross-tissue regulation. A follow-up replication testing of the many findings from these primary data, including the strong connection of protein hubs to disease traits, is warranted.

We anticipate that additional npSNPs can be identified with this technique, comparable to those found with conventional GWAS analyses of common diseases (25). Given that the effect of established GWAS loci is more complex than previously anticipated, this underscores the role of protein networks as the sensors and integrators of complex disease. The strong association of individual proteins and networks to disease states observed in this study indicates that the serum proteome may be a rich and accessible setting to mine for biomarkers of disease and disease responses to integrate information from tissues in a global regulatory network. As such, coordinated variance of serum proteins may offer unrecognized opportunities for target and biomarker identification in human disease.

Supplementary Materials

www.sciencemag.org/content/361/6404/769/suppl/DC1

Materials and Methods

Figs. S1 to S14

Tables S1 to S22

References (3168)

References and notes

Acknowledgments: We thank the staff at SomaLogic (CO) for performing the assays to measure protein levels, and P. MacNamara and G. Joyce of the Genomics Institute of the Novartis Research Foundation (GNF) for their leadership in supporting this work. We thank J. Loureiro, V. Swaroop, S. Abubucker, and F. Mapa of NIBR Cambridge for their contributions in support of the mass spectrometry workflow. We thank K. Bjarnadóttir, S. Gunnarsdóttir, and A. Hauksdóttir at the Icelandic Heart Association (IHA) for all specimen handling. Funding: Supported by Novartis Institute for Biomedical Research (NIBR), IHA, and in part by the intramural research program at the National Institute of Aging (N01-AG-12100 and HHSN271201200022C), the Althingi (the Icelandic Parliament), and the Icelandic Centre for Research (RANNIS) grant 141101-051. Author contributions: Conception and design: V.E., J.R.L., and V.G.; Data curation: E.F.G., T.A., S.R.H., V.T., N.F., S.S., G.E., T.B.H., L.J.L.; Methodology and analysis: V.E., M.I., J.R.L., T.A., E.F.G., V.G., N.F., L.L.J., R.P., H.H., L.S., X.Y., S.A.L., J.T., J.Z., B.Z., J.Z., A.P.O., A.M.,Va.G., Ö.O., and J.J.; Supervision: V.E., V.G. J.R.L., and L.L.J.; Writing—original draft: V.E.; Writing—review and editing: V.E., J.R.L., V.G., N.F., L.L.J., M.I., Va.G., L.S., and X.Y. Competing interests: J.T., S.R.H., V.T., N.F., R.P., L.L.J, A.P.O., and J.R.L. are all employees and stockholders of Novartis. All other authors declare no competing interests. Data and materials availability: The raw mass spectrometry data (DDA or MRM) were deposited to the ProteomeXchange Consortium with the five dataset identifiers PXD008819 to PXD008823, and the dataset identifier PASS01145. The custom-design Novartis SOMAscan is available through a collaboration agreement with the Novartis Institutes for BioMedical Research (lori.jennings@novartis.com). Data from the AGES Reykjavik study are available through collaboration (AGES_data_request@hjarta.is) under a data usage agreement with the IHA. All data supporting the conclusions of the paper are presented in the main text; supplementary materials; and Excel tables S1, S3, S4, S6, S7, S9, S10, S13 to S15, S17, S21, and S22 hosted online. A description of methods and materials is available in the supplementary materials.
View Abstract

Stay Connected to Science

Navigate This Article