Report

Realizing private and practical pharmacological collaboration


Science  19 Oct 2018:
Vol. 362, Issue 6412, pp. 347-350
DOI: 10.1126/science.aat4807

Sharing pharmaceutical research

Increased collaboration will enhance our ability to predict new therapeutic drug candidates. Such data sharing is currently limited by concerns about intellectual property and competing commercial interests. Hie et al. introduce an end-to-end pipeline, using modern cryptographic tools, for secure pharmacological collaboration. Multiple entities can thus securely combine their private datasets to collectively obtain more accurate predictions of new drug-target interactions. The computational pipeline is practical, producing results with improved accuracy in a few days over a wide area network on a real dataset with more than a million interactions.

Science, this issue p. 347

Abstract

Although combining data from multiple entities could power life-saving breakthroughs, open sharing of pharmacological data is generally not viable because of data privacy and intellectual property concerns. To this end, we leverage modern cryptographic tools to introduce a computational protocol for securely training a predictive model of drug–target interactions (DTIs) on a pooled dataset that overcomes barriers to data sharing by provably ensuring the confidentiality of all underlying drugs, targets, and observed interactions. Our protocol runs within days on a real dataset of more than 1 million interactions and is more accurate than state-of-the-art DTI prediction methods. Using our protocol, we discover previously unidentified DTIs that we experimentally validated via targeted assays. Our work lays a foundation for more effective and cooperative biomedical research.

Collaborative efforts to develop new, life-saving drug therapies have recently begun to take shape among pharmaceutical companies and academic labs, despite the highly competitive nature of the industry (1–5). Driving this transformation is the stalled or declining productivity of existing drug development pipelines amidst growing financial and regulatory pressures. Many in industry and academia are realizing that the difficult task of identifying novel drug candidates would be more successful if they could leverage pooled experimental datasets and knowledge that go beyond any single organization (6–8).

Until now, however, such forms of collaboration (1–3), including open-access data sharing partnerships like the Structural Genomics Consortium (www.thesgc.org), have been of limited scope because pharmacological data sharing is fundamentally restricted by concerns about intellectual property and other financial interests. Currently, entities must moderate the amount of data they share in order to maintain the confidentiality of drugs under development or the set of potential targets being tested, both of which may contain sensitive information about underlying research or business strategies.

Modern cryptography offers techniques to broaden pharmacological collaboration by greatly mitigating privacy concerns associated with data sharing. Secure multiparty computation (MPC) protocols (9) allow multiple entities to compute over their private datasets without revealing any information about the underlying raw data, except for the final computational output. Unfortunately, the promise of privacy-preserving collaboration has been severely hindered by the inability of existing secure computation frameworks to scale to complex computations over large datasets. In particular, analyzing a large amount of experimental data to predict new therapeutic interactions typically involves sophisticated algorithms that present a major computational challenge for MPC.

Here we introduce a proof-of-concept, end-to-end pipeline for collaborative drug–target interaction (DTI) prediction based on secure MPC. Conceptually, our protocol divides computation across collaborating entities while ensuring that none of the entities have any knowledge about the private data (Fig. 1). We achieve this result by using a cryptographic framework known as “secret sharing” (10) in which a private value (a “secret”) is collectively represented by multiple entities. Each entity is given a random number (a “share”) in a finite field (i.e., integers modulo some prime number p) such that the sum of all shares modulo p equals the secret. Any strict subset of entities cannot extract any information about the underlying secret using the subset’s shares. Various protocols have been developed for performing elementary operations (e.g., addition or multiplication) over secret-shared inputs (10, 11), which, taken together, form the building blocks for a general-purpose MPC framework.
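The additive secret sharing described above can be illustrated with a short sketch. This is a minimal illustration rather than the protocol implementation: the prime `P`, the number of parties, and the function names `share`, `reconstruct`, and `add_shared` are assumptions chosen for the example.

```python
import secrets

# Illustrative prime modulus (2^61 - 1); the field used in the paper is not stated here.
P = 2**61 - 1

def share(secret, n_parties, p=P):
    """Split `secret` into n additive shares whose sum modulo p equals the secret."""
    shares = [secrets.randbelow(p) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % p
    return shares + [last]

def reconstruct(shares, p=P):
    """Recover the secret; this requires every share."""
    return sum(shares) % p

def add_shared(x_shares, y_shares, p=P):
    """Addition is local: each party adds the two shares it holds."""
    return [(x + y) % p for x, y in zip(x_shares, y_shares)]

x = share(42, 3)
y = share(100, 3)
assert reconstruct(x) == 42
assert reconstruct(add_shared(x, y)) == 142
```

Because all but one share are drawn uniformly at random, any strict subset of the shares is itself uniformly distributed and carries no information about the secret. Addition needs no communication, whereas multiplication requires an interactive protocol.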

Fig. 1 Secure pipeline for pharmacological collaboration.

Collaborating entities (e.g., pharmaceutical companies or research laboratories) have large private datasets of DTIs, as well as corresponding chemical structures and protein sequences. In our protocol, the entities first use secret sharing to pool their data in a way that reveals no information about the underlying drugs, targets, or interactions (step 1). The collaborating entities then jointly execute a cryptographic protocol that trains a predictive model (e.g., a neural network) on the pooled dataset (step 2). The final model can be made available to participating entities or may be used to distribute DTI predictions to participants in a way that encourages greater data sharing (step 3).

Although secret sharing–based MPC typically requires overwhelming amounts of data communication between entities for complex and large-scale computations, very recent optimizations have leveraged techniques such as generalized Beaver multiplication triples and shared pseudorandom number generators to substantially reduce communication cost, thus enabling practical secure computation for challenging problems such as genome-wide association studies for 1 million individuals (12). Even with these advances, however, secure MPC is still infeasible for existing DTI prediction methods (13–16), primarily because their computations scale quadratically with the number of drugs (n) and the number of targets (m) in the dataset (e.g., n² or nm), which is prohibitive for realistic datasets with millions of compounds.
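For readers unfamiliar with Beaver triples, the sketch below shows the classic (not the generalized) form for multiplying two secret-shared values. It assumes additive sharing modulo an illustrative prime and uses a hypothetical `make_triple` function as a stand-in for whatever preprocessing step supplies the correlated randomness; none of these names come from the paper.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)

def share(value, n, p=P):
    parts = [secrets.randbelow(p) for _ in range(n - 1)]
    return parts + [(value - sum(parts)) % p]

def reconstruct(shares, p=P):
    return sum(shares) % p

def make_triple(n, p=P):
    """Stand-in for preprocessing: shares of random a, b and of c = a*b."""
    a, b = secrets.randbelow(p), secrets.randbelow(p)
    return share(a, n, p), share(b, n, p), share(a * b % p, n, p)

def beaver_multiply(x_sh, y_sh, triple, p=P):
    a_sh, b_sh, c_sh = triple
    # The parties open only the masked values d = x - a and e = y - b,
    # which are uniformly random because a and b are.
    d = reconstruct([(x - a) % p for x, a in zip(x_sh, a_sh)], p)
    e = reconstruct([(y - b) % p for y, b in zip(y_sh, b_sh)], p)
    # x*y = c + d*b + e*a + d*e; the first three terms are computed share-wise.
    z_sh = [(c + d * b + e * a) % p for a, b, c in zip(a_sh, b_sh, c_sh)]
    # The public term d*e is added by a single party.
    z_sh[0] = (z_sh[0] + d * e) % p
    return z_sh

product = beaver_multiply(share(6, 3), share(7, 3), make_triple(3))
assert reconstruct(product) == 42
```

Each multiplication consumes one triple and opens two masked values, which is why reducing the communication spent on triples matters for large computations.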

To achieve scalable computation while maintaining high accuracy, we draw from recent advances in deep learning (17) to train a neural network model for DTI prediction. Our neural network takes feature representations of a compound and a target as input and predicts the interactivity of the given pair. Although we use chemical structure fingerprints and protein domain annotations as input features in our computational experiments (materials and methods), our framework readily generalizes to alternative features. We circumvent the quadratic complexity of existing methods by training our neural network over a dataset consisting of only the observed DTIs and a comparable number of putatively noninteracting drug–target pairs; this dataset is typically much smaller than the full drug-by-target matrix. Furthermore, we greatly reduce the cryptographic overhead of secure neural network training by optimizing our architectural choices for efficient MPC, such as using the rectifier (18) as our activation function and hinge loss as our loss function, both of which require only a single data-oblivious comparison to evaluate (materials and methods). These operations can be more efficiently implemented in MPC than alternatives such as the sigmoid function, which requires many such comparisons to accurately approximate. Taken together, our efficient protocol allows our neural network to securely train over a wide area network (WAN) in less than 4 days on a dataset with more than 1 million training instances (table S1). In contrast, a recently proposed protocol for privacy-preserving neural network training (19) requires months of communication time over a WAN to train on an image dataset of a smaller scale (60,000 examples, 784 features).
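The claim that the rectifier and hinge loss each need only one comparison can be seen in a plaintext sketch. The code below is not secure and is not the authors' implementation; it assumes labels encoded as +1 and -1 and uses hypothetical function names. In an MPC setting the comparison bit `b` would be computed obliviously on secret-shared values, after which the remaining work is a multiplication.

```python
def relu_with_grad(z):
    """Rectifier max(0, z) and its derivative, both driven by one comparison."""
    b = 1.0 if z > 0 else 0.0   # the single comparison
    return b * z, b             # value, derivative with respect to z

def hinge_with_grad(score, label):
    """Hinge loss max(0, 1 - label*score) for label in {+1, -1} (assumed encoding)."""
    margin = 1.0 - label * score
    b = 1.0 if margin > 0 else 0.0   # the single comparison
    return b * margin, -label * b    # loss, derivative with respect to score

value, slope = relu_with_grad(2.5)
loss, grad = hinge_with_grad(0.25, 1)
assert (value, slope) == (2.5, 1.0)
assert (loss, grad) == (0.75, -1.0)
```

A sigmoid, by contrast, has no such closed form over comparison and multiplication alone, so it is typically approximated piecewise, with each piece adding further comparisons.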

To demonstrate the accuracy of our securely trained neural network for DTI prediction (named Secure DTI), we compared it to state-of-the-art DTI prediction techniques, including those based on matrix factorization with side information (CMF) (14), network diffusion (NetLapRLS and BLMNII) (16, 20), and heterogeneous data integration (DTINet and HNM) (15, 21) on a standard benchmark dataset (22) with 1923 observed interactions (materials and methods). In addition to newly ensuring the confidentiality of the pooled input data during the computation, Secure DTI surpasses the performance of all baseline plaintext methods in cross-validation accuracy (Fig. 2A and fig. S1), a surprising result in light of the optimizations we made to achieve practical scalability. Our improvement over the best-performing baseline method (DTINet) is statistically significant (one-sided Wilcoxon rank-sum P value = 0.006) and further pronounced in a more challenging but realistic evaluation setting where the entire interaction space is considered as the test data rather than a balanced number of positive and negative examples (fig. S1F). None of the baseline methods can be efficiently implemented in MPC for the purposes of secure collaboration, owing to their quadratic complexity in the number of drugs and targets. In contrast, matrix factorization without side information (MF) lends itself to efficient MPC (23) but at the cost of greatly diminished model performance (Fig. 2, A and B, and fig. S1).

Fig. 2 Prediction of DTIs.

(A) Predictions from the DrugBank 3.0 dataset. Bar height corresponds to mean AUPR (area under the precision-recall curve), and error bars indicate SD. We compared Secure DTI to the plaintext methods BLMNII (20), NetLapRLS (16), HNM (21), MF (13), CMF (14), and DTINet (15), as reported in Luo et al. (15), by means of 10-fold cross-validation on balanced training and test sets (materials and methods; see fig. S1 for other evaluation settings). (B) Predictions from the STITCH 5 dataset with more than 1 million drug–target pairs. Secure DTI is compared with matrix factorization with (CMF) and without (MF) side information (see fig. S2 for other evaluation settings). Solid lines, sampling negative examples randomly; dashed lines, sampling negative examples while matching the relative frequencies of drugs and targets to those in the positive examples, representing a more challenging test case. Reported AUPRs are for the solid curves. (C) Runtime of our training protocol, over a local area network (LAN), for different dataset sizes. Box height represents SD.

We next set out to demonstrate the scalability and predictive performance of Secure DTI on a much larger dataset that more accurately represents the scale of cross-institutional collaboration. We obtained 969,817 interactions from the STITCH 5 human dataset (24), which is, to our knowledge, the largest publicly available DTI dataset, and evaluated the cross-validation performance of Secure DTI. Even on the challenging task of predicting DTIs of previously unseen compounds, Secure DTI achieved high accuracy [area under the precision-recall curve (AUPR) of 0.95], which substantially outperforms matrix factorization methods (AUPRs of 0.50 and 0.43) (Fig. 2B and fig. S2). Owing to their quadratic scalability, other baseline methods could not be reasonably applied to a dataset of this size (even in plaintext). In contrast, Secure DTI took less than 4 days to train on millions of interactions over a WAN (materials and methods) and efficiently scaled with a linear dependence on the number of interactions in the dataset (Fig. 2C and table S1). Even when training Secure DTI on 2 million interactions, we extrapolate the total runtime for one epoch (one linear pass over the full, shuffled training set) to be ~2.2 days. In practice, we expect our model to achieve high accuracy in only a few training epochs; we obtained all of our reported results after 1.5 epochs. Additionally, given that our protocol admits flexibility in the choice of predictive model, we also securely trained a support vector machine (SVM) instead of a neural network; the SVM reduced predictive performance (fig. S2).
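As a reminder of what the AUPR figures above measure, the following sketch computes average precision, a common step-wise estimate of the area under the precision-recall curve. The exact estimator used for the reported numbers is not specified in this section, so the choice of average precision, the function name, and the tie-breaking by input order are all assumptions.

```python
def average_precision(labels, scores):
    """Step-wise AUPR estimate; labels are 1 (interaction) or 0 (no interaction)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank pairs from highest to lowest predicted score (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            total += true_pos / rank   # precision at each new recall level
    return total / n_pos

# Positives ranked 1st and 3rd: (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
assert abs(ap - 5 / 6) < 1e-12
```

Unlike the area under the ROC curve, this metric is sensitive to the ratio of positives to negatives, which is why the balanced and full-interaction-space evaluation settings can give different pictures.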

To go beyond cross-validation and demonstrate the potential for novel discoveries that can result from our collaborative pipeline, we trained Secure DTI on all STITCH 5 interactions and scored the remaining possible drug–target pairs for interactivity, which is closer to how our pipeline would be used in a real-world setting. We controlled for bias toward highly represented drugs and targets in the dataset by either (i) filtering out any prediction involving both a drug and target highly represented in the original dataset (Secure DTI-A) or (ii) sampling negative examples (i.e., noninteractions) during model training such that each drug or target was seen at the same relative frequency in the negative examples as in the positive examples (Secure DTI-B). In both cases, many of our top predictions (5 of 12 for Secure DTI-A and 9 of 12 for Secure DTI-B) were validated by our own targeted assay experiments or by published experimental studies that have not yet been deposited into the STITCH database (Table 1). Our validation experiments suggest a previously unknown interaction between imatinib and ErbB4, for which we could not find any existing experimental support. It will be interesting to find out whether this interaction is confirmed by other studies. The top prediction from both methods was an interaction between the estrogen receptor (ER) and droloxifene, which had reached phase 3 clinical trials as an ER modulator for advanced breast cancer (25). Similarly, the predicted interaction between the vitamin D receptor (VDR) and seocalcitol has been clinically well established (26). Furthermore, some predictions without direct activity have strong evidence for an indirect functional interaction; for example, nutlin-3 has been shown to inhibit poly(ADP-ribose) polymerase 1 (PARP1) protein levels through p53-dependent proteasomal degradation in mouse fibroblasts (27). 
All of our findings were obtained without revealing any information about the underlying drugs, targets, and interactions during the computation.
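One way to realize the frequency-matched negative sampling used for Secure DTI-B is sketched below. The details of the authors' procedure are in their materials and methods and are not reproduced here, so this should be read as one plausible scheme under stated assumptions: drugs and targets are drawn independently from their empirical frequencies among the positives, known interactions are rejected, and the function name and parameters are hypothetical.

```python
import random

def sample_matched_negatives(positives, n_samples, seed=0, max_tries_per_sample=1000):
    """Draw putative non-interactions whose drug and target frequencies track the positives."""
    rng = random.Random(seed)
    known = set(positives)
    # Repeated entries encode how often each drug or target appears among positives.
    drugs = [d for d, _ in positives]
    targets = [t for _, t in positives]
    negatives, seen = [], set()
    for _ in range(n_samples * max_tries_per_sample):
        if len(negatives) == n_samples:
            break
        pair = (rng.choice(drugs), rng.choice(targets))
        # Reject observed interactions and duplicates (assumed design choice).
        if pair not in known and pair not in seen:
            seen.add(pair)
            negatives.append(pair)
    return negatives

positives = [("d1", "t1"), ("d1", "t2"), ("d2", "t1")]
assert sample_matched_negatives(positives, 1) == [("d2", "t2")]
```

Rejecting duplicates and known interactions slightly distorts the target frequencies, so the match is approximate. The function may also return fewer pairs than requested when few unobserved pairs exist.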

Table 1 Predicted out-of-dataset DTIs.

We trained Secure DTI on all human DTIs from STITCH 5, which we used to score and rank all pairs of drugs and targets that are not in the STITCH database. We implemented two methods to control for model bias toward overrepresented drugs and targets, either (i) filtering out predictions involving a drug and target that are both highly represented in STITCH (Secure DTI-A) or (ii) retraining Secure DTI such that the negative training examples had an equal representation of drugs and targets compared with the positive training examples (Secure DTI-B) (materials and methods). Interactions labeled N/A involve commercially unavailable compounds and thus could not be tested. “Active” interactions have median inhibitory concentrations <100 μM, whereas “inconclusive” interactions demonstrate observable activity only at one or two high-concentration levels, a potential artifact of compound aggregation. We labeled the interaction between actinomycin D and PARP1 as “weakly active” because consistent activity was observed over a wide range of concentrations, including near-50% inhibition at 100 μM (our highest tested concentration). However, it should be noted that its dose-response curve does not follow a typical sigmoidal shape. References and additional information are provided in table S2.

To provide enhanced interpretability of our reported results, we incorporated an additional step into our pipeline for securely evaluating the impact of individual input features on the prediction outcome (supplementary text). When applied to our top predictions from the STITCH database, this capability linked drugs to specific ligand-binding or functional sites within the predicted target (table S3).

We envision a real-world scenario in which multiple participating entities contribute secret-shared data to train a machine learning model on a joint pharmacological dataset (Fig. 1). After training, the model could be made available to all participants or could remain private such that entities receive a number of predictions commensurate with the amount of data they contribute in order to incentivize participation. As more training data will most likely result in greatly improved performance (fig. S3 and supplementary text), entities will be incentivized to share information in a way that is mutually beneficial and has provable privacy guarantees.

Our pipeline is secure under the “honest-but-curious” security model in which the collaborating entities follow the correct protocol and do not collude to reconstruct the data. This is a substantial improvement over the current state of biomedical research, in which privacy concerns hinder any collaboration, but our framework can also be extended to achieve even stronger security guarantees. Because our security guarantee holds as long as at least one entity is honest during the main computation (materials and methods), we can relax the no-collusion requirement by introducing additional collaborating entities into our protocol, which does not substantially increase total computation time but does increase communication linearly in the number of entities. If we require security against malicious entities who deviate from the protocol during the online computation, we can include a message authentication code (MAC) with each message. At the end of the protocol, the MAC is verified to ensure that each step was performed according to the protocol specification, a technique introduced in the SPDZ framework (28). This approach roughly doubles computation and communication, offering a trade-off between security and performance that can be adjusted according to specific study requirements.
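The MAC-based check mentioned above can be illustrated in simplified form. In a SPDZ-style scheme, each secret-shared value x is accompanied by shares of α·x, where α is a global key that is itself secret-shared. The sketch below assumes additive sharing modulo an illustrative prime and omits the commitment step that a real protocol uses before revealing the per-party differences; the function names are hypothetical.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)

def share(value, n, p=P):
    parts = [secrets.randbelow(p) for _ in range(n - 1)]
    return parts + [(value - sum(parts)) % p]

def reconstruct(shares, p=P):
    return sum(shares) % p

def authenticate(x, alpha, n, p=P):
    """Return shares of x together with shares of its MAC, alpha * x mod p."""
    return share(x, n, p), share(alpha * x % p, n, p)

def mac_check(opened, mac_sh, alpha_sh, p=P):
    """Each party computes m_i - alpha_i * opened; the differences sum to
    alpha * (x - opened), which is zero when the opened value is correct."""
    diffs = [(m - a * opened) % p for m, a in zip(mac_sh, alpha_sh)]
    return sum(diffs) % p == 0

alpha = 1 + secrets.randbelow(P - 1)        # nonzero global MAC key
alpha_sh = share(alpha, 3)
x_sh, m_sh = authenticate(1234, alpha, 3)
assert mac_check(reconstruct(x_sh), m_sh, alpha_sh)   # honest opening passes
assert not mac_check(1235, m_sh, alpha_sh)            # tampered value fails
```

Carrying a MAC share alongside every value share is consistent with the roughly twofold overhead noted above. A party that does not know α can make a wrong value pass only by guessing the key, which succeeds with probability about 1/p.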

Although we did not add noise to the final computation output of our pipeline to limit information leakage, a technique known as differential privacy (29, 30), a method being developed for differentially private neural networks can be used in conjunction with our protocol (31). An alternative strategy for collaborative neural network–based prediction is to train local models in plaintext and use secure protocols only when periodically averaging over these models, thus minimizing the amount of cryptographic overhead (32, 33). However, this approach is vulnerable to reverse engineering–based attacks in which a malicious collaborator jointly trains a local model (e.g., a generative adversarial network) that uncovers information about private data owned by honest collaborators, even when differential privacy techniques are applied (34). In contrast, securely training a single model over a decentralized network of computing parties, as in our pipeline, is not vulnerable to such attacks.

Our privacy-preserving protocols generalize to other large-scale data sharing problems beyond drug discovery, with the highest potential for impact in areas that suffer from a lack of collaboration due to privacy concerns, such as predictive analyses of electronic health records. Our practical demonstration of secure, large-scale machine learning with neural networks may also provide a useful blueprint for enhancing privacy in many other domains where neural networks have been shown to be successful.

Supplementary Materials

www.sciencemag.org/content/362/6412/347/suppl/DC1

Materials and Methods

Supplementary Text

Figs. S1 to S4

Tables S1 to S3

References (35–58)

References and Notes

Acknowledgments: We thank E. Irvine and P. Macaluso for assistance with validation experiments. Funding: B.H. and H.C. are partially supported by NIH grant R01GM081871 (to B.B.). H.C. is also partially supported by the Kwanjeong Educational Foundation. Author contributions: B.H., H.C., and B.B. developed the computational methods. B.H. and H.C. implemented the protocol and performed the analyses. B.B. guided the research. All authors wrote the manuscript. Competing interests: A provisional patent application on this work (serial no. 62/611,678) was filed on 29 December 2017. Data and materials availability: Code and data are available at http://secure-dti.csail.mit.edu. Drug–target interaction datasets were obtained from DrugBank version 3.0 (www.drugbank.ca/) and STITCH version 5.0 (http://stitch.embl.de/). Protein domain information is available from the Pfam database (https://pfam.xfam.org/).