An Information-Intensive Approach to the Molecular Pharmacology of Cancer


Science  17 Jan 1997:
Vol. 275, Issue 5298, pp. 343-349
DOI: 10.1126/science.275.5298.343


Since 1990, the National Cancer Institute (NCI) has screened more than 60,000 compounds against a panel of 60 human cancer cell lines. The 50-percent growth-inhibitory concentration (GI50) for any single cell line is simply an index of cytotoxicity or cytostasis, but the patterns of 60 such GI50 values encode unexpectedly rich, detailed information on mechanisms of drug action and drug resistance. Each compound's pattern is like a fingerprint, essentially unique among the many billions of distinguishable possibilities. These activity patterns are being used in conjunction with molecular structural features of the tested agents to explore the NCI's database of more than 460,000 compounds, and they are providing insight into potential target molecules and modulators of activity in the 60 cell lines. For example, the information is being used to search for candidate anticancer drugs that are not dependent on intact p53 suppressor gene function for their activity. It remains to be seen how effective this information-intensive strategy will be at generating new clinically active agents.

Drug discovery is being transformed by new developments in molecular cell biology and the information sciences. A case in point is the drug discovery program conducted by the Developmental Therapeutics Program (DTP) of the NCI. Before 1985, the NCI used mice bearing murine leukemia P388 cells to screen new compounds for anticancer activity. That strategy identified agents active against leukemias but relatively few that were effective against solid tumors, including the most common human carcinomas. Hence, the NCI established a primary screen in which compounds are tested in vitro for their ability to inhibit growth of 60 different human cancer cell lines (1). Included are melanomas, leukemias, and cancers of breast, prostate, lung, colon, ovary, kidney, and central nervous system origin. A highly schematic view of this portion of the NCI drug discovery-development process is shown in Fig. 1. Compounds for testing have come principally from synthetic chemistry and natural product sources, but combinatorial libraries and products of biotechnology are also being screened.

Fig. 1.

Simplified schematic overview of an information-intensive approach to cancer drug discovery and molecular pharmacology at the NCI. Each row of the activity (A) database represents the pattern of activity of a particular compound across the 60 cell lines. As described in the text, the A database can be related to a structure (S) database containing 2D or 3D chemical structure characteristics of the compounds and a target (T) database containing information on possible molecular targets or modulators of activity within the cells.

This “disease-oriented” strategy for drug discovery was based on the hypothesis that selective activity in vitro against cancer cell lines from a particular organ would predict selective activity against corresponding tumors in humans. That concept is being tested as agents progress through clinical trials, and the answer is not yet clear. However, patterns of activity observed in the screen have proved predictive in an even more powerful way at the molecular level: They provide incisive information on the mechanisms of action of the compounds tested and on molecular targets and modulators of activity within the cancer cells. The cell lines are not fully representative of solid tumors in humans, but their patterns of pharmacological response are rich in information. We refer to this test system as a “screen,” but it has also become a way to “profile” or “fingerprint” potential therapeutic agents.

The patterns of activity were first analyzed by the COMPARE algorithm (2). Given one compound as a “seed,” COMPARE searches the database of screened agents for those most similar to the seed in their patterns of activity against the panel of 60 cell lines. Similarity in pattern often indicates similarity in mechanism of action, mode of resistance, and molecular structure (2). This form of analysis has been applied productively to topoisomerase II inhibitors (3), pyrimidine biosynthesis inhibitors (4), and tubulin-active compounds (5), among other classes of agents. Back-propagation neural networks and predictive methods from classical statistics have also been used to verify that the patterns of activity could predict a compound's mechanism of action (6). More detailed information on the relation between pattern and mechanism has come from additional analyses based on techniques from statistics and artificial intelligence (7, 8). To date, five compounds (spicamycin analog KRN 5500, flavopiridol, UCN-01, a depsipeptide, and a quinocarmycin analog) assessed in the screen and analyzed by the methods described above have been selected for entry into clinical trials (9).
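The seed-based search described above can be illustrated with a minimal sketch. It assumes that each compound's pattern is a list of activity values over the same cell lines and that similarity is measured by the Pearson correlation coefficient. The function names and toy values are hypothetical, and the code is not the COMPARE implementation itself.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_by_similarity(seed_name, patterns, top=3):
    """Rank every other compound by pattern similarity to the seed."""
    seed = patterns[seed_name]
    scores = [(pearson(seed, p), name)
              for name, p in patterns.items() if name != seed_name]
    scores.sort(reverse=True)  # most similar first
    return scores[:top]

# Toy activity patterns over 6 cell lines (hypothetical values, not NCI data).
patterns = {
    "seed":    [5.0, 6.2, 7.1, 5.5, 6.8, 4.9],
    "analog":  [5.1, 6.0, 7.3, 5.4, 6.9, 5.0],
    "inverse": [7.0, 5.8, 4.9, 6.5, 5.2, 7.1],
    "flat":    [6.0, 6.1, 6.0, 5.9, 6.0, 6.1],
}
ranked = rank_by_similarity("seed", patterns)
for r, name in ranked:
    print(f"{name:8s} r = {r:+.3f}")
```

In the screen itself each pattern would have 60 entries, one per cell line, and the search would run over the whole database of screened agents rather than a handful of toy compounds.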

Bioinformatics: The Structure, Activity, and Target Databases

Here we describe a general way in which information on the activity patterns is being combined with other types of information to address problems in drug discovery and molecular pharmacology. A formulation of this approach in terms of three databases is shown in Fig. 1: (A) contains the activity patterns already discussed, (S) contains molecular structural features of the tested compounds, and (T) contains possible targets or modulators of activity in the cells. Portions of these databases can be accessed through DTP's World Wide Web site ( Links to these and additional pertinent databases can be found at These two Web sites will be updated progressively with additional data and tools of analysis (10).

The chemical structure (S) database can be coded in terms of any set of two-dimensional (2D) or 3D molecular structure descriptors. The NCI's Drug Information System (DIS) contains chemical connectivity tables for approximately 460,000 molecules, including the 60,000 tested to date. Three-dimensional structures have been obtained for 97% of the DIS compounds, and a set of 588 bit-wise descriptors has been calculated for each structure by use of the Chem-X computational chemistry package (ChemDBS-3D module, Chemical Design, Oxford, U.K.) (11). This data set provides the basis for pharmacophoric searches; if a tested compound, or set of compounds, is found to have an interesting pattern of activity, its structure can be used to search for similar molecules in the DIS database (12).
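A structure-based similarity search over bit-wise descriptors might look like the following sketch. The text specifies 588 bits per structure but does not say which similarity measure is used; the Tanimoto coefficient here is an assumption chosen because it is a common measure for bit fingerprints, and the compound names and bit positions are hypothetical.

```python
N_BITS = 588  # number of bit-wise descriptors per structure, as stated in the text

def fingerprint(bits_on):
    """Pack a collection of set-bit positions into one Python int."""
    fp = 0
    for i in bits_on:
        if not 0 <= i < N_BITS:
            raise ValueError(f"bit {i} outside 0..{N_BITS - 1}")
        fp |= 1 << i
    return fp

def tanimoto(a, b):
    """Assumed similarity measure: shared bits divided by bits set in either."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def search(query, library, threshold=0.5):
    """Return (similarity, name) pairs at or above the threshold, best first."""
    hits = [(tanimoto(query, fp), name) for name, fp in library.items()]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)

# Hypothetical descriptors for a query structure and a three-compound library.
query = fingerprint([3, 17, 42, 100, 587])
library = {
    "cmpd_A": fingerprint([3, 17, 42, 100, 586]),
    "cmpd_B": fingerprint([3, 200, 300]),
    "cmpd_C": fingerprint([3, 17, 42, 100, 587, 12]),
}
hits = search(query, library)
print(hits)
```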

In the target (T) database, each row defines the pattern (across 60 cell lines) of a measured cell characteristic that may mediate, modulate, or otherwise correlate with the activity of a tested compound. When the term is used in this general shorthand sense, a “target” may be the site of action or part of a pathway involved in a cellular response. Among the potential targets assessed to date are oncogenes, tumor-suppressor genes, drug resistance-mediating transporters, heat shock proteins, telomerase, cytokine receptors, molecules of the cell cycle and apoptotic pathways, DNA repair enzymes, components of the cytoarchitecture, intracellular signaling molecules, and metabolic enzymes (13).

In addition to the targets assessed one at a time, others have been measured en masse as part of a protein expression database generated for the 60 cell lines by 2D polyacrylamide gel electrophoresis (2D PAGE) (14). The aim is to look for molecules that have not been considered previously as targets. In the process, a link has been established between the molecular pharmacology of cancer and the growing enterprise of proteome research (15). The current database consists of 1014 indexed and quantitated protein spots, of which 151 have been quality controlled over all 60 current cell lines and incorporated into a primary data set for analysis (14). Analogous links to genome research are being established through analyses of gene amplification and mRNA expression patterns. Figure 1 indicates approximately 100 targets, but that number is increasing rapidly.

Relating Molecular Targets to Drug Activity Patterns

The first target analyzed in detail by the COMPARE program was the drug-resistance transporter P-glycoprotein (Pgp), encoded by multidrug resistance gene MDR-1 (16, 17, 18). The result was a list of agents predicted and then experimentally verified to be good Pgp substrates. Related strategies identified Pgp inhibitors (19). We present here a complementary approach for analysis and display of these data, the DISCOVERY program package (20), which maps coherent patterns in the data, rather than treating the compounds and targets one pair at a time. Because the S, A, and T databases contain, in aggregate, many millions of numbers, the challenge was to compact that information sufficiently for analysis without losing or obscuring important local features of the data. These often contradictory requirements have guided development of DISCOVERY, which integrates the disparate types of information on the compounds and displays them in novel ways suited to human pattern recognition. The same algorithms can be applied to other types of databases, including those generated by screening and profiling systems in which agents are tested in multiple assays—for example, against mammalian cells, yeast mutants, bacteria, or biochemical targets.

Figure 2 shows a color-coded DISCOVERY pattern map relating a T database of 113 target vectors to an A database of 3989 nonconfidential compounds deemed sufficiently interesting in the initial screen to be tested more than once. This map was obtained by an algorithm we term “clustered correlation” (ClusCor). Each database was treated as a mathematical matrix, and the following four steps were applied: (i) each row of A and T was normalized by its mean and standard deviation; (ii) the two matrices were multiplied to obtain A·T′, where the prime symbol indicates the matrix transpose; (iii) each entry was divided by n − 1, where n (=60) is the number of cell lines, producing a matrix of Pearson correlation coefficients relating activity and target patterns; and (iv) the rows and columns of the product matrix were rearranged into “cluster order.” Only with this last step did patterns emerge.
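The four steps above can be sketched in a few lines of plain Python. Steps (i) to (iii) follow the text directly. The clustering method for step (iv) is not specified here, so average-linkage agglomeration on Euclidean distance is used as a stand-in assumption; the function names and toy matrices are hypothetical.

```python
import math

def zscore_rows(m):
    """Step (i): normalize each row by its mean and sample standard deviation."""
    out = []
    for row in m:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
        out.append([(x - mean) / sd for x in row])
    return out

def correlate(a, t):
    """Steps (ii)-(iii): entries of A.T' divided by n - 1 are Pearson coefficients."""
    az, tz = zscore_rows(a), zscore_rows(t)
    n = len(a[0])
    return [[sum(x * y for x, y in zip(ra, rt)) / (n - 1) for rt in tz]
            for ra in az]

def cluster_order(rows):
    """Step (iv), assumed method: average-linkage agglomeration.
    Merged clusters keep their members adjacent in the returned order."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = sum(math.dist(rows[i], rows[j])
                        for i in clusters[p] for j in clusters[q])
                d /= len(clusters[p]) * len(clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0]

def cluscor(a, t):
    """Return the cluster-ordered correlation matrix and both orderings."""
    r = correlate(a, t)
    row_order = cluster_order(zscore_rows(a))        # compounds, by activity pattern
    col_order = cluster_order([list(c) for c in zip(*r)])  # targets, by column of r
    return [[r[i][j] for j in col_order] for i in row_order], row_order, col_order

# Toy data: 4 compounds and 2 targets measured over 5 cell lines (hypothetical).
a = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5], [5, 4, 3, 2, 1], [5.2, 4.1, 2.9, 2, 1]]
t = [[1, 2, 3, 4, 5], [3, 1, 4, 1, 5]]
r = correlate(a, t)
matrix, row_order, col_order = cluscor(a, t)
print(row_order, col_order)
```

The pairwise search in `cluster_order` is written for clarity, not speed; for matrices of the size discussed in the text a dedicated numerical library would be the practical choice.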

Fig. 2.

“Clustered correlation” (ClusCor) map of the relation between compounds tested and molecular targets in the cells. This normalized A·T′ product matrix (where the prime symbol indicates the matrix transpose) correlates target patterns across the 60 cell lines with patterns of growth inhibition for an important set of 3989 compounds. A red or orange point (high positive Pearson correlation coefficient) indicates that the agent tends to be selectively active against cell lines that express the target in large amounts (or in functional form). A dark blue point (high negative correlation) indicates the opposite tendency (selective potency against cell lines that have less target or function). The 113 columns correspond to 76 distinct target molecules or functions, some represented multiple times in different mathematical transformations. Compounds and targets are cluster-ordered as explained in the text. To the right is shown one 61-leaf “twig” of the overall 3989-leaf cluster tree of compounds. 
Symbols for mechanisms of action (6, 8) are as follows: T1, topoisomerase 1 inhibitors; T2, topoisomerase 2 inhibitors; A, alkylating agents; Pt, platinum compounds (of the cisplatin-carboplatin family); Pt-Si, platinum agents containing a silane moiety; ?, mechanism unknown; PCNA, proliferating cell nuclear antigen determined from 2D gels (column 16) (14); p53 seq, p53 sequence, wild-type versus mutant (30); p53 fu., p53 function in a yeast-based assay (30); p53 prot., p53 protein expression by protein immunoblot (columns 29 and 30) (30); hsp, heat shock-related proteins (Hsp60, Hsc70, Hsp90, Grp75, Grp78) from 2D gels (columns 40 to 45) (14); gadd45, mdm2, and p21, GADD45, MDM2, and p21CIP1/WAF1 mRNA induction in response to γ-irradiation (columns 54 to 57, 60, and 61 to 64, respectively) (30); G1, G1 arrest in response to γ-irradiation, assessed by flow cytometry (columns 65 to 69) (30); mrp, mRNA expression levels for the MRP multidrug resistance transporter (columns 75 and 76) (18); mdr, MDR-1 mRNA (16) and function in terms of rhodamine efflux (columns 81 to 88) (17); TGF-αR, transforming growth factor-α receptor mRNA (columns 89 to 91); EGFR, epidermal growth factor receptor (column 92) (37); and Ras, RAS sequence, wild-type versus mutant (38).

The 3989 compounds were cluster-ordered (21) along the ordinate on the basis of their activity patterns across the 60 cell lines. Thus, compounds with the most nearly identical patterns appear side by side. Because this clustering of compounds was done independently of targets, the coherent patterns observed as patches of color validate the hypothesis that the activity patterns and targets are related. The possibility that these patterns were created spuriously by the clustering process is ruled out by the lack of pattern features in Fig. 5A. Figure 5A shows the result when the 60 activity values for each drug were randomly permuted before the calculation and clustering algorithm that had produced Fig. 2 were applied. The 113 targets were cluster-ordered along the abscissa in Fig. 2 on the basis of their apparent effect on activities of compounds in the database. Thus, targets with the most similar columns of correlation coefficients appear side by side.
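The permutation control described above can be sketched as follows: the cell-line values within each compound's row are shuffled independently, and the correlation with a target is recomputed. The synthetic data, the noise level, and the summary statistic (mean absolute correlation) are assumptions made for illustration only.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_abs_r(activity, target):
    """Average |r| between one target pattern and every compound's pattern."""
    return sum(abs(pearson(row, target)) for row in activity) / len(activity)

def permute_rows(activity, rng):
    """Shuffle the cell-line values within each compound's row independently."""
    return [rng.sample(row, len(row)) for row in activity]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic target pattern over 60 cell lines and 20 compounds that track it.
target = [rng.gauss(0, 1) for _ in range(60)]
activity = [[v + rng.gauss(0, 0.3) for v in target] for _ in range(20)]

shuffled = permute_rows(activity, rng)
observed = mean_abs_r(activity, target)
null = mean_abs_r(shuffled, target)
print(f"observed mean |r| = {observed:.2f}, after permutation = {null:.2f}")
```

Shuffling preserves each compound's distribution of values while destroying the link between a value and its cell line, so any structure that survives the shuffle would have to come from the analysis procedure itself.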

Fig. 5.

Four types of “clustered correlation” (ClusCor) matrices involving the S, A, and T databases. (A) Activity vectors of the compounds were randomly permuted, and all calculations (including clustering) were then done exactly as for Fig. 2. The lack of apparent pattern verifies that clustering did not spuriously create the patterns seen in Fig. 2. (B) A normalized T·T′ database, which cross-correlates patterns of target expression. Targets with the most similar patterns of expression appear side by side. Because a target's expression is 100% correlated with itself by definition, all values on the principal diagonal are color-coded red. Because of the clustering, targets positively correlated in their expression produce red patches straddling the diagonal. (C) A normalized (A·T′)′·(A·T′) database, similar to (B) except that targets are characterized, not in terms of their expression levels, but in terms of their correlations with activity patterns of the 3989 compounds. (D) An S′·(A·T′) database. This database relates 2D substructures of the compounds (20) to targets through the activity patterns of the compounds.

To illustrate the result of the clustering process, the right-hand side of Fig. 2 shows one small 61-leaf “twig” of the overall 3989-leaf cluster tree. Compounds similar in mechanism of action cluster together. Among the classes that are organized in a coherent way elsewhere in Fig. 2 are the Taxol (paclitaxel) analogs (taxanes): 34 of the 37 taxanes in the database appear side by side (compounds 620 to 653), and the other 3 are found on nearby twigs (compounds 655, 658, and 701). The largest chemically coherent set of compounds is a set of 72 thiosemicarbazones (compounds 1491 to 1579, with small gaps occupied by phenylhydrazones) (22). Most of the tin-containing molecules in the database are contiguous (compounds 2034 to 2062). The closely related clinical agents cisplatin and carboplatin fall side by side (compounds 3260 and 3261) within one cluster of 11 structurally related platinum analogs, whereas the diaminocyclohexyl platinum compounds, which have very different pharmacological behavior (23), fall elsewhere in the map (compounds 2838 to 2849). Perhaps more important than the branches with known agents, however, are those that contain no familiar compounds. The DISCOVERY program set, as its name implies, was developed primarily to explore and organize these new classes of compounds.

Although some degree of coherent clustering was expected for families of molecules related by chemistry or mechanism, the precision indicated by the above examples was unexpected; the a priori probability that any given pair of compounds would appear as nearest neighbors along the ordinate in the set of 3989 is only 2 in 3988. An explanation for the observed coherence is suggested by a thought experiment in which the patterns are considered, because of experimental noise, to be binary; that is, one is assumed to know only whether a cell line is more sensitive or less sensitive than the median. Then each compound would have one of 60!/(30!30!) = 1.2 × 10¹⁷ possible patterns (that is, the number of ways of choosing the 30 out of 60 that fall above the median). The number would increase to 2⁶⁰ = 1.2 × 10¹⁸ for all possible binary patterns and to 4⁶⁰ = 1.3 × 10³⁶ if four levels of sensitivity could be reliably distinguished. Each compound displays a unique “fingerprint” pattern, defined by a point in the 60D space (one dimension for each cell line) of possible patterns. In information theoretic terms, the transmission capacity of this communication channel is very large, even after one allows for experimental noise and for biological realities that constrain the compounds to particular regions of the 60D space. Although the activity data have been accumulated over a 6-year period, the experiments have been reproducible enough to generate the patterns of coherence described here (24).
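The pattern counts in this thought experiment are easy to check directly. The short sketch below recomputes them with exact integer arithmetic; the variable names are illustrative only.

```python
import math

N_LINES = 60  # cell lines in the panel

# Patterns when exactly 30 of the 60 lines fall above the median: 60!/(30!30!)
median_split = math.comb(N_LINES, N_LINES // 2)
# All binary patterns (each line independently "more" or "less" sensitive)
binary = 2 ** N_LINES
# Patterns when four sensitivity levels can be distinguished per line
four_level = 4 ** N_LINES

print(f"median-split patterns: {median_split:.1e}")
print(f"binary patterns:       {binary:.1e}")
print(f"four-level patterns:   {four_level:.1e}")

# Upper bound on information per pattern, in bits, before allowing for noise
bits_four_level = N_LINES * math.log2(4)
print(f"four-level capacity:   {bits_four_level:.0f} bits")
```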

Each patch of color in Fig. 2 suggests a possible correlation between targets and compounds. The dark blue patch for compounds 513 to 667 indicates that these compounds are highly negative in their correlation with targets 81 to 88, which are all indices of Pgp/Mdr-1 expression and function (16, 17, 18). Several lines of evidence indicate the significance of this observation. (i) We analyzed cell screen data for a set of 35 compounds of diverse structure and mechanism that had been reported previously on the basis of transport assays to be Mdr-1 substrates (17, 25). Of these, 18 (51%) fell within the blue patch, a percentage 13-fold greater than the 4% (155/3989) expected by chance alone. The probability (exact binomial) of such an extreme event happening by chance is <0.0001. (ii) Although 18 of 35 reported substrates fell within the patch, 0 of 12 compounds reported not to be substrates (17, 25) did so (P = 0.0010 by one-sided Fisher's exact test for the associated 2 by 2 table). (iii) It has been reported (17) that Mdr-1 substrates tend to be natural products, high in molecular weight, and often cationic. We find by linear discriminant analysis that these three factors predict with a sensitivity of 78% and a specificity of 84% which compounds will be found in the blue patch (P < 0.0001). These findings further validate the patterns seen in Fig. 2.
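The enrichment statistics quoted in points (i) and (ii) can be reproduced from the counts given in the text. The sketch below assumes that the exact binomial probability is an upper tail and that the 2 by 2 table has substrates and non-substrates as rows and inside or outside the patch as columns; the helper functions are written from scratch for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Exact probability of k or more successes in n trials with success rate p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact P for the table [[a, b], [c, d]]:
    probability of a or more in the top-left cell with all margins fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x)
               for x in range(a, upper + 1)) / total

patch_fraction = 155 / 3989          # share of all compounds inside the blue patch
fold = (18 / 35) / patch_fraction    # enrichment of reported substrates in the patch
p_binomial = binomial_upper_tail(18, 35, patch_fraction)
# Rows: reported substrates, reported non-substrates; columns: in patch, not in patch
p_fisher = fisher_one_sided(18, 17, 0, 12)

print(f"fold enrichment: {fold:.1f}")
print(f"exact binomial P: {p_binomial:.2e}")
print(f"one-sided Fisher P: {p_fisher:.4f}")
```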

Columns 76 and 77 in Fig. 2 are indices of messenger RNA (mRNA) expression for Mrp, another transport molecule associated with multidrug resistance (18). There is only a slight overlap between the Mdr-1- and Mrp-sensitive families of compounds. As indicated by columns 40 to 45, high basal levels of heat shock proteins (Hsp60, Hsp90, Hsc70, Grp75, and Grp78) correlate positively with activity for a large set of agents, including some of those in the group sensitive to Mdr-1. This type of analysis makes it possible to cross-compare multiple targets for their expression levels and for their apparent impact on the activities of different classes of agents (26).

Activity Patterns and p53 Pathway Status

The p53 tumor-suppressor gene is mutated in more than 50% of human tumors, more than any other gene examined to date (27). p53 functions as a transcriptional regulator with the ability to both transactivate and suppress gene transcription (28). It is activated in response to DNA damage and can orchestrate a number of cellular responses to genotoxic stress, including G1 arrest and apoptosis (29). A large cluster of compounds (numbers 2802 to 3309) is positively correlated with intact p53 pathway status (as indicated by a large red patch in Fig. 2). The indices of p53 status assessed in the cells include p53 sequence, basal p53 protein level, p53 function in a yeast-based assay, G1 checkpoint integrity, and γ-ray induction of the p53-regulated genes p21CIP1/WAF1, MDM2, and GADD45 (30). The activity patterns of most of these compounds are inversely correlated with expression levels of p53 protein, as would be expected given that the protein is overexpressed in most p53-mutant cell types (29).

Compounds 2802 to 3309 include a large percentage of the familiar cytotoxic antitumor agents. Of 86 agents considered evaluable on the basis of phase II clinical trials (31), 45 appear in this relatively small region of the map, giving an odds ratio of (45/41)/(463/3440) = 8.2:1 (P < 0.0001 by Fisher's exact test). This odds ratio substantially understates the enrichment of this region of the map with clinical agents because the region is artificially enlarged by the many analogs synthesized on the basis of the clinical molecules (21).
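As a worked check of the odds ratio, the four counts can be laid out as a 2 by 2 table. The variable names below are illustrative; the counts are those given in the text.

```python
# Compounds 2802 to 3309 inclusive define the region of the map.
region_size = 3309 - 2802 + 1
total_compounds = 3989

clinical_in, clinical_out = 45, 41   # the 86 phase II-evaluable clinical agents
other_in, other_out = 463, 3440      # all remaining compounds in the map

# The four cells should account for the region and for the whole map.
cells_match_region = clinical_in + other_in == region_size
cells_match_total = (clinical_in + clinical_out
                     + other_in + other_out) == total_compounds

odds_ratio = (clinical_in / clinical_out) / (other_in / other_out)
print(f"region size: {region_size}, odds ratio: {odds_ratio:.1f}:1")
```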

The correlation of p53 pathway factors with activity patterns for a subset of the clinical agents with defined mechanisms of action (6, 8) is shown in Fig. 3. Most, although not all, of the agents damage DNA, and in this assay they tended to be more potent in p53 wild-type cells than in p53 mutant ones (32). The principal exception was the set of antimitotic tubulin-active agents, including Taxol, which generally do not show any clear correlation with p53 status. Examination of a previously defined set (6, 8) of 123 standard anticancer agents (which overlaps with the set of clinical agents studied here) yields similar results (30).

Fig. 3.

Relation between p53 pathway molecular targets and patterns of activity for clinically evaluated anticancer agents. The compounds have been grouped by their presumed principal mechanisms of action. A number of additional antitubulin agents have been added to increase representation of that category. Color coding indicates the Pearson correlation coefficient relating agent to target. A2, guanine-N2 alkylator; A7, guanine-N7 alkylator; AC, chloroethylating alkylator; D, DNA-RNA antimetabolite; PS, protein synthesis inhibitor; R, RNA antimetabolite; RF, antifolate RNA antimetabolite; T2, topoisomerase II inhibitor; TU, antitubulin (antimitotic) agent. The data on p53 pathway parameters are from (30).

The large majority of clinical agents appear in this assay to be more active on average in the p53 wild-type cells (Fig. 4A). In contrast, the p53 association is much less pronounced for the set of 3989 multiply tested molecules (Fig. 4B) or for all compounds tested. We examined compounds at the left of Fig. 4B for agents that might be effective in p53-mutant human tumors. In this search for “p53-inverse” (or at least “p53-indifferent”) compounds, we used the COMPARE and DISCOVERY program sets to generate lists of candidates on the basis of various sets of explicit criteria (20, 33). Selected compounds are being tested in p53-isogenic human cell sets (34), and lead compounds that perform favorably will be further evaluated in vivo.

Fig. 4.

Histograms showing the relation between p53 status and patterns of growth inhibition in the screen (A) for a set of 86 phase II-evaluable clinical agents and (B) for a set of 3989 multiply tested compounds. Most of the clinical agents appear more active in the presence of wild-type p53; the other compounds show a lesser trend in the same direction. The parameter calculated for each drug has the form of a Wilcoxon rank sum P value. P > 0.5 indicates a compound that tends in this screening assay to be more active in the cells with wild-type p53; P < 0.5 indicates the opposite tendency. Values >0.975 or <0.025 would be required to reject the null hypothesis of equal median activities in p53 wild-type and mutant cells for any single compound. The data on p53 sequence are from (30).
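One way to compute a per-compound index of this form is sketched below. It assumes that activity is expressed so that larger values mean greater potency, uses the normal approximation to the rank sum (Mann-Whitney) statistic, and omits tie and continuity corrections; the exact procedure behind the figure may differ, and the example values are hypothetical.

```python
import math

def rank_sum_index(wild_type, mutant):
    """Index between 0 and 1: values above 0.5 mean the compound tends to be
    more active in the wild-type group, values below 0.5 the opposite."""
    n1, n2 = len(wild_type), len(mutant)
    # U counts wild-type/mutant pairs in which the wild-type line is more sensitive.
    u = sum((x > y) + 0.5 * (x == y) for x in wild_type for y in mutant)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z

# Hypothetical activity values for one compound in the two groups of cell lines.
wt = [6.1, 6.4, 7.0, 6.8, 6.6]
mut = [5.0, 5.2, 5.5, 5.1, 5.3]
p_wt_favoured = rank_sum_index(wt, mut)
p_swapped = rank_sum_index(mut, wt)
print(f"{p_wt_favoured:.4f} {p_swapped:.4f}")
```

Swapping the two groups gives the complementary value, which is why a histogram of this index is symmetric about 0.5 under the null hypothesis.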

Target-Target and Target-Structure Correlations

As indicated by the ClusCor matrices shown in Fig. 5, the databases on activity, molecular structure, and targets have implications for basic biology and pharmacology as well as for drug discovery per se. The correlation of each target's pattern of expression across the 60 cell lines with that of each other target is shown in Fig. 5B. Values of the correlation coefficient on the main diagonal are, by definition, unity because each target is 100% correlated with itself. The red patches of high correlation straddling the diagonal appear because the targets are listed on both ordinate and abscissa in cluster order on the basis of patterns of expression. Clusters of targets related to Mdr-1, heat shock proteins, and p53 function show high degrees of internal correlation. In many instances (for example, that of p53 and induction of the downstream genes p21CIP1/WAF1, GADD45, and MDM2) this observation reflects the known biochemical relationships (27, 29), further validating the significance of patterns seen in Figs. 2 and 5.

A similar pattern of correlation is shown in Fig. 5C, which relates targets to each other, not in terms of their levels of expression, but in terms of their relation to activity profiles for the 3989 compounds in the database. Again, the same three families of targets are highly correlated. As the cells are characterized with respect to more and more targets, these correlations will generate an increasing number of testable cell biological hypotheses for further study. The relation of targets to chemical substructures of compounds through the database of activity patterns is shown in Fig. 5D. Although nonrandom patterns are apparent, they are less pronounced, and other, nonlinear methods of analysis (including ones based on genetic algorithm and neural networks) may prove to be better suited for analysis of this type of relationship.
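The two target-by-target views can be contrasted in a short sketch. It assumes that the first is the row-normalized T·T′ matrix and that the second correlates the targets' columns of the A·T′ matrix with one another, so that both results are target by target; the function names and toy matrices are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_rows(m, k):
    """Correlate every row of m with every row of k."""
    return [[pearson(rm, rk) for rk in k] for rm in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy data: 4 compounds (A) and 3 targets (T) over 5 cell lines (hypothetical).
A = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5], [5, 4, 3, 2, 1], [5.2, 4.1, 2.9, 2, 1]]
T = [[1, 2, 3, 4, 5], [1.1, 2.0, 3.2, 3.9, 5.1], [3, 1, 4, 1, 5]]

# View 1: targets compared by their expression patterns across cell lines.
tt_expression = corr_rows(T, T)

# View 2: targets compared by how they correlate with each compound's activity.
at = corr_rows(A, T)            # compounds x targets
target_profiles = transpose(at) # one row of coefficients per target
tt_activity = corr_rows(target_profiles, target_profiles)

print(tt_expression[0][1], tt_activity[0][1])
```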

Hypothesis Generation in the Molecular Pharmacology of Cancer

The approach to drug discovery and molecular pharmacology presented here serves a number of functions. (i) It suggests novel targets and mechanisms of action or modulation. (ii) It detects inhibition of integrated biochemical pathways not adequately represented by any single molecule or molecular interaction. (This feature of cell-based assays is likely to be more important in the development of therapies for cancer than it is for most other diseases; in the case of cancer, one is fighting the plasticity of a poorly controlled genome and the selective evolutionary pressures for development of drug resistance.) (iii) It provides candidate molecules for secondary testing in biochemical assays; conversely, it provides a well-characterized biological assay in vitro for compounds emerging from biochemical screens. (iv) It “fingerprints” tested compounds with respect to a large number of possible targets and modulators of activity. (v) It provides such fingerprints for all previously tested compounds whenever a new target is assessed in many or all of the 60 cell lines. (In contrast, if a battery of assays for different biochemical targets were applied to, for example, 60,000 compounds, it would be necessary to retest all of the compounds for any new target or assay.) (vi) It links the molecular pharmacology with emerging databases on molecular markers in microdissected human tumors—which, under the rubric of this article, constitute clinical (C) databases (35). (vii) It provides the basis for pharmacophore development and searches of an S database for additional candidates. If an agent with a desired action is already known, its fingerprint patterns of activity can be used by COMPARE, DISCOVERY, neural networks, and other pattern-recognition technologies to find similar compounds.

This approach to drug discovery and molecular pharmacology can be likened to a clinical trial with 60 patients (cell types), each profiled with respect to a variety of molecular markers and each treated with 60,000 different agents, one at a time. It can also be considered as a hypothesis generator based on a set of 60,000 × 60 = 3.6 million pharmacology experiments. The important word here is “hypothesis.” Information from the cell lines is fundamentally correlative and subject to confounding influences. Hypotheses generated must be tested by means of biochemical assays or isogenic systems that differ, insofar as possible, with respect to just one factor. Conversely, hypotheses based on experiments with particular isogenic cell sets can be assessed for generality according to whether they correctly predict responses for most of the 60 cell lines in the screen. For example, the overall impact of p53 function on cellular chemosensitivity can be affected by multiple genotypic and phenotypic factors that determine the balance between p53-mediated apoptosis on the one hand and G1 arrest and DNA repair on the other (29); results obtained for one parental cell type can be misleading if generalized to others. The target and activity databases have, increasingly, provided us with a basis for rational choice of parental and transfected cell pairs to use in experiments addressing particular biological questions.


  8. A Kohonen neural network was used to generate self-organized 2D maps in which the distances among compounds reflected the differences in their patterns of activity.
  15. The term “proteome” was introduced, by analogy with “genome,” to denote in an aggregate sense the protein complement of a cell or organism.