Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring

See allHide authors and affiliations

Science  15 Oct 1999:
Vol. 286, Issue 5439, pp. 531-537
DOI: 10.1126/science.286.5439.531


Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

The challenge of cancer treatment has been to target specific therapies to pathogenetically distinct tumor types, to maximize efficacy and minimize toxicity. Improvements in cancer classification have thus been central to advances in cancer treatment. Cancer classification has been based primarily on morphological appearance of the tumor, but this has serious limitations. Tumors with similar histopathological appearance can follow significantly different clinical courses and show different responses to therapy. In a few cases, such clinical heterogeneity has been explained by dividing morphologically similar tumors into subtypes with distinct pathogeneses. Key examples include the subdivision of acute leukemias, non-Hodgkin's lymphomas, and childhood “small round blue cell tumors” [tumors with variable response to chemotherapy (1) that are now molecularly subclassified into neuroblastomas, rhabdomyosarcoma, Ewing's sarcoma, and other types (2)]. For many more tumors, important subclasses are likely to exist but have yet to be defined by molecular markers. For example, prostate cancers of identical grade can have widely variable clinical courses, from indolence over decades to explosive growth causing rapid patient death. Cancer classification has been difficult in part because it has historically relied on specific biological insights, rather than systematic and unbiased approaches for recognizing tumor subtypes. Here we describe such an approach based on global gene expression analysis.

We divided cancer classification into two challenges: class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes, which could reflect current states or future outcomes.

We chose acute leukemias as a test case. Classification of acute leukemias began with the observation of variability in clinical outcome (3) and subtle differences in nuclear morphology (4). Enzyme-based histochemical analyses were introduced in the 1960s to demonstrate that some leukemias were periodic acid-Schiff positive, whereas others were myeloperoxidase positive (5). This provided the first basis for classification of acute leukemias into those arising from lymphoid precursors (acute lymphoblastic leukemia, ALL) or from myeloid precursors (acute myeloid leukemia, AML). This classification was further solidified by the development in the 1970s of antibodies recognizing either lymphoid or myeloid cell surface molecules (6). Most recently, particular subtypes of acute leukemia have been found to be associated with specific chromosomal translocations—for example, the t(12;21)(p13;q22) translocation occurs in 25% of patients with ALL, whereas the t(8;21)(q22;q22) occurs in 15% of patients with AML (7).

Although the distinction between AML and ALL has been well established, no single test is currently sufficient to establish the diagnosis. Rather, current clinical practice involves an experienced hematopathologist's interpretation of the tumor's morphology, histochemistry, immunophenotyping, and cytogenetic analysis, each performed in a separate, highly specialized laboratory. Although usually accurate, leukemia classification remains imperfect and errors do occur.

Distinguishing ALL from AML is critical for successful treatment; chemotherapy regimens for ALL generally contain corticosteroids, vincristine, methotrexate, and l-asparaginase, whereas most AML regimens rely on a backbone of daunorubicin and cytarabine (8). Although remissions can be achieved using ALL therapy for AML (and vice versa), cure rates are markedly diminished, and unwarranted toxicities are encountered.

We set out to develop a more systematic approach to cancer classification based on the simultaneous expression monitoring of thousands of genes using DNA microarrays (9). It has been suggested (10) that such microarrays could provide a tool for cancer classification. Microarray studies to date (11), however, have primarily been descriptive rather than analytical and have focused on cell culture rather than primary patient material, in which genetic noise might obscure an underlying reproducible expression pattern.

We began with class prediction: How could one use an initial collection of samples belonging to known classes (such as AML and ALL) to create a “class predictor” to classify new, unknown samples? We developed an analytical method and first tested it on distinctions that are easily made at the morphological level, such as distinguishing normal kidney from renal cell carcinoma (12). We then turned to the more challenging problem of distinguishing acute leukemias, whose appearance is highly similar.

Our initial leukemia data set consisted of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis (13). RNA prepared from bone marrow mononuclear cells was hybridized to high-density oligonucleotide microarrays, produced by Affymetrix and containing probes for 6817 human genes (14). For each gene, we obtained a quantitative expression level. Samples were subjected to a priori quality control standards regarding the amount of labeled RNA and the quality of the scanned microarray image (15).

The first issue was to explore whether there were genes whose expression pattern was strongly correlated with the class distinction to be predicted. The 6817 genes were sorted by their degree of correlation (16). To establish whether the observed correlations were stronger than would be expected by chance, we developed a method called “neighborhood analysis” (Fig. 1A). Briefly, one defines an “idealized expression pattern” corresponding to a gene that is uniformly high in one class and uniformly low in the other. One tests whether there is an unusually high density of genes “nearby” (that is, similar to) this idealized pattern, as compared to equivalent random patterns.

Figure 1

Schematic illustration of methodology. (A) Neighborhood analysis. The class distinction is represented by an “idealized expression pattern” c, in which the expression level is uniformly high in class 1 and uniformly low in class 2. Each gene is represented by an expression vector, consisting of its expression level in each of the tumor samples. In the figure, the data set is composed of six AMLs and six ALLs. Geneg 1 is well correlated with the class distinction, whereas g 2 is poorly correlated. Neighborhood analysis involves counting the number of genes having various levels of correlation with c. The results are compared to the corresponding distribution obtained for random idealized expression patterns c*, obtained by randomly permuting the coordinates of c. An unusually high density of genes indicates that there are many more genes correlated with the pattern than expected by chance. The precise measure of distance and other methodological details are described in (16,17) and on our Web site ( (B) Class predictor. The prediction of a new sample is based on ”weighted votes“ of a set of informative genes. Each such gene gi votes for either AML or ALL, depending on whether its expression level xi in the sample is closer to μAML or μALL (which denote, respectively, the mean expression levels of AML and ALL in a set of reference samples). The magnitude of the vote iswivi , wherewi is a weighting factor that reflects how well the gene is correlated with the class distinction andvi = |xi − (μAML + μALL)/2| reflects the deviation of the expression level in the sample from the average of μAML and μALL. The votes for each class are summed to obtain total votes V AML andV ALL. The sample is assigned to the class with the higher vote total, provided that the prediction strength exceeds a predetermined threshold. The prediction strength reflects the margin of victory and is defined as (V win V lose)/(V win +V lose), where V win andV lose are the respective vote totals for the winning and losing classes. Methodological details are described in (19, 20) and on the Web site.

For the 38 acute leukemia samples, neighborhood analysis showed that roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance (Fig. 2) (17). This suggested that classification could indeed be based on expression data.

Figure 2

Neighborhood analysis: ALL versus AML. For the 38 leukemia samples in the initial data set, the plot shows the number of genes within various “neighborhoods” of the ALL-AML class distinction together with curves showing the 5 and 1% significance levels for the number of genes within corresponding neighborhoods of the randomly permuted class distinctions (16,17). Genes more highly expressed in ALL compared to AML are shown in the left panel; those more highly expressed in AML compared to ALL are shown in the right panel. The large number of genes highly correlated with the class distinction is apparent. In the left panel (higher in ALL), the number of genes with correlationP(g,c) > 0.30 was 709 for the AML-ALL distinction, but had a median of 173 genes for random class distinctions.P(g,c) = 0.30 is the point where the observed data intersect the 1% significance level, meaning that 1% of random neighborhoods contain as many points as the observed neighborhood around the AML-ALL distinction. Similarly, in the right panel (higher in AML), 711 genes withP(g,c) > 0.28 were observed, whereas a median of 136 genes is expected for random class distinctions.

The second issue was how to use a collection of known samples to create a “class predictor” capable of assigning a new sample to one of two classes. We developed a procedure that uses a fixed subset of “informative genes” (chosen based on their correlation with the class distinction) and makes a prediction on the basis of the expression level of these genes in a new sample. Each informative gene casts a “weighted vote” for one of the classes, with the magnitude of each vote dependent on the expression level in the new sample and the degree of that gene's correlation with the class distinction (Fig. 1B) (18, 19). The votes were summed to determine the winning class, as well as a “prediction strength” (PS), which is a measure of the margin of victory that ranges from 0 to 1 (20). The sample was assigned to the winning class if PS exceeded a predetermined threshold, and was otherwise considered uncertain. On the basis of previous analysis, we used a threshold of 0.3 (21).

The third issue was how to test the validity of class predictors. We used a two-step procedure. The accuracy of the predictors was first tested by cross-validation on the initial data set. (Briefly, one withholds a sample, builds a predictor based only on the remaining samples, and predicts the class of the withheld sample. The process is repeated for each sample, and the cumulative error rate is calculated.) One then builds a final predictor based on the initial data set and assesses its accuracy on an independent set of samples.

We applied this approach to the 38 acute leukemia samples. The set of informative genes to be used in the predictor was chosen to be the 50 genes most closely correlated with AML-ALL distinction in the known samples. The parameters of the predictor were determined by the expression levels of these 50 genes in the known samples. The predictor was then used to classify new samples, by applying it to the expression levels of these genes in the sample.

The 50-gene predictors derived in cross-validation tests assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3) (22). All 36 predictions agreed with the patients' clinical diagnosis.

We then created a 50-gene predictor on the basis of all 38 samples and applied it to an independent collection of 34 leukemia samples. The specimens consisted of 24 bone marrow and 10 peripheral blood samples (23). In total, the predictor made strong predictions for 29 of the 34 samples, and the accuracy was 100%. The success was notable because the collection included a much broader range of samples, including samples from peripheral blood rather than bone marrow, from childhood AML patients, and from different reference laboratories that used different sample preparation protocols. Overall, the prediction strengths were quite high (median PS = 0.77 in cross-validation and 0.73 in independent test) (Fig. 3A). The average prediction strength was lower for samples from one laboratory that used a very different protocol for sample preparation. This suggests that clinical implementation of such an approach should include standardization of sample preparation.

Figure 3

(A) Prediction strengths. The scatterplots show the prediction strengths (PSs) for the samples in cross-validation (left) and on the independent sample (right). Median PS is denoted by a horizontal line. Predictions with PS < 0.3 are considered as uncertain. (B) Genes distinguishing ALL from AML. The 50 genes most highly correlated with the ALL-AML class distinction are shown. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. Expression levels for each gene are normalized across the samples such that the mean is 0 and the SD is 1. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates SDs above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML. Although these genes as a group appear correlated with class, no single gene is uniformly expressed across the class, illustrating the value of a multigene prediction method. For a complete list of gene names, accession numbers, and raw expression values,

The choice to use 50 informative genes in the predictor was somewhat arbitrary. The number was well within the total number of genes strongly correlated with the class distinction (Fig. 2), seemed likely to be large enough to be robust against noise, and was small enough to be readily applied in a clinical setting. In fact, the results were insensitive to the particular choice: Predictors based on between 10 and 200 genes were all found to be 100% accurate, reflecting the strong correlation of genes with the AML-ALL distinction (24).

The list of informative genes used in the AML versus ALL predictor was highly instructive (Fig. 3B). Some, including CD11c,CD33, and MB-1, encode cell surface proteins for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells (25). Others provide new markers of acute leukemia subtype. For example, the leptin receptor, originally identified through its role in weight regulation, showed high relative expression in AML. The leptin receptor was recently demonstrated to have anti-apoptotic function in hematopoietic cells (26). Similarly, the zyxin gene has been shown to encode a LIM domain protein important in cell adhesion in fibroblasts, but a role in hematopoiesis has not been reported (27).

We had expected that the genes most useful in AML-ALL class prediction would simply be markers of hematopoietic lineage, and would not necessarily be related to cancer pathogenesis. However, many of the genes encode proteins critical for S-phase cell cycle progression (Cyclin D3, Op18, and MCM3), chromatin remodeling (RbAp48 and SNF2), transcription (TFIIEβ), and cell adhesion (zyxin and CD11c) or are known oncogenes (c-MYB, E2A andHOXA9). In addition, one of the informative genes encodes topoisomerase II, which is the principal target of the antileukemic drug etoposide (28). Together, these data suggest that genes useful for cancer class prediction may also provide insight into cancer pathogenesis and pharmacology.

The methodology of class prediction can be applied to any measurable distinction among tumors. Importantly, such distinctions could concern a future clinical outcome—such as whether a prostate cancer turns out to be indolent or a breast cancer responds to a given chemotherapy. We explored the ability to predict response to chemotherapy among the 15 adult AML patients who had been treated with an anthracycline-cytarabine regimen and for whom long-term clinical follow-up was available (29). Eight patients failed to achieve remission after induction chemotherapy, while the remaining seven remained in remission for 46 to 84 months. Neighborhood analysis found no striking excess of genes correlated with response to chemotherapy, in contrast to the situation for the AML-ALL distinction, and class predictors that used 10 to 50 genes were not highly accurate in cross-validation. We thus found no evidence of a strong multigene expression signature correlated with clinical outcome, although this could reflect the relatively small sample size. Nonetheless, we examined the most highly correlated genes for potential biological significance. The single most highly correlated gene out of the 6817 genes was the homeobox gene HOXA9, which was overexpressed in patients with treatment failure. Notably, HOXA9 is rearranged by a t(7;11)(p15;p15) chromosomal translocation in a rare subset of AML patients, who tend to have poor outcomes (30). Furthermore, HOXA9 overexpression has been shown to transform myeloid cells in vitro and to cause leukemia in animal models (31). A general role for HOXA9 expression in predicting AML outcome has not been previously suggested. Larger studies will be needed to test this hypothesis.

We next turned to the question of class discovery. The initial identification of cancer classes has been slow, typically evolving through years of hypothesis-driven research. We explored whether cancer classes could be discovered automatically. For example, if the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?

Class discovery entails two issues: (i) developing algorithms to cluster tumors by gene expression and (ii) determining whether putative classes produced by such clustering algorithms are meaningful—that is, whether they reflect true structure in the data rather than simply random aggregation.

To cluster tumors, we used a technique called self-organizing maps (SOMs), which is particularly well suited to the task of identifying a small number of prominent classes in a data set (32). In this approach, the user specifies the number of clusters to be identified. The SOM finds an optimal set of “centroids” around which the data points appear to aggregate. It then partitions the data set, with each centroid defining a cluster consisting of the data points nearest to it.

We applied a two-cluster SOM to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes (33). We first evaluated the clusters by comparing them to the known AML-ALL classes (Fig. 4A). The SOM paralleled the known classes closely: Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective, albeit not perfect, at automatically discovering the two types of leukemia.

Figure 4

ALL-AML class discovery. (A) Schematic representation of two-cluster SOM. A two-cluster (2 by 1) SOM was generated from the 38 initial leukemia samples, with a modification of the GENECLUSTER computer package (32). Each of the 38 samples is thereby placed into one of two clusters on the basis of patterns of gene expression for the 6817 genes assayed in each sample. Cluster A1 contains the majority of ALL samples (gray squares) and cluster A2 contains the majority of AML samples (black circles). (B) Prediction strength (PS) distributions. The scatterplots show the distribution of PS scores for class predictors. The first two plots show the distribution for the predictor created to classify samples as ”A1-type“ or ”A2-type“ tested in cross-validation on the initial data set (median PS = 0.86) and on the independent data set (median PS = 0.61). The remaining plots show the distribution for two predictors corresponding to random classes. In these cases, the PS scores are much lower (median PS = 0.20 and 0.34, respectively), and about half of the samples fall below the threshold for prediction (PS = 0.3). A total of 100 such random predictors were examined, to calculate the distribution of median PS scores to evaluate the statistical significance of the predictor for A1-A2 (36). (C) Schematic representation of the four-cluster SOM. AML samples are shown as black circles, T-lineage ALL as open squares, and B-lineage ALL as gray squares. T- and B-lineages were differentiated on the basis of cell-surface immu- nophenotyping. Class B1 is exclusively AML, class B2 contains all eight T-ALLs, and classes B3 and B4 contain the majority of B-ALL samples. (D) Prediction strength (PS) distributions for pair-wise comparison among classes. Cross-validation prediction studies show that the four classes could be distinguished with high prediction scores, with the exception of classes B3 and B4. These two classes could not be easily distinguished from one another, consistent with their both containing primarily B-ALL samples, and suggesting that B3 and B4 might best be merged into a single class.

We then considered how one could evaluate such putative clusters if the “right” answer were not already known. We reasoned that class discovery could be tested by class prediction: If putative classes reflect true structure, then a class predictor based on these classes should perform well.

To test this hypothesis, we evaluated the clusters A1 and A2. We constructed predictors to assign new samples as “type A1” or “type A2.” Predictors that used a wide range of different numbers of informative genes performed well in cross-validation. For example, a 20-gene predictor gave 34 accurate predictions with high prediction strength, one error, and three uncertains (34). The one “error” was the assignment of the sole AML sample in class A1 to class A2, and two of the three uncertains were ALL samples in class A2. The cross-validation thus not only showed high accuracy, but actually refined the SOM-defined classes: With one exception, the subset of samples accurately classified in cross-validation were those perfectly subdivided by the SOM into ALL and AML classes. The results suggest an iterative procedure for refining clusters, in which an SOM is used to initially cluster the data, a predictor is constructed, and samples not correctly predicted in cross-validation are removed. The edited data set could then be used to generate an improved predictor to be tested on an independent data set (35).

We then tested the class predictor of the A1-A2 distinction on the independent data set. In the general case of class discovery, predictors for novel classes cannot be assessed for “accuracy” on new samples, because the “right” way to classify the independent samples is not known. Instead, however, one can assess whether the new samples are assigned a high prediction strength. High prediction strengths indicate that the structure seen in the initial data set is also seen in the independent data set. The prediction strengths, in fact, were quite high: The median PS was 0.61, and 74% of samples were above threshold (Fig. 4B). To assess these results, we performed the same analyses with random clusters. Such clusters consistently yielded predictors with poor accuracy in cross-validation and low prediction strength on the independent data set (Fig. 4B). On the basis of such analysis (36), the A1-A2 distinction can be seen to be meaningful, rather than simply a statistical artifact of the initial data set. The results thus show that the AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge.

We then sought to extend the class discovery by searching for finer subclasses of the leukemias. We used a SOM to divide the samples into four clusters (denoted B1 to B4). We subsequently obtained immunophenotype data on the samples and found that the four classes largely corresponded to AML, T-lineage ALL, B-lineage ALL, and B-lineage ALL, respectively (Fig. 4C). The four-cluster SOM thus divided the samples along another key biological distinction.

We again evaluated these classes by constructing class predictors (37). The four classes could be distinguished from one another, with the exception of B3 versus B4 (Fig 4D). The prediction tests thus confirmed the distinctions corresponding to AML, B-ALL, and T-ALL, and suggested that it may be appropriate to merge classes B3 and B4, composed primarily of B-lineage ALL.

The class discovery approach thus automatically discovered the distinction between AML and ALL, as well as the distinction between B-cell and T-cell ALL. These are the most important distinctions known among acute leukemias, both in terms of underlying biology and clinical treatment. With larger sample collections, it would be possible to search for finer subclassifications. It will be interesting to see whether they correspond to existing subclassifications for AML and ALL or define new groupings perhaps based on fundamental similarities in mechanism of transformation.

In principle, the class discovery techniques above can be used to identify fundamental subtypes of any cancer. In general, such studies will require careful experimental design to avoid potential experimental artifacts—especially in the case of solid tumors. Biopsy specimens, for example, might have gross differences in the proportion of surrounding stromal cells. Blind application of class discovery could result in identifying classes reflecting the proportion of stromal contamination in the samples, rather than underlying tumor biology. Such “classes” would be real and reproducible, but would not be of biological or clinical interest. Various approaches could be used to avoid such artifacts—such as microscopic examination of tumor samples to ensure comparability, purification of tumor cells by flow sorting or laser-capture microdissection, computational analysis that excludes genes expressed in stromal cells, and confirmation of candidate marker genes by RNA in situ hybridization or immunohistochemistry to tumor sections.

Class discovery methods could also be used to search for fundamental mechanisms that cut across distinct types of cancers. For example, one might combine different cancers (for example, breast tumors and prostate tumors) into a single data set, eliminate those genes that correlate strongly with tissue type, and then cluster the samples based on the remaining genes.

We also describe techniques for class prediction, whereby samples can be automatically assigned to already-recognized classes. Creation of a new predictor involves expression analysis of thousands of genes to select a set of informative genes (we used 50 genes, although other choices also performed well) and then validating the accuracy of the assignments made on the basis of these genes. Subsequent application of the predictor then requires only monitoring the expression level of these informative genes. We described a class predictor able to accurately assign samples as AML or ALL. We have also similarly constructed a class predictor that accurately assigns ALL samples as either T-ALL or B-ALL (38). These class predictors could be adapted to a clinical setting, with appropriate steps to standardize the protocol for sample preparation. We envisage such a test supplementing rather than replacing existing leukemia diagnostics. Indeed, this would provide an opportunity to gain clinical experience with the use of expression-based class predictors in a well-studied cancer, before applying them to cancers with less well-developed diagnostics.

More generally, class predictors may be useful in a variety of settings. First, class predictors can be constructed for known pathological categories—reflecting a tumor's cell of origin, stage, or grade. Such predictors could provide diagnostic confirmation or clarify unusual cases. This point was illustrated by a recent anecdotal experience. A patient with a classic leukemia presentation (pancytopenia, circulating “blasts”) was diagnosed with AML, but with atypical morphology. We took the opportunity to apply our class predictor to a bone marrow sample from this patient. The classifier produced extremely low vote totals for both AML and ALL: Neither lymphoid- nor myeloid-specific genes were highly expressed, thus bringing into question the diagnosis of acute leukemia. Examination of the expression profile revealed that genes more highly expressed relative to the leukemias included those encoding tropomyosin, muscle-specific actin, decorin, and IGF-2, suggestive of a mesenchymal origin, such as muscle (39). In fact, independent cytogenetic analysis identified a t(2;13)(q35;q14) translocation characteristic of the muscle tumor alveolar rhabdomyosarcoma (40). The patient's diagnosis was revised accordingly, and treatment was changed from AML therapy to rhabdomyosarcoma therapy. This experience underscores the fact that leukemia diagnosis remains imperfect and could benefit from a battery of expression-based predictors for various cancers.

Most importantly, the technique of class prediction can be applied to distinctions relating to future clinical outcome, such as drug response or survival. Class prediction provides an unbiased, general approach to constructing such prognostic tests, provided that one has a collection of tumor samples for which eventual outcome is known.

  • * To whom correspondence should be addressed. E-mail: golub{at}; lander{at}

  • These authors contributed equally to this work.


View Abstract

Navigate This Article