Research Article

The Genomic Landscapes of Human Breast and Colorectal Cancers


Science  16 Nov 2007:
Vol. 318, Issue 5853, pp. 1108-1113
DOI: 10.1126/science.1145720

Abstract

Human cancer is caused by the accumulation of mutations in oncogenes and tumor suppressor genes. To catalog the genetic changes that occur during tumorigenesis, we isolated DNA from 11 breast and 11 colorectal tumors and determined the sequences of the genes in the Reference Sequence database in these samples. Based on analysis of exons representing 20,857 transcripts from 18,191 genes, we conclude that the genomic landscapes of breast and colorectal cancers are composed of a handful of commonly mutated gene “mountains” and a much larger number of gene “hills” that are mutated at low frequency. We describe statistical and bioinformatic tools that may help identify mutations with a role in tumorigenesis. These results have implications for understanding the nature and heterogeneity of human cancers and for using personal genomics for tumor diagnosis and therapy.

Discovery of the genes mutated in human cancer has provided key insights into the mechanisms underlying tumorigenesis and has proven useful for the design of a new generation of targeted approaches for clinical intervention (1). With the determination of the human genome sequence and improvements in sequencing and bioinformatic technologies, systematic analyses of genetic alterations in human cancers have become possible (2–4).

Using such large-scale approaches, we recently studied the genomes of breast and colorectal cancers by determining the sequence of the Consensus Coding Sequence (CCDS) genes, a collection of the best-annotated protein-coding genes (5). In this study, we have extended these analyses to include examination of all of the Reference Sequence (RefSeq) genes. The RefSeq database is a comprehensive, nonredundant collection of annotated gene sequences that represents a consolidation of gene information from all major gene databases (6). The RefSeq database is believed to include the great majority of human gene sequences and represents the gold standard in the field.

Sequencing strategy. The first step in our approach was the design of primers that would permit polymerase chain reaction (PCR)-based amplification and analysis of coding exons in the RefSeq database. Of the 20,857 transcripts in the RefSeq database (representing 18,191 distinct genes), 14,661 transcripts were included in the CCDS set. These CCDS genes were in general not evaluated again; the only exceptions were a small subset in which particular regions of interest had been difficult to amplify, and for these, new PCR primers were designed. For the remaining 6196 RefSeq transcripts, 125,624 primers were designed and used to amplify the coding exons. The entire list of primers used to amplify the exons of the RefSeq genes (including the CCDS genes) is provided in table S1.

The primers were used to PCR-amplify and sequence the DNA from 11 breast and 11 colorectal cancers, as well as DNA from matched normal tissues of two patients. The samples used for this analysis were the same as those used in the previous study of CCDS genes (5). The sequence data from this Discovery Screen were assembled and evaluated using stringent quality criteria (7), resulting in successful analysis of 93% of targeted amplicons. We used bioinformatic and experimental strategies to distinguish germline variants and artifacts of PCR or sequencing from true somatic mutations (fig. S1). Genetic alterations found in the two normal samples and those present in SNP databases were removed and sequence traces of the remaining potential alterations were visually inspected to remove false-positive calls in the automated analysis. After these steps, the amplicons of the remaining alterations were re-amplified from the tumor DNA (to ensure reproducibility) and from DNA of matched normal tissue (to remove un-annotated germline variants). Finally, the putative somatic mutations were examined “in silico” (by computer analysis) to ensure that the alterations did not occur as a result of mistargeted amplification of related regions of the genome (7).
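The filtering cascade described above can be outlined roughly as follows. This is only an illustrative sketch, not the authors' pipeline (which is detailed in (7)): the `Call` record, its field names, and the boolean flags standing in for visual trace inspection, re-amplification, and the in silico specificity check are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One candidate alteration in a tumor (hypothetical record layout)."""
    gene: str
    position: int
    change: str
    passed_visual_inspection: bool = True    # stand-in for manual trace review
    confirmed_in_tumor: bool = True          # reproduced on re-amplification of tumor DNA
    present_in_matched_normal: bool = False  # also seen in the patient's normal DNA
    unique_genomic_target: bool = True       # stand-in for the in silico specificity check


def variant_key(call):
    """Identify a variant by gene, position, and base change."""
    return (call.gene, call.position, call.change)


def filter_somatic(calls, normal_sample_variants, snp_database):
    """Apply the filters in the order given in the text; return surviving calls."""
    known_germline = set(normal_sample_variants) | set(snp_database)
    kept = []
    for call in calls:
        if variant_key(call) in known_germline:
            continue  # seen in the normal samples or in SNP databases
        if not call.passed_visual_inspection:
            continue  # false positive from automated calling
        if not call.confirmed_in_tumor:
            continue  # not reproducible on re-amplification
        if call.present_in_matched_normal:
            continue  # un-annotated germline variant
        if not call.unique_genomic_target:
            continue  # mistargeted amplification of a related region
        kept.append(call)
    return kept
```

Ordering the cheap database lookups before the steps that require new laboratory work mirrors the sequence in the text, where re-amplification is performed only on alterations that survive the earlier filters.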

To further evaluate the genes with somatic mutations in the Discovery Screen, we determined their sequence in a Validation Screen of 24 additional samples of the same tumor type in which the mutation was originally identified. Methods similar to those noted above were used to exclude germline variants, PCR and sequencing artifacts, and alterations due to mistargeted amplification of related genomic regions. Amplicons with putative somatic mutations were reamplified in DNA from the tumor and from matched normal tissues to determine whether the alterations were truly somatic.

Somatic mutations. Combining the data from the current analysis with those previously obtained in CCDS genes, we found that 1718 genes (9.4% of the 18,191 genes analyzed) had at least one nonsilent mutation in either a breast or colorectal cancer (Table 1 and table S3). The great majority of alterations were single-base substitutions (92.7%), with 81.9% resulting in missense changes, 6.5% resulting in stop codons, and 4.3% resulting in alterations of splice sites or untranslated regions immediately adjacent to the start and stop codons (Table 1). The remaining somatic mutations were insertions, deletions, or duplications (7.3%). The mutation spectrum of colorectal cancers differed from that of breast cancers, and these spectra were similar to those observed in the previous CCDS study and in other analyses (4, 5). In this study, we analyzed the nature of the nonsynonymous mutations in more detail and found a very large excess of C to T transitions at 5′-CpG-3′ in colorectal cancers, representing 19 times as many as expected from the representation of 5′-CpG-3′ sites in the coding regions of the genome. Similarly, there was a marked excess of G to C transversions at 5′-GpA-3′ sites in breast cancers, representing 4.5 times as many as expected (7).

Table 1.

Summary of somatic mutations. UTR, untranslated region. ND, not determined because synonymous mutations were not evaluated in the RefSeq genes analyzed in (5).


Passenger mutation rates. The somatic mutations found in cancers are either “drivers” or “passengers” (4). Driver mutations are causally involved in the neoplastic process and are positively selected for during tumorigenesis. Passenger mutations provide no positive or negative selective advantage to the tumor but are retained by chance during repeated rounds of cell division and clonal expansion.

We used two independent methods to estimate the passenger mutation rates in the analyzed cancers. First, we evaluated 23.8 Mb of chromosome 8 in 11 colorectal cancer samples similar to those used in the Discovery Screen. This was performed with high-density oligonucleotide microarrays containing every possible single-base pair substitution. The tumors used for this analysis each had only one allele of chromosome 8 [i.e., they showed loss of heterozygosity (LOH)], rendering the detection of sequence alterations sensitive and reliable. A total of 151 somatic mutations were identified in 262 Mb of tumor DNA, and all but one of these were located in noncoding regions. Thus, there were a total of 0.6 noncoding mutations per Mb analyzed (95% confidence interval: 0.52 to 0.64 mutations/Mb). Because only one copy of chromosome 8 was analyzed in these studies, the noncoding mutation rate per diploid genome was inferred to be 1.2 mutations/Mb. We then performed detailed LOH analyses of the 11 tumors used in the Discovery Screen using 317,503 polymorphisms. An average of 16% of polymorphic alleles showed LOH. It is known from studies of human genetic variation that the frequency of nonsynonymous (amino acid–changing) mutations is approximately half that of mutations in noncoding regions (8, 9). After correcting for LOH and the difference in mutation rates between noncoding and nonsynonymous mutations, these analyses result in an estimated passenger mutation rate of 0.55 nonsynonymous mutations per Mb of tumor DNA in colorectal cancers (7). We consider this a minimum estimate as the ratio of mutations in noncoding regions to nonsynonymous mutations in coding regions is likely to be higher in the germ line than in tumors because of greater negative selection for mutations in coding regions in the germ line. Although we have not directly measured mutation rates in noncoding sequences in breast cancers, Stephens et al. have estimated that the rate of nonsynonymous mutations in breast cancers is 0.33 per Mb, and we used this as our minimum estimate for this tumor type (10).
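The arithmetic behind the colorectal estimate can be retraced with a short sketch. The input numbers are those reported in the text; the form of the LOH correction is an assumption made here for illustration (two copies outside LOH regions, one copy inside), since the actual correction is described in (7). Under that assumption, and starting from the rounded per-copy rate of 0.6 mutations/Mb, the result matches the reported 0.55.

```python
# Values reported in the text.
mutations_found = 151            # somatic mutations on chromosome 8
mb_analyzed = 262.0              # Mb of tumor DNA examined (one copy per tumor)
reported_rate_per_copy = 0.6     # noncoding mutations per Mb, as rounded in the text
loh_fraction = 0.16              # average fraction of polymorphic alleles showing LOH
nonsyn_to_noncoding = 0.5        # nonsynonymous rate is about half the noncoding rate

# Unrounded per-copy rate, for comparison with the rounded 0.6.
raw_rate_per_copy = mutations_found / mb_analyzed


def passenger_rate(rate_per_copy, loh_fraction, nonsyn_to_noncoding):
    """Nonsynonymous passenger mutations per Mb of tumor DNA.

    ASSUMPTION (not from the source): regions without LOH carry two copies,
    regions with LOH carry one, so the mean copy number is 2 - loh_fraction.
    """
    mean_copies = 2.0 * (1.0 - loh_fraction) + 1.0 * loh_fraction
    return rate_per_copy * mean_copies * nonsyn_to_noncoding


estimate = passenger_rate(reported_rate_per_copy, loh_fraction, nonsyn_to_noncoding)
print(f"raw per-copy rate: {raw_rate_per_copy:.3f} mutations/Mb")
print(f"estimated passenger rate: {estimate:.2f} nonsynonymous mutations/Mb")
```

Starting from the unrounded per-copy rate instead of 0.6 gives a slightly lower figure, so this sketch should be read as a consistency check on the reported numbers rather than as the authors' calculation.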

Estimates of the passenger mutation rates were also obtained through the quantification of synonymous (silent) mutations in this study. Because most synonymous changes are expected to be biologically inert and thereby not selected for or against during tumorigenesis, such changes can be used as a tool to estimate passenger mutation rates (11). The analysis of synonymous mutations provided two estimates of the nonsynonymous mutation rate (7). One estimate was based on the ratio of nonsynonymous to synonymous mutations observed in the human germ line (8, 9). The second estimate was derived by calculating the expected ratio of nonsynonymous to synonymous changes after accounting for codon usage of RefSeq genes and the different mutation spectra observed in colorectal and breast cancers. We considered this estimate to be a maximum because it did not take into account that nonsynonymous mutations that retard cell growth will be selected against during tumorigenesis.

Evaluating mutated genes. The mutational data obtained can be used to identify candidate cancer genes (CAN-genes) that are most likely to be drivers and are therefore most worthy of further investigation. In this study, we considered a gene to be a CAN-gene if it harbored at least one nonsynonymous mutation in both the Discovery and Validation Screens and if the total number of mutations per nucleotide sequenced exceeded a minimum threshold (7). Using these criteria, we identified a total of 280 CAN-genes, equally distributed between colorectal and breast cancers (table S4, A and B, respectively). The 280 CAN-genes listed in table S4, A and B, included most of the 191 CAN-genes identified in Sjöblom et al. (5) but differed by virtue of the inclusion of 114 new CAN-genes identified in the additional 6196 transcripts sequenced, the removal of data from a breast tumor with an abnormally high passenger mutation rate, the use of an experimental rather than statistical definition of CAN-genes, and additional evaluation of mutations in samples that had undergone whole-genome amplification (7).
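The two-part CAN-gene definition can be expressed as a small predicate. This is a minimal sketch of the criteria as stated in the text; the function name, argument names, and in particular the threshold value are hypothetical, because the actual minimum threshold is specified in (7) and not in this article.

```python
def is_can_gene(discovery_nonsyn, validation_nonsyn, total_mutations,
                nucleotides_sequenced, min_mutations_per_nucleotide):
    """Return True if a gene meets both criteria described in the text.

    discovery_nonsyn / validation_nonsyn: nonsynonymous mutation counts per screen.
    total_mutations: all mutations counted for the gene across both screens.
    nucleotides_sequenced: bases of the gene successfully sequenced in all tumors.
    min_mutations_per_nucleotide: placeholder threshold (hypothetical value).
    """
    if nucleotides_sequenced <= 0:
        raise ValueError("nucleotides_sequenced must be positive")
    # Criterion 1: at least one nonsynonymous mutation in each screen.
    if discovery_nonsyn < 1 or validation_nonsyn < 1:
        return False
    # Criterion 2: mutation density above the minimum threshold.
    return total_mutations / nucleotides_sequenced > min_mutations_per_nucleotide
```

Requiring a hit in both screens means that a gene mutated several times in the Discovery Screen alone would not qualify, which is what makes the definition experimental rather than purely statistical.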

It is reasonable to assume that genes that are mutated more frequently than predicted by chance are more likely to be drivers. In this study, we used a more sophisticated version of a metric, called the cancer mutation prevalence (CaMP) score, to rank genes by the number and nature of the mutations observed (table S4, A and B). To assess the likelihood that each of these genes is mutated at a frequency higher than the passenger mutation rate, we devised a method based on Empirical Bayes simulations (7). Though the likelihoods depend on the passenger rates (table S4, A and B), the rankings of the genes by CaMP scores are similar regardless of the assumed passenger mutation rates (rank correlations > 0.9). CaMP scores thereby provide priorities for future studies that are independent of many of the assumptions required to calculate passenger probabilities.
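The stability of gene rankings under different assumed passenger rates is summarized in the text by rank correlations. As a generic illustration of that statistic (not the authors' Statistical Analysis Package), the following sketch computes Spearman's rank correlation between two lists of scores for the same genes. It assumes there are no tied scores; the score lists in the usage lines are made-up toy values.

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two equally long, tie-free score lists."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Toy scores for five hypothetical genes under two assumed passenger rates.
low_rate_scores = [9.1, 7.4, 5.0, 3.2, 1.1]
high_rate_scores = [8.0, 6.9, 3.9, 4.1, 0.7]
rho = spearman_rho(low_rate_scores, high_rate_scores)
```

A value close to 1 indicates that the ordering of genes barely changes between the two scenarios, which is the sense in which the rankings are said to be insensitive to the assumed passenger rate.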

To determine the mutation prevalence of a subset of CAN-genes with more precision, we analyzed 40 CAN-genes in a separate cohort of 96 patients with colorectal cancers (7). The genes chosen were in biological pathways of interest to our groups and included those ranked 1st to 119th by CaMP scores. Colorectal cancers, rather than breast tumors, were chosen because more purified tumor tissues of this type were available. Twenty-five of the 40 genes (62%) were found to be mutated in at least one of the 96 cancers and, as predicted from our data and simulations, most were mutated in 5% or less of the cancers (table S5). The remaining 15 CAN-genes were not mutated in any of the additional 96 cancers studied, but this finding is still compatible with these genes being mutated in a low but significant fraction of tumors; the evaluation of more colorectal tumors than the 131 included in our study would be necessary to exclude this possibility.
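The point that an absence of mutations in 96 tumors does not rule out a low mutation prevalence can be made concrete with a simple binomial calculation. This sketch assumes that tumors are independent and that every mutation present would be detected; both are simplifications introduced here, not claims from the study.

```python
def prob_no_mutation(prevalence, n_tumors):
    """Probability that none of n_tumors carries a mutation in a gene
    whose true mutation prevalence is `prevalence` (independent tumors)."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    return (1.0 - prevalence) ** n_tumors


for prevalence in (0.01, 0.02, 0.05):
    p_zero = prob_no_mutation(prevalence, 96)
    print(f"prevalence {prevalence:.0%}: P(no mutation in 96 tumors) = {p_zero:.3f}")
```

For a gene mutated in 1% of tumors, the chance of seeing no mutation in 96 samples is well over one in three, so a negative result in a cohort of this size is uninformative for genes in that prevalence range.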

Additional analyses of mutated genes. Mutation frequency is not the only type of information that can help determine whether a mutated gene is worthy of further evaluation. The analyses of the predicted effects on protein function can add independent evidence helpful for prioritization of specific genes and mutations for future research. For example, mutations producing stop codons, out-of-frame insertions or deletions, or splice site abnormalities are very likely to interfere with the normal function of the gene product (tables S3 and S4). To evaluate missense changes, we used two sequence-based methods for evaluating the probability that a specific alteration would have a deleterious effect on protein function: Sorting Intolerant from Tolerant (SIFT) and LogR.E-values based on Pfam domains (7). These probabilities are listed for each evaluable mutation identified in our study in table S3. For each CAN-gene, the number of missense mutations that were predicted to disrupt function in a statistically significant manner is included in table S4.

Predictions about the functional effects of mutations can also be made at the structural level. We generated structural models for 622 of the RefSeq gene mutations from x-ray crystallography or nuclear magnetic resonance spectroscopy of their encoded or related proteins (12, 13). Some of the models were intriguing in that they showed clustering of mutations around active sites of proteins or near an interface residue (examples in Fig. 1). We also used LS-SNP software (14) to predict the likelihood that each mutation would destabilize the protein, interfere with the formation of a domain-domain interface, or have an effect on protein-ligand binding (table S3, summarized for CAN-genes in table S4).

Fig. 1.

Clustering of somatic mutations in protein structures. Individual somatic mutations were mapped onto structural homology models on the basis of known crystal structure information. Homology models were built with MODPIPE (33) and graphics were created with UCSF Chimera software (34). Yellow spheres indicate mutated residues. (A) Two somatic mutations in the glycosylation enzyme GALNT5 occur in residues on different sides of the enzyme active site. Stick models indicate enzyme substrates. (B) Three somatic mutations in the transglutaminase TGM3 located at nearby surface regions of the protein (two mutations are present at the same residue on the right-hand side).

Finally, we identified a number of mutations that occurred at locations identical to those of genes involved in hereditary human diseases or that clustered at adjacent locations in the cancers analyzed. Such alterations are likely to have functional effects on these proteins. These included the R360W mutation (substitution of arginine 360 with tryptophan) in the RET tyrosine kinase, corresponding to an identical loss-of-function germline change in Hirschsprung disease (15). Likewise, the R1624W mutation in the PKHD1 gene in colorectal cancer is identical to that observed in polycystic kidney disease, a syndrome that has neoplastic features (16). The T745M mutation (substitution of threonine 745 with methionine) in the cell adhesion gene CRB1 is identical to one that has been shown to be a cause of retinitis pigmentosa (17). In addition to these examples, we identified 126 mutations in 39 proteins that occurred within a distance of 10 amino acids from one another. In particular, mutations in at least two independent tumors occurred in the DTNB, EDD1, GNAS, and TGM3 genes at exactly the same residue, implicating that region as vital to the protein's potential tumorigenic function.
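One way to flag mutations that fall within 10 amino acids of one another is sketched below. The input format, function name, and example records are hypothetical, and the approach (sorting positions within each protein and comparing neighbors) is a generic technique rather than the method used in the study.

```python
from collections import defaultdict


def clustered_mutations(mutations, max_distance=10):
    """Group mutations lying within max_distance residues of another
    mutation in the same protein.

    mutations: iterable of (protein, residue_position, tumor_id) tuples.
    Returns {protein: sorted list of (residue_position, tumor_id)}.
    """
    by_protein = defaultdict(list)
    for protein, position, tumor in mutations:
        by_protein[protein].append((position, tumor))

    clustered = {}
    for protein, sites in by_protein.items():
        sites.sort()
        hits = set()
        # After sorting, any mutation with a partner within max_distance
        # also has an adjacent neighbor within max_distance.
        for (pos_a, tumor_a), (pos_b, tumor_b) in zip(sites, sites[1:]):
            if pos_b - pos_a <= max_distance:
                hits.add((pos_a, tumor_a))
                hits.add((pos_b, tumor_b))
        if hits:
            clustered[protein] = sorted(hits)
    return clustered


# Toy input with placeholder protein and tumor names.
example = [
    ("PROT_A", 100, "T1"),
    ("PROT_A", 105, "T2"),
    ("PROT_A", 300, "T3"),
    ("PROT_B", 50, "T1"),
    ("PROT_C", 77, "T4"),
    ("PROT_C", 77, "T5"),  # same residue in two independent tumors
]
result = clustered_mutations(example)
```

Mutations at exactly the same residue in different tumors have a distance of zero and are therefore reported as a cluster, matching the special case highlighted in the text.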

Analysis of mutated pathways. It is becoming increasingly clear that pathways rather than individual genes govern the course of tumorigenesis (1). Mutations in any of several genes of a single pathway can thereby cause equivalent increases in net cell proliferation. Accordingly, we devised a method to determine whether the genes within specific pathways were mutated more often than predicted by chance. The resultant “pathway CaMP” score incorporated the total number of mutations from all genes within each group, the number of different genes mutated, the combined sizes of the genes in each group, and the total number of tumors examined (table S6) (7).

Using this metric, we analyzed a highly curated database (Metacore, GeneGo, Inc.) that includes human protein-protein interactions, signal transduction and metabolic pathways, and a variety of cellular functions and processes. By including the number of mutated genes in addition to the total number of mutations as parameters, we excluded pathways that simply contained one gene that was mutated at high frequency (e.g., pathways containing only TP53 mutations). There were 108 pathways that were found to be preferentially mutated in breast tumors. Many of the pathways involved phosphatidylinositol 3-kinase (PI3K) signaling (Fig. 2 and table S6B). Mutations in PIK3CA are frequent in multiple tumor types, including breast cancers (18–21). In this study, we identified mutations not only in PIK3CA, but also previously unreported mutations in GAB1, IKBKB, IRS4, NFKB1, NFKBIA, NFKBIE, PIK3R1, PIK3R4, and RPS6KA3, implicating both the PI3K pathway in general and nuclear factor κB (NF-κB) signaling in particular in breast tumorigenesis. Within the 38 colorectal cancer pathways that appeared to be mutated in a statistically significant manner, there were also many that centered on PI3K (table S6A). The pathway components mutated in colorectal cancers differed from those in breast, with mutations found in IRS2, IRS4, PIK3R5, PRKCZ, PTEN, RHEB, and RPS6KB1 in addition to PIK3CA. Additional pathways altered in colorectal cancer were related to cell adhesion, the cytoskeleton, and the extracellular matrix (table S6A), supporting the idea that interactions between the cancer cell and the extracellular environment are important steps in the neoplastic process.

Fig. 2.

PI3K pathway mutations in breast and colorectal cancers. The identities and relationships of genes that function in PI3K signaling are indicated. Circled genes have somatic mutations in colorectal (red) and breast (blue) cancers. The number of tumors with somatic mutations in each mutated protein is indicated by the number adjacent to the circle. Asterisks indicate proteins with mutated isoforms that may play similar roles in the cell. These include insulin receptor substrates IRS2 and IRS4; phosphatidylinositol 3-kinase regulatory subunits PIK3R1, PIK3R4, and PIK3R5; and NF-κB regulators NFKB1, NFKBIA, and NFKBIE.

Finally, there were nine examples of mutated genes whose protein products were predicted to interact with other mutated genes more often than predicted by chance. The average number of mutant gene products with which these nine mutant genes interacted was 25 (table S6). These results illustrate the potential utility of pathway-based analyses and highlight a variety of different gene groups and pathways that can help focus further investigations on these tumor types.

The genomic landscapes of colorectal and breast cancers. The colorectal and breast cancers analyzed in the Discovery Screen contained a median of 76 and 84 nonsilent mutations in RefSeq genes, respectively (table S2). The number of mutations per tumor was similar among colorectal tumors (ranging from 49 to 111) but was more variable in breast cancers (varying from 38 to 193). The number of mutated CAN-genes per tumor averaged 15 and 14 in colorectal and breast cancers, respectively.

The “landscapes” of typical colorectal and breast cancer genomes are depicted in Fig. 3. In these landscapes, every RefSeq gene is represented by a point on a two-dimensional map corresponding to its chromosomal position, and all mutated genes in that tumor are indicated by a dot. The relief feature of the map is provided by the CAN-genes with the 60 highest CaMP scores (table S4). Just as topographical maps contain geological features of varying elevations, the cancer genome landscape consists of relief features (mutated genes) with heterogeneous heights (determined by CaMP scores). There are a few “mountains” representing individual CAN-genes mutated at high frequency. However, the landscapes contain a much larger number of “hills” representing the CAN-genes that are mutated at relatively low frequency. It is notable that this general genomic landscape (few gene mountains and many gene hills) is a common feature of both breast and colorectal tumors.

Fig. 3.

Cancer genome landscapes. Nonsilent somatic mutations are plotted in two-dimensional space representing chromosomal positions of RefSeq genes. The telomere of the short arm of chromosome 1 is represented in the rear left corner of the green plane and ascending chromosomal positions continue in the direction of the arrow. Chromosomal positions that follow the front edge of the plane are continued at the back edge of the plane of the adjacent row, and chromosomes are appended end to end. Peaks indicate the 60 highest-ranking CAN-genes for each tumor type, with peak heights reflecting CaMP scores (7). The dots represent genes that were somatically mutated in the individual colorectal (Mx38) (A) or breast tumor (B3C) (B) displayed. The dots corresponding to mutated genes that coincided with hills or mountains are black with white rims; the remaining dots are white with red rims. The mountain on the right of both landscapes represents TP53 (chromosome 17), and the other mountain shared by both breast and colorectal cancers is PIK3CA (upper left, chromosome 3).

Discussion. The results reported here add to those published previously (5) in several important ways. First, we report the sequences of an additional 5168 genes in 22 tumors. These new data provide a much more complete picture of the cancer genome, allowing us to formulate landscapes of breast and colorectal tumors (Fig. 3). We predict that the key features of this landscape—a few gene mountains interspersed with many gene hills—will prove to be a general feature of most solid tumors. Second, we present data on noncoding and synonymous mutations in addition to nonsynonymous mutations. As well as providing information useful for estimating the passenger rate, the data in table S2 show that passenger rates vary considerably from tumor to tumor, undoubtedly determined by their intrinsic mutability and the number of generations and bottlenecks through which they have evolved. Third, we present more sophisticated methods for identifying and classifying genes with more mutations than predicted by the passenger rate (table S4). Fourth, we present a variety of tools based on gene products' sequence and structure, as well as their inclusion in certain pathways, that can help identify mutated genes that are most deserving of further attention (Figs. 1 and 2 and tables S3, S4, and S6). These tools can be used to prioritize the research that follows cancer genome-sequencing efforts.

In terms of such research, it is important to note that sequence data can inform other, independent approaches to the study of cancer genes. For example, chromodomain helicase DNA binding domain 5 (CHD5) was recently proposed to be a tumor suppressor on the basis of its functional properties and copy-number alterations (22). We identified somatic mutations in this gene in breast tumors; the combined data strongly support a role for this gene in tumorigenesis. Similarly, the NF-κB pathway member IKBKE was recently suggested to be a breast cancer oncogene on the basis of functional and expression studies (23). We found somatic mutations in several additional components of this signaling pathway (Fig. 2), reinforcing its importance in breast cancers. The transglutaminase (TGM) enzymes have recently been implicated in invasion and metastasis (24), and we identified multiple somatic mutations in TGM3 in colorectal cancers (Fig. 1). Additionally, a high-throughput retroviral insertional mutagenesis screen in mouse mammary tumor virus (MMTV)-induced mammary tumors in mice identified 33 common insertion sites as potential oncogenes (25); we found 7 of these 33 genes to be mutated in breast cancers. Given the entirely independent nature of these screens (insertional mutagenesis in mouse versus mutational analysis of human genes), the overlap of these results is remarkable.

Historically, the focus of cancer research has been on gene mountains, in part because they were the only alterations identifiable with available technologies. The ability to analyze the sequence of virtually all protein-encoding genes in cancers has shown that the vast majority of mutations in cancers, including those that are most likely to be drivers, do not occur in such mountains and emphasizes the heterogeneity and complexity of human neoplasia. This new view of cancer is consistent with the idea that a large number of mutations, each associated with a small fitness advantage, drive tumor progression (26). But is it possible to make sense out of this complexity? When all the mutations that occur in different tumors are summed, the number of potential driver genes is large. But this is likely to actually reflect changes in a much more limited number of pathways, numbering no more than 20 (1). This interpretation is consistent with virtually all screens in model organisms, which have generally shown that the same phenotype can arise from alterations in any of several genes. Other recent studies lend support to this interpretation. For example, sequencing studies of the kinome in large numbers of tumors have shown that specific kinases are sometimes mutated in a small fraction of tumors of a given type (4, 10, 27–29). We cannot be certain that the bulk of the low-frequency mutations observed in our study are not passengers. However, in the kinome studies, the position of mutations within the activation loop and the demonstrated effects of the target residues on kinase function unambiguously implicate many of these rare mutations as drivers. Similarly, recent analyses of myelomas suggest that there are multiple genes, each mutated in a small proportion of tumors, that can alter the same signal transduction pathway (30, 31). Furthermore, some of the low-frequency mutations observed in our study, such as activating mutations in the guanine nucleotide binding protein GNAS and a homozygous nonsense mutation in BRCA1-associated protein (BAP1), are likely to be functional (table S3). These examples, in addition to those in table S6, bolster the argument that infrequent mutations can be drivers and that they function through pathways that are already known.

Regardless of whether this pathway-centric interpretation is correct, it is clear that the “easy” part of future cancer genome research will be the identification of genetic alterations. The vast majority of subtle mutations in individual patients' tumors can now be identified with existing technology (Fig. 3), making personal cancer genomics a reality. Though understanding the precise role of these genetic alterations in tumorigenesis will be more challenging, opportunities for exploiting such personal genomic data on cancers are already apparent. For example, many of the genes altered in breast cancers appear to affect the NF-κB pathway (table S6), suggesting that drugs targeting this pathway could be efficacious in breast cancers with such mutations (30, 31). Furthermore, our data indicate that individual breast and colorectal cancers each contain ∼80 amino acid–altering mutations that are absent in all normal cells, providing a wealth of opportunities for personalized immunotherapy. Finally, any mutation identified in an individual cancer, whether driver or passenger, can be used as an exquisitely specific biomarker to guide patient management (32).

Supporting Online Material

www.sciencemag.org/cgi/content/full/1145720/DC1

Materials and Methods

Statistical Analysis Package

Fig. S1

Tables S1 to S6

References
