Research Article

The Consensus Coding Sequences of Human Breast and Colorectal Cancers


Science  13 Oct 2006:
Vol. 314, Issue 5797, pp. 268-274
DOI: 10.1126/science.1133427


The elucidation of the human genome sequence has made it possible to identify genetic alterations in cancers in unprecedented detail. To begin a systematic analysis of such alterations, we determined the sequence of well-annotated human protein-coding genes in two common tumor types. Analysis of 13,023 genes in 11 breast and 11 colorectal cancers revealed that individual tumors accumulate an average of ∼90 mutant genes but that only a subset of these contribute to the neoplastic process. Using stringent criteria to delineate this subset, we identified 189 genes (average of 11 per tumor) that were mutated at significant frequency. The vast majority of these genes were not known to be genetically altered in tumors and are predicted to affect a wide range of cellular functions, including transcription, adhesion, and invasion. These data define the genetic landscape of two human cancer types, provide new targets for diagnostic and therapeutic intervention, and open fertile avenues for basic research in tumor biology.

It is widely accepted that human cancer is a genetic disease caused by sequential accumulation of mutations in oncogenes and tumor suppressor genes (1). These tumor-specific (that is, somatic) mutations provide clues to the cellular processes underlying tumorigenesis and have proven useful for diagnostic and therapeutic purposes. To date, however, only a small fraction of the genes has been analyzed, and the number and type of alterations responsible for the development of common tumor types are unknown (2). In the past, the selection of genes chosen for mutational analyses in cancer has been guided by information from linkage studies in cancer-prone families, identification of chromosomal abnormalities in tumors, or known functional attributes of individual genes or gene families (2–4). With the determination of the human genome sequence and recent improvements in sequencing and bioinformatic approaches, it is now possible in principle to examine the cancer cell genome in a comprehensive and unbiased manner. Such an approach not only provides the means to discover other genes that contribute to tumorigenesis, but also can lead to mechanistic insights that are only evident through a systems biological perspective. Comprehensive genetic analyses of human cancers could lead to discovery of a set of genes, linked together through a shared phenotype, that point to the importance of specific cellular processes or pathways.

To begin the systematic study of the cancer genome, we examined a major fraction of human genes in two common tumor types, breast and colorectal cancers. These cancers were chosen for study because of their substantial clinical importance worldwide; together they account for ∼2.2 million cancer diagnoses (20% of the total) and ∼940,000 cancer deaths each year (14% of the total) (5). For genetic evaluation of these tumors, we focused on a set of protein-coding genes, termed the consensus coding sequences (CCDS), that represent the most highly curated gene set currently available. The CCDS Database (6) contains full-length protein-coding genes that have been defined by extensive manual curation and computational processing and have gene annotations that are identical among reference databases.

The goals of this study were (i) to develop a methodological strategy for conducting genome-wide analyses of cancer genes in human tumors, (ii) to determine the spectrum and extent of somatic mutations in human tumors of similar and different histologic types, and (iii) to identify new cancer genes and molecular pathways that could lead to improvements in diagnosis or therapy.

Cancer mutation discovery screen. The initial step toward achieving these goals was the development of methods for high-throughput identification of somatic mutations in cancers. These methods included those for primer design, polymerase chain reaction (PCR), sequencing, and mutational analysis (Fig. 1). The first component involved extraction of all protein-coding sequences from the CCDS genes. A total of 120,839 nonredundant exons and adjacent intronic sequences were obtained from 14,661 different transcripts in CCDS. These sequences were used to design primers for PCR amplification and sequencing of exons and adjacent splice sites. Primers were designed using a number of criteria to ensure robust amplification and sequencing of template regions (7). Although most exons could be amplified in a single PCR reaction, we found that exons larger than 350 base pairs (bp) were more effectively amplified as multiple overlapping amplicons. One member of every pair of PCR primers was tailed with a universal primer sequence for subsequent sequencing reactions. A total of 135,483 primer pairs encompassing ∼21 Mb of genomic sequence were designed in this manner (table S1).

Fig. 1.

Schematic of mutation discovery and validation screens.

Eleven cell lines or xenografts of each tumor type (breast and colorectal carcinomas) were used in the discovery screen (table S2, A and B). Two matching normal samples were used as controls to help identify normal sequence variations and amplicon-specific sequencing artifacts such as those associated with GC-rich regions. A total of ∼3 million PCR products were generated and directly sequenced, resulting in 465 Mb of tumor sequence.

Sequence data were assembled for each amplicon and evaluated for quality within the target region with the use of software specifically designed for this purpose (7). The target region of each exon included all coding bases as well as the four intronic bases at both the 5′ and 3′ ends that serve as the major splice recognition sites. For an amplicon to be considered successfully analyzed, we required that ≥90% of bases in the target region have a Phred quality score—defined as –10[log10(raw per-base error)]—of at least 20 in at least three-quarters of the tumor samples analyzed (8). This quality cut-off was chosen to provide high sensitivity for mutation detection while minimizing false positives. Using these criteria, 93% of the 135,483 amplicons and 90% of the total targeted bases in CCDS were successfully analyzed for potential alterations.
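The acceptance criterion above can be illustrated with a short sketch. The function names and the data layout (one list of per-base error probabilities per tumor sample) are assumptions made for illustration; this is not the software described in (7).

```python
import math

def phred(error_prob):
    """Phred quality score: -10 * log10(raw per-base error probability)."""
    return -10.0 * math.log10(error_prob)

def sample_passes(error_probs, min_q=20, min_frac=0.90):
    """True if at least min_frac of target-region bases reach Phred >= min_q."""
    good = sum(1 for p in error_probs if phred(p) >= min_q)
    return good >= min_frac * len(error_probs)

def amplicon_passes(per_sample_error_probs, min_sample_frac=0.75):
    """True if at least min_sample_frac of the tumor samples pass."""
    passed = sum(1 for probs in per_sample_error_probs if sample_passes(probs))
    return passed >= min_sample_frac * len(per_sample_error_probs)

# Hypothetical data: four tumor samples, 20 target bases each.
clean = [0.001] * 19 + [0.1]        # 95% of bases at Q30, one base at Q10
noisy = [0.001] * 10 + [0.1] * 10   # only 50% of bases reach Q20
print(amplicon_passes([clean, clean, clean, noisy]))  # True: 3 of 4 samples pass
print(amplicon_passes([clean, clean, noisy, noisy]))  # False: 2 of 4 samples pass
```

A Phred score of 20 corresponds to an estimated error probability of 1 in 100 for that base call.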

Examination of sequence traces from these amplicons revealed a total of 816,986 putative nucleotide changes. Because the vast majority of changes that did not affect the amino acid sequence (i.e., synonymous or silent substitutions) were likely to be nonfunctional, these changes were not analyzed further. The remaining 557,029 changes could represent germline variants, artifacts of PCR or sequencing, or bona fide somatic mutations. Several bioinformatic and experimental steps were used to distinguish among these possibilities. First, any alterations that were also present in either of the two normal samples included in the discovery screen were removed, as these were likely to represent common germline polymorphisms or sequence artifacts. Second, as these two normal control samples would be expected to contain only a subset of known variants, any change corresponding to a germline polymorphism found in single-nucleotide polymorphism (SNP) databases was also removed (7). Finally, the sequence trace of each potential alteration was visually inspected so as to remove false positive calls in the automated analysis. The combination of these data analysis efforts was efficient, removing ∼96% of the potential alterations and leaving 29,281 for further scrutiny (Fig. 1).

To ensure that the observed mutations did not arise artifactually during the PCR or sequencing steps, we independently reamplified and resequenced the regions containing them in the corresponding tumors. This step removed 9295 alterations. The regions containing the putative mutations were then sequenced in matched normal DNA samples to determine whether the mutations were truly somatic: 18,414 changes were observed to be present in the germ line of these patients, representing variants not currently annotated in SNP databases, and were excluded. As a final step, the remaining 1572 putative somatic mutations were carefully examined in silico to ensure that the alterations did not arise from mistargeted sequencing of highly related regions occurring elsewhere in the genome (7). Alterations in such duplicated regions may appear to be somatic when there is loss of one or both alleles of the target region in the tumor and when the selected primers closely match and therefore amplify similar areas of the genome. A total of 265 changes in closely related regions were excluded in this fashion, resulting in a total of 1307 confirmed somatic mutations in 1149 genes (Table 1).
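The counts reported for the successive filtering steps can be checked with simple arithmetic. The variable names below are shorthand chosen for this sketch; only the numbers come from the text.

```python
# Counts reported in the text for the discovery screen.
putative_changes = 816_986
after_dropping_silent = 557_029
after_bioinformatic_filters = 29_281

removed_on_resequencing = 9_295      # not reproduced on reamplification
germline_in_matched_normal = 18_414  # present in matched normal DNA
excluded_related_regions = 265       # mistargeted, highly related regions

putative_somatic = (after_bioinformatic_filters
                    - removed_on_resequencing
                    - germline_in_matched_normal)
confirmed_somatic = putative_somatic - excluded_related_regions
fraction_removed = 1 - after_bioinformatic_filters / putative_changes

print(putative_somatic)            # 1572
print(confirmed_somatic)           # 1307
print(f"{fraction_removed:.1%}")   # 96.4%
```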

Table 1.

Summary of somatic mutations. Numbers in parentheses refer to percentage of total mutations.

                                         Discovery screen*                      Validation screen†                     Both screens combined
                                         Colorectal   Breast       Total        Colorectal   Breast       Total        Colorectal   Breast       Total
Number of mutated genes                  519          673          1149         105          137          236          519          673          1149
Number of mutations                      574          733          1307         177          188          365          751          921          1672
Nonsynonymous mutations in coding sequences
    Missense                             482 (84.0)   600 (81.9)   1082 (82.8)  126 (71.2)   145 (77.1)   271 (74.2)   608 (81.0)   745 (80.9)   1353 (80.9)
    Nonsense                             35 (6.1)     39 (5.3)     74 (5.7)     26 (14.7)    8 (4.3)      34 (9.3)     61 (8.1)     47 (5.1)     108 (6.5)
    Insertion                            3 (0.5)      3 (0.4)      6 (0.5)      2 (1.1)      2 (1.1)      4 (1.1)      5 (0.7)      5 (0.5)      10 (0.6)
    Deletion                             18 (3.1)     48 (6.5)     66 (5.0)     10 (5.6)     13 (6.9)     23 (6.3)     28 (3.7)     61 (6.6)     89 (5.3)
    Duplication                          17 (3.0)     2 (0.3)      19 (1.5)     3 (1.7)      12 (6.4)     15 (4.1)     20 (2.7)     14 (1.5)     34 (2.0)
Mutations in noncoding sequences
    Splice site‡                         17 (3.0)     37 (5.0)     54 (4.1)     9 (5.1)      8 (4.3)      17 (4.7)     26 (3.5)     45 (4.9)     71 (4.2)
    UTR§                                 2 (0.3)      4 (0.5)      6 (0.5)      1 (0.6)      0 (0.0)      1 (0.3)      3 (0.4)      4 (0.4)      7 (0.4)
Nucleotides successfully analyzed (Mb)∥  208.5        209.2        417.7        28.7         34.3         63.0         237.2        243.5        480.7
Mutation frequency (mutations/Mb)        2.8          3.5          3.1          6.2          5.5          5.8          3.2          3.8          3.5

* Coding and adjacent noncoding regions of 13,023 CCDS genes were sequenced in 11 colorectal and 11 breast cancers.
† Genes mutated in the discovery screen were sequenced in 24 additional tumor samples of the affected tumor type.
‡ Intronic mutations within 4 bp of exon/intron boundary.
§ Mutations in untranslated regions (UTR) within 4 bp 5′ of initiation codon or 4 bp 3′ of termination codon.
∥ Nucleotides with Phred quality score of at least 20.

Validation screen. To evaluate the prevalence and spectrum of somatic mutations in these 1149 genes, we determined their sequence in additional tumors of the same histologic type (Fig. 1) (table S2, A and B). Genes mutated in at least one breast or colorectal tumor in the discovery screen were analyzed in 24 additional breast or colorectal tumors, respectively. This effort involved 453,024 additional PCR and sequencing reactions encompassing 77 Mb of tumor DNA. A total of 133,693 putative changes were identified in the validation screen. Methods similar to those used in the discovery screen were used to exclude silent changes, known and novel germline variants, false positives arising from PCR or sequencing artifacts, and apparent changes that were likely due to coamplification of highly related genes. Additionally, any changes corresponding to germline variants not found in SNP databases but identified in the discovery screen were excluded. The regions containing the remaining 4948 changes were reamplified and resequenced in the corresponding tumors (to ensure reproducibility) and in matched normal tissue to determine if they were somatic. An additional 365 somatic mutations in 236 genes were identified in this manner. In total, 921 and 751 somatic mutations were identified in breast and colorectal cancers, respectively (Fig. 1, Table 1, and table S4).

    Mutation spectrum. The great majority of the 1672 mutations observed in the discovery or validation screens were single base substitutions: 81% of the mutations were missense, 7% were nonsense, and 4% were altered splice sites (Table 1). The remaining 8% were insertions, deletions, and duplications ranging from 1 to 110 nucleotides in length. Although the fraction of mutations that were single base substitutions was similar in breast and colorectal cancers, the spectrum and nucleotide contexts of the substitution mutations were very different between the two tumor types. The most striking of these differences occurred at C:G base pairs: 59% of the 696 colorectal cancer mutations were C:G to T:A transitions, whereas only 7% were C:G to G:C transversions (Table 2 and table S3). In contrast, only 35% of the mutations in breast cancers were C:G to T:A transitions, whereas 29% were C:G to G:C transversions. In addition, a large fraction (44%) of the mutations in colorectal cancers were at 5′-CpG-3′ dinucleotide sites, but only 17% of the mutations in breast cancers occurred at such sites. This 5′-CpG-3′ preference led to an excess of nonsynonymous mutations, resulting in changes of arginine residues in colorectal cancers but not in breast cancers (fig. S1). In contrast, 31% of mutations in breast cancers occurred at 5′-TpC-3′ sites (or complementary 5′-GpA-3′ sites), whereas only 11% of mutations in colorectal cancers occurred at these dinucleotide sites. The differences noted above were all highly significant (P < 0.0001) (7) and have substantial implications for the mechanisms underlying mutagenesis in the two tumor types.
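As an illustration of how such a difference between tumor types can be tested, the sketch below applies a Pearson chi-square test to the C:G to T:A counts from Table 2 (both screens combined). This is a generic test chosen for illustration and is not necessarily the procedure used in (7); the function name is an assumption.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p

# C:G -> T:A transitions versus all other substitutions (Table 2).
colorectal_hits, colorectal_total = 413, 696
breast_hits, breast_total = 289, 838

chi2, p = chi2_2x2(colorectal_hits, colorectal_total - colorectal_hits,
                   breast_hits, breast_total - breast_hits)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
print(p < 0.0001)  # consistent with the significance level reported in the text
```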

Table 2.

Spectrum of single base substitutions. Base substitutions in coding sequences resulting in nonsynonymous changes as well as substitutions in noncoding sequences are included (see Table 1). Numbers in parentheses indicate percentage of total mutations.

                               Discovery screen                       Validation screen                      Both screens combined
                               Colorectal   Breast       Total        Colorectal   Breast       Total        Colorectal   Breast       Total
Total number of substitutions  535          678          1213         161          160          321          696          838          1534
Substitutions at C:G base pairs
    C:G → T:A                  325 (60.7)   230 (33.9)   555 (45.8)   88 (54.7)    59 (36.9)    147 (45.8)   413* (59.3)  289* (34.5)  702 (45.8)
    C:G → G:C                  36 (6.7)     207 (30.5)   243 (20.0)   12 (7.5)     32 (20.0)    44 (13.7)    48* (6.9)    239* (28.5)  287 (18.7)
    C:G → A:T                  70 (13.1)    110 (16.2)   180 (14.8)   23 (14.3)    38 (23.8)    61 (19.0)    93 (13.4)    148 (17.7)   241 (15.7)
Substitutions at T:A base pairs
    T:A → C:G                  42 (7.9)     54 (8.0)     96 (7.9)     14 (8.7)     18 (11.3)    32 (10.0)    56 (8.0)     72 (8.6)     128 (8.3)
    T:A → G:C                  38 (7.1)     30 (4.4)     68 (5.6)     13 (8.1)     5 (3.1)      18 (5.6)     51 (7.3)     35 (4.2)     86 (5.6)
    T:A → A:T                  24 (4.5)     47 (6.9)     71 (5.9)     11 (6.8)     8 (5.0)      19 (5.9)     35 (5.0)     55 (6.6)     90 (5.9)
Substitutions at specific dinucleotides†
    5′-CpG-3′                  254 (47.5)   115 (17.0)   369 (30.4)   55 (34.2)    24 (15.0)    79 (24.6)    309* (44.4)  139* (16.6)  448 (29.2)
    5′-TpC-3′                  54 (10.1)    235 (34.7)   289 (23.8)   25 (15.5)    22 (13.8)    47 (14.6)    79* (11.4)   257* (30.7)  336 (21.9)

* Values in this category were significantly different between breast and colorectal cancers (P < 0.0001).
† Includes substitutions at the C or G of the 5′-CpG-3′ dinucleotide, the C of the 5′-TpC-3′ dinucleotide, or the G of the 5′-GpA-3′ dinucleotide.

Distinction between passenger and nonpassenger mutations. Somatic mutations in human tumors can arise either through selection of functionally important alterations (via their effect on net cell growth) or through accumulation of nonfunctional “passenger” alterations that arise during repeated rounds of cell division in the tumor or in its progenitor stem cell. In light of the relatively low rates of mutation in human cancer cells (9, 10), a distinction between selected and passenger mutations is generally not required when the number of genes and tumors analyzed is small. In large-scale studies, however, such distinctions are of paramount importance (11, 12). For example, it has been estimated that nonsynonymous passenger mutations are present at a frequency no higher than ∼1.2 per Mb of DNA in cancers of the breast or colon (13–15). Because we assessed 542 Mb of tumor DNA, we would therefore have expected to observe ∼650 passenger mutations. We actually observed 1672 mutations (Table 1), many more than what would have been predicted to occur by chance (P < 1 × 10⁻¹⁰) (7). Moreover, the frequency of mutations in the validation screen was significantly higher than in the discovery screen (5.8 versus 3.1 mutations per Mb, P < 1 × 10⁻¹⁰; Table 1). The mutations in the validation screen were also enriched for nonsense, insertion, deletion, duplication, and splice site changes relative to the discovery screen; each of these would be expected to have a functional effect on the encoded proteins.
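The expected number of passenger mutations follows directly from the quoted rate and the amount of DNA assessed. The sketch below reproduces that arithmetic and, as an illustrative assumption, uses a normal approximation to a Poisson count to show how far the observed total lies above expectation; the test actually used is described in (7) and may differ.

```python
import math

passenger_rate_per_mb = 1.2   # upper estimate quoted in the text
tumor_dna_mb = 542            # Mb of tumor DNA assessed
observed = 1672               # somatic mutations in Table 1

expected = passenger_rate_per_mb * tumor_dna_mb
print(round(expected))        # 650

# Normal approximation to the Poisson upper tail (an assumption of this sketch).
z = (observed - expected) / math.sqrt(expected)
p_upper = 0.5 * math.erfc(z / math.sqrt(2))
print(f"z = {z:.1f}")
print(p_upper < 1e-10)        # True
```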

    To distinguish genes likely to contribute to tumorigenesis from those in which passenger mutations occurred by chance, we first excluded genes that were not mutated in the validation screen. We next developed statistical methods to estimate the probability that the number of mutations in a given gene was greater than expected from the background mutation rate. For each gene, this analysis incorporated the number of somatic alterations observed in either the discovery or validation screens, the number of tumors studied, and the number of nucleotides successfully analyzed (as indicated by the number of bases with Phred quality scores of ≥20). Because the mutation frequencies varied with nucleotide type and context and were different in breast versus colorectal cancers (Table 2), these factors were included in the calculations. The output of this analysis was a cancer mutation prevalence (CaMP) score for each gene analyzed. The CaMP score reflects the probability that the number of mutations observed in a gene reflects a mutation frequency that is higher than that expected to be observed by chance given the background mutation rate; its derivation is based on principles described in (7). The use of the CaMP score for analysis of somatic mutations is analogous to the use of the lod score for linkage analysis in familial genetic settings. For example, 90% of the genes with CaMP scores of >1.0 are predicted to have mutation frequencies higher than the background mutation frequency.
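The idea of scoring a gene against a background mutation rate can be sketched with a simple Poisson tail probability. This is a deliberately simplified stand-in, not the CaMP score: the actual calculation in (7) accounts for nucleotide type, sequence context, and tumor type, none of which is modeled here. The function names, the example gene, and its numbers are hypothetical.

```python
import math

def poisson_upper_tail(k, lam, terms=200):
    """P(X >= k) for X ~ Poisson(lam), summed term by term to avoid cancellation."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)
    return total

def enrichment_score(n_mutations, bases_analyzed, n_tumors, rate_per_base):
    """-log10 of the chance of seeing >= n_mutations under the background rate."""
    lam = rate_per_base * bases_analyzed * n_tumors
    return -math.log10(poisson_upper_tail(n_mutations, lam))

# Hypothetical gene: 3000 bases analyzed in 35 tumors, 4 somatic mutations,
# background rate of 1.2 mutations per Mb (1.2e-6 per base).
score = enrichment_score(4, 3000, 35, 1.2e-6)
print(f"{score:.2f}")
```

A higher value indicates that the observed mutation count is less compatible with the assumed background rate; a single flat rate is the main simplification relative to the approach described in the text.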

    Candidate cancer genes. A complete list of the somatic mutations identified in this study is provided in table S4. Validated genes with CaMP scores greater than 1.0 were considered to be candidate cancer genes (CAN genes). The combination of experimental validation and statistical calculation thereby yielded four nested sets of genes: Of 13,023 genes evaluated, 1149 were mutated, 236 were validated, and 189 were CAN genes. Among these, the CAN genes were most likely to have been subjected to mutational selection during tumorigenesis. There were 122 and 69 CAN genes identified in breast and colorectal cancers, respectively (tables S5 and S6). Individual breast cancers examined in the discovery screen harbored an average of 12 (range 4 to 23) mutant CAN genes, whereas the average number of CAN genes in colorectal cancers was 9 (range 3 to 18) (table S3). Interestingly, each cancer specimen of a given tumor type carried its own distinct CAN-gene mutational signature, as no cancer had more than six mutant CAN genes in common with any other cancer (tables S4 to S6).

    CAN genes could be divided into three classes: (i) genes previously observed to be mutationally altered in human cancers, (ii) genes in which no previous mutations in human cancers had been discovered but had been linked to cancer through functional studies, and (iii) genes with no previous strong connections to neoplasia.

    The reidentification of genes that had been previously shown to be somatically mutated in cancers represented a critical validation of the approach used in this study. All of the CCDS genes previously shown to be mutated in >10% of either breast or colorectal cancers were found to be CAN genes in the current study. These included TP53 (2), APC (2), KRAS (2), SMAD4 (2), and FBXW7 (CDC4) (16) (tables S4 to S6). In addition, we identified mutations in genes whose mutation prevalence in sporadic cancers was rather low. These genes included EPHA3 (17), MRE11A (18), NF1 (2), SMAD2 (19, 20), SMAD3 (21), TCF7L2 (TCF4) (22), BRCA1 (2), and TGFBRII (23). We also detected mutations in genes that had been previously found to be altered in human tumors but not in the same tumor type identified in this study. These included GNAS (guanine nucleotide binding protein, α stimulating) (24), KEAP1 (kelch-like ECH-associated protein) (25), RET (a proto-oncogene) (2), and TCF1 (a transcription factor) (26). Finally, we found mutations in a number of genes that have been previously identified as targets of translocation or amplification in human cancers. These included NUP214 (a nucleoporin) (2), KTN1 (a kinesin receptor) (27), DDX10 (DEAD box polypeptide 10) (28), GLI1 (glioma-associated oncogene homolog 1) (29), and MTG8 (the translocation target gene of runt-related transcription factor 1, RUNX1T1) (2). We conclude that if these genes had not already been shown to play a causative role in human tumors, they would have been discovered through the approach taken in this study. By analogy, the 167 other CAN genes in tables S5 and S6 are likely to play important roles in breast, colorectal, and perhaps other types of cancers.

    Although genetic alterations currently provide the most reliable indicator of a gene's importance in human neoplasia (1, 30), many other genes are thought to play key roles on the basis of functional or expression studies. Our study provides genetic evidence supporting the importance of several of these genes in neoplasia. For example, we discovered intragenic mutations in EPHB6 (an ephrin receptor) (31), MLL3 (mixed-lineage leukemia 3) (32), GSN (gelsolin) (33), CDH10 and CDH20 (cadherins), FLNB (actin and SMAD binding protein filamin B) (34), PTPRD (protein tyrosine phosphatase receptor) (35), and AMFR (autocrine motility factor receptor) (36).

    In addition to these two classes of genes, our study revealed a large number of genes that had not been strongly suspected to be involved in cancer. This third class of genes included PKHD1 (polycystic kidney and hepatic disease 1), GUCY1A2 (guanylate cyclase 1), TBX22 (a transcription factor), SEC8L1 (an exocyst complex component), TTLL3 (a tubulin tyrosine ligase), ATP8B1 (an ATP-dependent transporter), CUBN (an intrinsic factor-cobalamin receptor), DBN1 (an actin binding protein), and TECTA (tectorin α). In addition, seven CAN genes corresponded to genes for which no biologic role has yet been established.

    We examined the distribution of mutations within CAN-gene products to see whether clustering occurred in specific regions or functional domains. In addition to the well-documented hotspots in TP53 (37) and KRAS (38), we identified three mutations in GNAS in colorectal cancers that affected a single amino acid residue (Arg201). Alterations of this residue have previously been shown to lead to constitutive activation of the encoded heterotrimeric guanine nucleotide–binding protein (G protein) αs through inhibition of guanosine triphosphatase (GTPase) activity (24). Two mutations in the EGF-like gene EGFL6 in breast tumors affected the same nucleotide position and resulted in a Leu508 → Phe change in the MAM adhesion domain. A total of seven genes had alterations located within five amino acid residues of each other, and an additional 12 genes had clustering of multiple mutations within a specific protein domain (13 to 78 amino acids apart). Thirty-one of 40 of these changes affected residues that were evolutionarily conserved. Although the effects of these alterations are unknown, their clustering suggests specific roles for the mutated regions in the neoplastic process.

CAN-gene groups. An unbiased screen of a large set of genes can provide insights into pathogenesis that would not be apparent through single-gene mutational analysis. This has been exemplified by large-scale mutagenesis screens in experimental organisms (39–41). We therefore attempted to assign each CAN gene to a functional group based on Gene Ontology (GO) molecular function or biochemical process groups, the presence of specific INTERPRO sequence domains, or previously published literature (Table 3 and Fig. 2). Several of the groups identified in this way were of special interest. For example, 22 of the 122 (18%) breast CAN genes and 13 of the 69 (19%) colorectal CAN genes were transcriptional regulators. At least one of these genes was mutated in more than 80% of the tumors of each type. Zinc-finger transcription factors were particularly highly represented (eight genes mutated collectively in 43% of breast cancer samples). Similarly, genes involved in cell adhesion represented ∼22% of CAN genes and affected more than two-thirds of tumors of either type. Genes involved in signal transduction represented ∼23% of CAN genes, and at least one such gene was mutated in 77% and 94% of the breast and colorectal cancer samples, respectively. Subsets of these groups were also of interest and included metalloproteinases (part of the cell adhesion and motility group and mutated in 37% of colorectal cancers) and G proteins and their regulators (part of the signal transduction group and altered in 43% of breast cancers). These data suggest that dysregulation of specific cellular processes is genetically selected during neoplasia and that distinct members of each group may serve similar roles in different tumors.

    Fig. 2.

    Mutation frequency of CAN-gene groups. CAN genes were grouped by function with the use of Gene Ontology groups, INTERPRO domains, and available literature. Bars indicate the fraction of tumors (35 breast or 35 colorectal) with at least one mutated gene in the functional group.

    Table 3.

    Functional classification of CAN genes, with CaMP score to the right of each gene name. CAN genes were assigned to functional classes using Gene Ontology (GO) groups, INTERPRO domains, and available literature. Representative GO groups and INTERPRO domains are listed for each class.

    Breast cancers    Colorectal cancers
    Cellular adhesion and motility (examples: cytoskeletal protein binding GO:0008092, cell adhesion GO:0007155, metallopeptidase activity GO:0008237)
    FLNB 3.4 TMPRSS6 2.0 RAPH1 1.4 PKHD1 3.5 CNTN4 1.6
    MYH1 2.7 COL11A1 1.8 PCDHB15 1.4 ADAMTSL3 3.3 CHL1 1.3
    SPTAN1 2.6 DNAH9 1.7 CMYA1 1.4 OBSCN 3.0 HAPLN1 1.2
    DBN1 2.5 OBSCN 1.7 MACF1 1.3 ADAMTS18 2.7 MGC33407 1.2
    TECTA 2.4 COL7A1 1.5 SYNE2 1.3 MMP2 2.3 MAP2 1.0
    ADAM12 2.3 MAGEE1 1.5 NRCAM 1.1 TTLL3 2.2
    GSN 2.2 CDH10 1.5 COL19A1 1.1 EVL 2.0
    CDH20 2.2 SULF2 1.5 SEMA5B 1.1 ADAM29 2.0
    BGN 2.1 CNTN6 1.4 ITGA9 1.1 CSMD3 1.9
    ICAM5 2.1 THBS3 1.4 ADAMTS15 1.8
    Signal transduction (examples: intracellular signaling cascade GO:0007242, receptor activity GO:0004872, GTPase regulator GO:0030695)
    VEPH1 2.1 PFC 1.5 PRPF4B 1.3 APC >10 PTPRD 2.2
    SBNO1 2.1 GAB1 1.5 CENTG1 1.3 KRAS >10 MCP 2.1
    DNASE1L3 1.9 ARHGEF4 1.4 MAP3K6 1.3 EPHA3 4.2 NF1 1.9
    RAP1GA1 1.8 NALP8 1.4 APC2 1.3 GUCY1A2 3.5 PTPRU 1.4
    EGFL6 1.8 RGL1 1.4 STARD8 1.2 EPHB6 3.5 CD109 1.3
    AMFR 1.7 PPM1E 1.4 PTPN14 1.1 TGFBR2 2.9 PHIP 1.2
    CENTB1 1.7 PKDREJ 1.4 IRTA2 1.1 GNAS 2.6
    GPNMB 1.7 CNNM4 1.3 RASGRF2 1.1 RET 2.3
    INHBE 1.7 ALS2CL 1.3 MTMR3 1.1 P2RY14 2.2
    FLJ10458 1.6 RASAL2 1.3 LGR6 2.2
    Transcriptional regulation (examples: regulation of transcription GO:0045449, zinc finger C2H2-subtype IPR007066)
    TP53 >10 CHD5 1.8 ZFP64 1.4 TP53 >10 ZNF442 1.9
    FLJ13479 3.4 CIC 1.7 ZNF569 1.4 SMAD4 4.6 SMAD3 1.9
    SIX4 2.5 KEAP1 1.6 EHMT1 1.3 MLL3 3.7 EYA4 1.5
    KIAA0934 2.5 HOXA3 1.6 ZFYVE26 1.2 TBX22 3.3 PKNOX1 1.4
    LRRFIP1 2.4 TCF1 1.6 BCL11A 1.1 SMAD2 3.1 MKRN3 1.3
    GLI1 2.3 HDAC4 1.6 ZNF318 1.1 TCF7L2 2.8
    RFX2 2.1 MYOD1 1.5 HIST1H1B 2.5
    ZCSL3 1.8 NCOA6 1.5 RUNX1T1 2.4
    Transport (examples: ion transporter activity GO:0015075, ligand-gated ion channel activity GO:0015276, carrier activity GO:0005386)
    ATP8B1 3.1 ABCB8 1.7 ABCB10 1.4 ABCA1 2.8 C6orf29 1.1
    CUBN 2.5 KPNA5 1.7 SCNN1B 1.3 SLC29A1 1.9
    GRIN2D 2.4 ABCA3 1.7 NUP133 1.1 SCN3B 1.9
    HDLBP 2.2 SLC9A2 1.6 P2RX7 1.3
    NUP214 1.8 SLC6A3 1.5 KCNQ5 1.2
    Cellular metabolism (examples: aromatic compound metabolism GO:0006725, generation of precursor metabolites GO:0016445, biosynthesis GO:0009058)
    ACADM 2.0 NCB50R 1.7 PHACS 1.4 UQCRC2 1.9
    PRPS1 1.8 ASL 1.6 XDH 1.3 ACSL5 1.6
    CYP1A1 1.7 GALNT5 1.4 GALNS 1.2
    Intracellular trafficking (examples: endoplasmic reticulum targeting sequence IPR000866, membrane fusion GO:0006944)
    OTOF 2.2 PLEKHA8 1.8 KTN1 1.5 SYNE1 2.3 PRKD1 1.9
    LRBA 2.1 LOC283849 1.7 GGA1 1.4 SEC8L1 2.2 LRP2 1.2
    AEGP 1.8 SORL1 1.7 SDBCAG84 2.2
    RNA metabolism (examples: RNA processing GO:0008353, RNA splice site selection GO:0006376)
    C14orf155 3.3 RNU31P2 1.7 KIAA0427 1.5 SFRS6 1.3
    SP110 1.8 C22orf19 1.5 DDX10 1.3
    Other (examples: response to DNA damage stimulus GO:0006974, protein ubiquitination GO:0016567)
    FLJ40869 2.1 SERPINB1 1.4 FBXW7 5.1 K6IRS3 1.2
    BRCA1 2.0 UHRF2 1.5 CD248 1.2
    MRE11A 1.6 LMO7 1.3 ERCC6 1.0
    KIAA1632 2.4 KIAA0999 1.3 C10orf137 2.7 KIAA1409 1.6
    MGC24047 2.1 LOC157697 2.0 C15orf2 1.0

    Discussion. Four important points have emerged from this comprehensive mutational analysis of human cancer. First, a relatively large number of previously uncharacterized CAN genes exist in breast and colorectal cancers, and these genes can be discovered by unbiased approaches such as that used in our study. These results support the notion that large-scale mutational analyses of other tumor types will prove useful for identifying genes not previously known to be linked to human cancer.

Second, our results suggest that the number of mutational events occurring during the evolution of human tumors from a benign to a metastatic state is much larger than previously thought. We found that colorectal and breast cancers harbor an average of 52 and 67 nonsynonymous somatic mutations in CCDS genes, of which an average of 9 and 12, respectively, were in CAN genes (table S3). These data can be used to estimate the total number of nonsynonymous mutations in coding genes that arise in a “typical” cancer through sequential rounds of mutation and selection. If we assume that the mutation prevalence in genes that have not yet been sequenced is similar to that of the genes so far analyzed, we estimate that there are 81 and 105 mutant genes (average 93) in the typical colorectal or breast cancer, respectively (7). Of these, an average of 14 and 20, respectively, would be expected to be CAN genes. In addition to the CAN genes, there were other mutated CCDS genes that were likely to have been selected for during tumorigenesis but were not altered at a frequency high enough to warrant confidence in their interpretation.

    A third point emerging from our study is that breast and colorectal cancers show substantial differences in their mutation spectra. In colorectal cancers, a bias toward C:G to T:A transitions at 5′-CpG-3′ sites was previously noted in TP53 (42). Our results suggest that this bias is genome-wide rather than representing a selection for certain nucleotides within TP53. This bias may reflect a more extensive methylation of 5′-CpG-3′ dinucleotides in colorectal cancers than in breast cancers, or it may be an effect of dietary carcinogens (43, 44). In breast cancers, the fraction of mutations at 5′-TpC-3′ sites was far higher in the CCDS genes examined in this study than previously reported for TP53 (37). It has been noted that a small fraction of breast tumors may have a defective repair system, resulting in 5′-TpC-3′ mutations (15). Our studies confirm that some breast cancers have higher fractions of 5′-TpC-3′ mutations than others, but also show that mutations at this dinucleotide are generally more frequent than in colorectal cancers (Table 2 and table S3).

    Finally, our results reveal that there are substantial differences in the panel of CAN genes mutated in the two tumor types (Table 3). For example, metalloproteinase genes were mutated in a large fraction of colorectal but only in a small fraction of breast cancers (tables S5 and S6). Transcriptional regulator genes were mutated in a high fraction of both breast and colorectal tumors, but the specific genes affected varied according to tumor type (Table 3). There was also considerable heterogeneity among the CAN genes mutated in different tumor specimens derived from the same tissue type (tables S4 to S6). It has been documented that virtually all biochemical, biological, and clinical attributes are heterogeneous within human cancers of the same histologic subtype (45). Our data suggest that differences in the genes mutated in various tumors could account for a major part of this heterogeneity. This might explain why it has been so difficult to correlate the behavior, prognosis, or response to therapy of common solid tumors with the presence or absence of a single gene alteration; such alterations reflect only a small component of each tumor's mutational composition. On the other hand, disparate genes contributing to cancer are often functionally equivalent, affecting net cell growth through the same molecular pathway (1). Thus, TP53 and MDM2 mutations exert comparable effects on cells, as do mutations in RB1, CDKN2A (p16), CCND1, and CDK4. It will be of interest to determine whether a limited number of pathways include most CAN genes, a possibility consistent with the groupings in Fig. 2 and Table 3.

    Like a draft version of any genome project, our study has limitations. First, only genes present in the current version of CCDS were analyzed; of the genes not yet included, there are ∼5000 genes for which excellent supporting evidence exists (46). Second, we were not able to successfully sequence ∼10% of the bases within the coding sequences of the 13,023 CCDS genes (equivalent to 1302 unsequenced genes). Third, our identification of genes mutated at significant frequencies assumed that the background mutation frequency was constant throughout the genome. Although it cannot currently be determined whether certain genomic regions have higher background mutation frequencies, we have included the number of mutations observed per Mb sequenced in tables S5 and S6 to facilitate such analyses in the future. Fourth, although our screen would be expected to identify the most common types of mutations found in cancers, some genetic alterations—including mutations in noncoding genes, mutations in noncoding regions of coding genes, relatively large deletions or insertions, amplifications, and translocations—would not be detectable by the methods we used. Future studies using a combination of different technologies, such as those envisioned by the Cancer Genome Atlas Project (47), will be able to address these issues.

    The results of this study inform future cancer genome sequencing efforts in several important ways:

    1. A major technical challenge of such studies will be discerning somatic mutations from the large number of sequence alterations identified. In our study, 557,029 nonsynonymous sequence alterations were detected in the discovery screen, but after subsequent analyses only 0.23% of these were identified as legitimate somatic mutations (Fig. 1). Fewer than 10% of the putative nonsynonymous alterations were known polymorphisms; many of the rest were uncommon germline variants or sequence artifacts that were not reproducible. Inclusion of matched normal samples and sequencing both strands of each PCR product would reduce false positives in the discovery screen but would increase the cost of sequencing by a factor of 4. Although recently developed sequencing methods could reduce the cost of such studies in the future (48), the higher error rates of these approaches may result in an even lower ratio of bona fide somatic mutations to putative alterations.

    2. Another technical issue is that careful design of primers is important to eliminate sequence artifacts due to the inadvertent amplification and sequencing of related genes. The primer pairs that resulted in successful amplification and sequencing represent a valuable resource in this regard. Even with well-designed primers, it is essential to examine any observed mutation to ensure that it is not found as a normal variant in a related gene.

    3. Although it is likely that studies of other solid tumor types will also identify a large number of somatic mutations, it will be important to apply rigorous approaches to identify those mutations that have been selected for during tumorigenesis. Statistical techniques, such as those used in this study or described by Greenman et al. (11), can provide strong evidence for selection of mutated genes. These approaches are likely to improve as more cancer genomic sequencing data are accumulated through the Cancer Genome Atlas Project (47) and other projects now under way.

    4. There has been much discussion about which genes should be the focus of future sequencing efforts. Our results suggest that many genes not previously implicated in cancer are mutated at significant levels and may provide novel clues to pathogenesis. From these data, it would seem that large-scale unbiased screens of coding genes may be more informative than screens based on previously defined criteria.

    5. The results also raise questions about the optimum number of tumors of any given type that should be assessed in a cancer genome study. Our study was designed to determine the nature and types of alterations present in an “average” breast or colorectal cancer and to discover genes mutated at reasonably high frequencies. With this design, our power to detect genes mutated in more than 20% of tumors of a given type was 90%, but only 50% of genes mutated in 6% of tumors would have been discovered. Detection of genes mutated in 6% or 1% of tumors with >99% probability in a discovery screen would require sequence determination of at least 75 or 459 tumors, respectively. Although it will be impossible to detect all mutations that may occur in tumors, strategies that would identify the most important ones at an affordable cost can be envisioned on the basis of the data and analysis reported herein.

    6. Ultimately, the sequences of entire cancer genomes, including intergenic regions, will be obtainable. Our studies demonstrate the inherent difficulties in determining the consequences of somatic mutations, even those that alter the amino acid sequence of highly annotated and well-studied genes. Establishing the consequences of mutations in noncoding regions of the genome will likely be much more difficult. Until new tools for solving this problem become available, it is likely that gene-centric sequencing analyses of cancer will be more useful than whole-genome sequencing.
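
    The power figures quoted in point 5 can be reproduced with a simple calculation. The sketch below is an illustration under stated assumptions, not necessarily the procedure the authors used: it assumes that tumors are independent, that a gene counts as "detected" if at least one tumor in the discovery screen carries a mutation in it, and that each discovery screen comprised 11 tumors of a given type. The function names are hypothetical.

```python
import math


def detection_power(mutation_frequency: float, n_tumors: int) -> float:
    """Probability that at least one of n_tumors carries a mutation in a gene
    that is mutated in a fraction mutation_frequency of tumors.

    Assumes independent tumors and detection on a single mutated tumor
    (an assumption of this sketch, not a statement of the authors' method).
    """
    return 1.0 - (1.0 - mutation_frequency) ** n_tumors


def tumors_needed(mutation_frequency: float, target_power: float) -> int:
    """Smallest number of tumors for which detection_power reaches target_power."""
    return math.ceil(
        math.log(1.0 - target_power) / math.log(1.0 - mutation_frequency)
    )


if __name__ == "__main__":
    # Power of an 11-tumor discovery screen at the two frequencies in point 5.
    for freq in (0.20, 0.06):
        print(f"frequency {freq:.0%}: power {detection_power(freq, 11):.1%}")

    # Sample sizes needed to reach 99% power at lower mutation frequencies.
    for freq in (0.06, 0.01):
        print(f"frequency {freq:.0%}: {tumors_needed(freq, 0.99)} tumors")
```

    Under these assumptions the model gives roughly 91% power at a 20% mutation frequency and about 49% at 6%, and it requires 75 and 459 tumors to reach 99% power at frequencies of 6% and 1%, in line with the values stated above.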

    Our results provide a large number of future research opportunities in human cancer. For genetics, it will be of interest to elucidate the timing and extent of CAN-gene mutations in breast and colorectal cancers, whether these genes are mutated in other tumor types, and whether germline variants in CAN genes are associated with cancer predisposition. For immunology, the finding that tumors contain an average of ∼90 different amino acid substitutions not present in any normal cell can provide novel approaches to engender antitumor immunity. For epidemiology, the remarkable difference in mutation spectra of breast and colorectal cancers suggests the existence of organ-specific carcinogens. For cancer biology, it is clear that no current animal or in vitro model of cancer recapitulates the genetic landscape of an actual human tumor. Understanding and capturing this landscape and its heterogeneity may provide models that more successfully mimic the human disease. For epigenetics, it is possible that a subset of CAN genes can also be dysregulated in tumors through changes in chromatin or DNA methylation rather than through mutation. For diagnostics, the CAN genes define a relatively small subset of genes that could prove useful as markers for neoplasia. Finally, some of these genes, particularly those on the cell surface or those with enzymatic activity, may prove to be good targets for therapeutic development.

    Supporting Online Material

    Materials and Methods

    Figs. S1 and S2

    Tables S1 to S6


    References and Notes
