Research Article

Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics

Science  04 Oct 2013:
Vol. 342, Issue 6154,
DOI: 10.1126/science.1235587

You are currently viewing the abstract.

View Full Text

Via your Institution

Log in through your institution

Log in through your institution

Structured Abstract


Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in noncoding regions, is a current challenge. This is especially the case for somatic variants in cancer genomes, a large proportion of which are noncoding.

Embedded Image

Prioritization of candidate noncoding cancer drivers based on patterns of selection. (Step 1) Filter somatic variants to exclude 1000 Genomes polymorphisms; (2) retain variants in noncoding annotations; (3) retain those in “sensitive” regions; (4) prioritize those disrupting a transcription-factor binding motif and (5) residing near the center of a biological network; (6) prioritize ones in annotation blocks mutated in multiple cancer samples.


We investigated patterns of selection in DNA elements from the ENCODE project using the full spectrum of variants from 1092 individuals in the 1000 Genomes Project (Phase 1), including single-nucleotide variants (SNVs), short insertions and deletions (indels), and structural variants (SVs). Although we analyzed broad functional annotations, such as all transcription-factor binding sites, we focused more on highly specific categories such as distal binding sites of factor ZNF274. The greater statistical power of the Phase 1 data set compared with earlier ones allowed us to differentiate the selective constraints on these categories. We also used connectivity information between elements from protein-protein-interaction and regulatory networks. We integrated all the information on selection to develop a workflow (FunSeq) to prioritize personal-genome variants on the basis of their deleterious impact. As a proof of principle, we experimentally validated and characterized a few candidate variants.


We identified a specific subgroup of noncoding categories with almost as much selective constraint as coding genes: “ultrasensitive” regions. We also uncovered a number of clear patterns of selection. Elements more consistently active across tissues and both maternal and paternal alleles (in terms of allele-specific activity) are under stronger selection. Variants disruptive because of mechanistic effects on transcription-factor binding (i.e. “motif-breakers”) are selected against. Higher network connectivity (i.e. for hubs) is associated with higher constraint. Additionally, many hub promoters and regulatory elements show evidence of recent positive selection. Overall, indels and SVs follow the same pattern as SNVs; however, there are notable exceptions. For instance, enhancers are enriched for SVs formed by nonallelic homologous recombination. We integrated these patterns of selection into the FunSeq prioritization workflow and applied it to cancer variants, because they present a strong contrast to inherited polymorphisms. In particular, application to ~90 cancer genomes (breast, prostate and medulloblastoma) reveals nearly a hundred candidate noncoding drivers.


Our approach can be readily used to prioritize variants in cancer and is immediately applicable in a precision-medicine context. It can be further improved by incorporation of larger-scale population sequencing, better annotations, and expression data from large cohorts.

Identifying Important Identifiers

Each of us has millions of sequence variations in our genomes. Signatures of purifying or negative selection should help identify which of those variations is functionally important. Khurana et al. (1235587) used sequence polymorphisms from 1092 humans across 14 populations to identify patterns of selection, especially in noncoding regulatory regions. Noncoding regions under very strong negative selection included binding sites of some chromatin and general transcription factors (TFs) and core motifs of some important TF families. Positive selection in TF binding sites tended to occur in network hub promoters. Many recurrent somatic cancer variants occurred in noncoding regulatory regions and thus might indicate mutations that drive cancer.


Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

  • A full list of participants and institutions is available in the supplementary materials.

View Full Text

Cited By...