Report

Identification and characterization of essential genes in the human genome


Science  27 Nov 2015:
Vol. 350, Issue 6264, pp. 1096-1101
DOI: 10.1126/science.aac7041

Zeroing in on essential human genes

More powerful genetic techniques are helping to define the list of genes required for the life of a human cell. Two papers used the CRISPR genome editing system and a gene trap method in haploid human cells to screen for essential genes (see the Perspective by Boone and Andrews). Wang et al.'s analysis of multiple cell lines indicates that it may be possible to find tumor-specific dependencies on particular genes. Blomen et al. investigate the phenomenon in which nonessential genes are required for fitness in the absence of another gene. Hence, complexity rather than robustness is the human strategy.

Science, this issue p. 1096 and p. 1092; see also p. 1028

Abstract

Large-scale genetic analysis of lethal phenotypes has elucidated the molecular underpinnings of many biological processes. Using the bacterial clustered regularly interspaced short palindromic repeats (CRISPR) system, we constructed a genome-wide single-guide RNA library to screen for genes required for proliferation and survival in a human cancer cell line. Our screen revealed the set of cell-essential genes, which was validated with an orthogonal gene-trap–based screen and comparison with yeast gene knockouts. This set is enriched for genes that encode components of fundamental pathways, are expressed at high levels, and contain few inactivating polymorphisms in the human population. We also uncovered a large group of uncharacterized genes involved in RNA processing, a number of whose products localize to the nucleolus. Last, screens in additional cell lines showed a high degree of overlap in gene essentiality but also revealed differences specific to each cell line and cancer type that reflect the developmental origin, oncogenic drivers, paralogous gene expression pattern, and chromosomal structure of each line. These results demonstrate the power of CRISPR-based screens and suggest a general strategy for identifying liabilities in cancer cells.

The systematic identification of essential genes in microorganisms has provided critical insights into the molecular basis of many biological processes (1). Similar studies in human cells have been hindered by the lack of suitable tools. Moreover, little is known about how the set of cell-essential genes differs across cell types and genotypes. Differentially essential genes are likely to encode tissue-specific modulators of key cellular processes and important targets for cancer therapies. We used two independent approaches for inactivating genes at the DNA level to define the cell-essential genes of the human genome.

The first approach uses the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9–based gene editing system, which has emerged as a powerful tool to engineer the genomes of cultured cells and whole organisms (2, 3). We and others have shown that lentiviral single-guide RNA (sgRNA) libraries can enable pooled loss-of-function screens and have used the technology to uncover mediators of drug resistance and pathogen toxicity (4–6). To systematically identify cell-essential genes, we constructed a library optimized for high cleavage activity and performed a proliferation-based screen in the near-haploid human KBM7 chronic myelogenous leukemia (CML) cell line (Fig. 1, table S1, and supplementary text S1).

Fig. 1 Two approaches for genetic screening in human cells.

(Top) CRISPR/Cas9 method. Cells are transduced with a genome-wide sgRNA lentiviral library. Gene inactivation via Cas9-mediated genomic cleavage is directed by the 20–base pair (bp) sequence at the 5′ end of the sgRNA. Cells bearing sgRNAs targeting essential genes are depleted in the final population. (Bottom) Gene-trap method. KBM7 cells are transduced with a gene-trap retrovirus that integrates in an inactivating or “harmless” orientation at random genomic loci. Essential genes contain fewer insertions in the inactivating orientation. Sample data for two neighboring genes—RPL14, encoding an essential ribosomal protein, and ZNF619, encoding a dispensable zinc finger protein—are displayed. For CRISPR/Cas9, sgRNAs are plotted according to their target position along each gene, with the height of each bar indicating the level of depletion. Boxes indicate individual exons. For gene trap, the intronic insertion sites in each gene are plotted according to their orientation and genomic position. The height of each point is randomized.

The unusual karyotype of these cells also allows for an independent method of genetic screening. In this approach, null mutants are generated at random through retroviral gene-trap mutagenesis, selected for a phenotype, and monitored by sequencing the viral integration sites to pinpoint the causal genes (7). Positive selection–based screens by use of this method have identified genes underlying processes such as epigenetic silencing and viral infection (7–9). We extended this technique by developing a strategy for negative selection and conducted a screen for cell-essential genes (Fig. 1 and supplementary text S2).

For both methods, we computed a score for each gene that reflects the fitness cost imposed by inactivation of the gene. We defined the CRISPR score (CS) as the average log2 fold-change in the abundance of all sgRNAs targeting a given gene, with replicate experiments showing a high degree of reproducibility [correlation coefficient (r) = 0.90] (Fig. 2A, fig. S1A, and table S2). Of the 18,166 genes targeted by the library, 1878 scored as essential for optimal proliferation in our screen, although this precise number depends on the cutoff chosen (Fig. 2A and tables S2 and S3). Overall, this fraction represents ~10% of genes within our data set or roughly 9.2% of the entire genome (many of the genes not targeted by our library encode olfactory receptors that are unlikely to be cell-essential). Gene products that act in a non-cell-autonomous manner are not expected to score as essential in this pooled setting (fig. S1B).
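The CRISPR score defined above can be illustrated with a minimal Python sketch. The text specifies only that the CS is the average log2 fold-change in abundance of all sgRNAs targeting a gene; the read counts, the pseudocount, and the normalization to total library reads used here are assumptions made for illustration and may differ from the pipeline used in the study.

```python
import math
from collections import defaultdict

def crispr_scores(initial_counts, final_counts, sgrna_to_gene, pseudocount=1.0):
    """Average log2 fold-change in sgRNA abundance for each gene.

    initial_counts, final_counts: dict of sgRNA id -> read count.
    sgrna_to_gene: dict of sgRNA id -> targeted gene.
    The pseudocount and total-read normalization are illustrative assumptions.
    """
    total_initial = sum(initial_counts.values())
    total_final = sum(final_counts.values())
    per_gene = defaultdict(list)
    for sgrna, gene in sgrna_to_gene.items():
        # Express each count as a fraction of its library before taking the ratio.
        before = (initial_counts.get(sgrna, 0) + pseudocount) / total_initial
        after = (final_counts.get(sgrna, 0) + pseudocount) / total_final
        per_gene[gene].append(math.log2(after / before))
    # The gene-level score is the mean over all sgRNAs targeting that gene.
    return {gene: sum(lfcs) / len(lfcs) for gene, lfcs in per_gene.items()}

# Hypothetical counts: two sgRNAs per gene, before and after proliferation.
initial = {"sg1": 100, "sg2": 100, "sg3": 100, "sg4": 100}
final = {"sg1": 10, "sg2": 10, "sg3": 190, "sg4": 190}
targets = {"sg1": "RPL14", "sg2": "RPL14", "sg3": "ZNF619", "sg4": "ZNF619"}
scores = crispr_scores(initial, final, targets)
```

In this toy example the sgRNAs targeting RPL14 are depleted, so its score is strongly negative, whereas ZNF619 scores near or above zero. Where to place the cutoff between essential and nonessential genes is a separate choice, as noted above.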

Fig. 2 Identification and characterization of human cell-essential genes.

(A) CSs of all genes in the KBM7 cells. Similar proportions of cell-essential genes were identified on all autosomes. (B) KBM7 GTS distributions. No low-GTS genes were detected on the diploid chromosome 8. (C) CS and GTS of overlapping genes. (D) Yeast homolog essentiality prediction analysis. (E) Broader retention of essential genes across species. (F) Higher sequence conservation of essential genes. (G) Genes with deleterious stop-gain variants are less likely to be essential. (H) Greater connectivity of proteins encoded by essential genes. (I) Higher mRNA transcript levels of essential genes. (J) Genes with paralogs are less likely to be essential. ***P < 0.001 from Kolmogorov-Smirnov test.

We defined the gene-trap score (GTS) as the fraction of insertions in a given gene occurring in the inactivating orientation. Because the accuracy of this score depends on the depth of insertional coverage, we set a requirement on the minimum number (n = 65) of antisense inserts in a gene needed for inclusion in our analysis by measuring the concordance between replicate experiments (Fig. 2B; fig. S1, C and D; table S4; and supplementary text S2). For the 7370 genes on the haploid chromosomes that exceeded this threshold, the GTS was well-correlated with the CS and with results from a copublished study that used a similar gene-trap approach (r = 0.68) (Fig. 2C; fig. S1, E and F; and supplementary text S3). The strong correspondence between the overlapping sets of cell-essential genes defined by the two methods provides support for the accuracy of the CRISPR scores for the full set of 18,166 genes.
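As a rough illustration of the gene-trap score, the sketch below computes the fraction of insertions in the inactivating orientation and applies the coverage threshold of 65 antisense inserts mentioned above. It assumes that antisense insertions are the non-inactivating orientation and that per-gene insertion counts are already available; the gene names and counts are hypothetical.

```python
def gene_trap_scores(insertions, min_antisense=65):
    """Fraction of insertions per gene that fall in the inactivating orientation.

    insertions: dict of gene -> (inactivating_count, antisense_count).
    Genes with fewer than min_antisense antisense insertions are omitted,
    because their scores would rest on too little coverage.
    """
    scores = {}
    for gene, (inactivating, antisense) in insertions.items():
        if antisense < min_antisense:
            continue  # insufficient insertional coverage
        scores[gene] = inactivating / (inactivating + antisense)
    return scores

# Hypothetical insertion counts: (inactivating, antisense).
example = {
    "RPL14": (5, 95),      # few inactivating insertions, low GTS
    "ZNF619": (100, 100),  # no orientation bias, GTS of 0.5
    "LOWCOV": (3, 10),     # below the coverage threshold, excluded
}
gts = gene_trap_scores(example)
```

A gene that tolerates disruption is expected to have a GTS near 0.5, because insertions land in either orientation with similar frequency, whereas an essential gene has a GTS well below 0.5.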

The two methods differed with respect to the diploid chromosome 8. Whereas the gene-trap screen failed to detect any cell-essential genes on this chromosome, the CRISPR screen uncovered a similar proportion of cell-essential genes on all the autosomes (Fig. 2, A to C). These observations indicate that (i) the vast majority of cell-essential genes are haplosufficient and (ii) biallelic inactivation occurs at high frequency in our CRISPR screen (4).

To compare the accuracy of our scores with that of other measures of gene essentiality, we used functional profiling experiments conducted in the yeast Saccharomyces cerevisiae as a benchmark (1, 10). Specifically, we ranked genes common to all data sets by their scores in each of four data sets: CRISPR; gene trap; similar loss-of-function RNA interference (RNAi) screens (11); and, as a naïve proxy for gene essentiality, gene expression levels determined by means of RNA sequencing (RNA-Seq). We then compared these rankings with the essentiality of yeast homologs. The CRISPR and gene-trap methods had significantly stronger correlations with the yeast results than did the RNAi screens or gene expression, which performed similarly to each other (both methods, P < 10⁻⁴, permutation test) (Fig. 2D). On the basis of additional comparisons with yeast gene essentiality, we also found that (i) our new optimized sgRNA library gave better results than those from screens using older unoptimized libraries (4, 5) and (ii) the coverage of this library (~10 constructs per gene) approaches saturation, as evidenced by down-sampling (decreasing the coverage by randomly eliminating subsets of data) (fig. S2, A and B). Together, our results suggest that scores from the CRISPR and gene-trap screens both provide accurate measures of the cell-essentiality of human genes.
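The down-sampling idea can be sketched as follows: keep a random subset of the sgRNA measurements for each gene and recompute the gene-level score. This is only an illustration under assumed inputs (per-sgRNA log2 fold-changes grouped by gene); the function name, the seed handling, and the example values are hypothetical and do not reproduce the analysis in fig. S2.

```python
import random

def downsample_gene_scores(lfc_by_gene, n_keep, seed=0):
    """Recompute mean log2 fold-change per gene from a random subset of sgRNAs.

    lfc_by_gene: dict of gene -> list of per-sgRNA log2 fold-changes.
    n_keep: number of sgRNAs retained per gene (capped at the number available).
    """
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    scores = {}
    for gene, lfcs in lfc_by_gene.items():
        if not lfcs:
            continue  # nothing to sample for this gene
        kept = rng.sample(lfcs, min(n_keep, len(lfcs)))
        scores[gene] = sum(kept) / len(kept)
    return scores

# Hypothetical per-sgRNA log2 fold-changes.
lfcs = {
    "RPL14": [-3.0, -2.5, -3.5, -3.0],
    "ZNF619": [0.1, -0.1, 0.0, 0.2],
}
full = downsample_gene_scores(lfcs, n_keep=4)
reduced = downsample_gene_scores(lfcs, n_keep=2)
```

If the agreement with a benchmark changes little as n_keep falls from the full coverage to a somewhat smaller value, the library is close to saturation in the sense described above.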

Essential genes should be under strong purifying selection and should thus show greater evolutionary constraint than that of nonessential genes (12). Consistent with this expectation, the essential genes found in our screens were more broadly retained across species, showed higher levels of conservation between closely related species, and contained fewer inactivating polymorphisms within the human species, as compared with their dispensable counterparts (Fig. 2, E to G). Essential genes also tend to have higher expression and encode proteins that engage in more protein-protein interactions (13–15). These patterns were also observed in our CRISPR data set (Fig. 2, H and I).

In S. cerevisiae, genes with paralogous copies in the genome show a lower degree of essentiality, presumably because of at least partial functional overlap (16). Surprisingly, meta-analysis of knockout mouse collections has suggested that there is no such correlation in mammals (17, 18). However, others have challenged this interpretation because the genes analyzed were far from a random sample (19). Using the results from our genome-wide screens, we revisited this question and observed that genes with paralogs are indeed less likely to be essential, which is consistent with the idea that paralogs can provide functional redundancy at the cellular level (Fig. 2J).

To examine the functions of the cell-essential genes, we used gene set enrichment analysis (GSEA) and found strong enrichment for many fundamental biological processes, such as DNA replication, RNA transcription, and mRNA translation (fig. S3A) (20). Whereas most of the genes could be assigned to such well-defined pathways, no function has been ascribed to ~330 of the cell-essential genes (18%) (Fig. 3A). For this set of uncharacterized genes, an analysis of the domains within their encoded gene products and comparisons with proteomic data sets from organellar purifications revealed substantial enrichment in proteins found in the nucleolus and those containing domains associated with RNA processing (fig. S3, B and C) (21).

Fig. 3 Functional characterization of previously unidentified cell-essential genes.

(A) Of the 1878 cell-essential genes identified in the KBM7 cell line, 330 genes had no known function. (B) Genes coexpressed with C16orf80, C3orf17, and C9orf114 across CCLE cell lines were associated with RNA processing. Parentheses denote the number of genes in the set. (C) Proliferation of KBM7 cells transduced with sgRNAs targeting C16orf80, C3orf17, and C9orf114, or a nontargeting control. Error bars denote SD (n = 4 experiments per group). (D) C16orf80 localized to the nucleus, and C3orf17 and C9orf114 localized to the nucleolus. Fibrillarin–red fluorescent protein (RFP) was used as a nucleolar marker. GFP, green fluorescent protein. (E) Multiple subunits of the spliceosome, RNase P/MRP, and H/ACA ribonucleoprotein complexes interact with C16orf80, C3orf17, and C9orf114, respectively.

We characterized three such genes (C16orf80, C3orf17, and C9orf114) whose mRNA expression patterns across the Cancer Cell Line Encyclopedia (CCLE) were correlated with those of genes involved in RNA processing (Fig. 3B). We validated the essentiality of these genes in short-term proliferation assays and detected localization of their products to the nucleus (C16orf80) or nucleolus (C3orf17 and C9orf114) (Fig. 3, C and D) (22). Additionally, mass spectrometric analyses of anti-FLAG-immunoprecipitates prepared from KBM7 cells expressing FLAG-tagged C16orf80, C3orf17, and C9orf114 revealed interactions with multiple subunits of the spliceosome, ribonuclease (RNase) P/MRP, and H/ACA small nucleolar ribonucleoprotein (snoRNP) complexes, respectively (Fig. 3E). These results implicate C16orf80 in splicing, which is consistent with its association with mRNAs; C3orf17 in ribosomal RNA/tRNA processing; and C9orf114 in RNA modification (23). More broadly, our results indicate that the molecular components of many critical cellular processes, especially RNA processing, have yet to be fully defined in mammalian cells.

To determine how the set of essential genes differs among cell lines, we screened another CML cell line (K562) and two Burkitt’s lymphoma cell lines (Raji and Jiyoye) using the CRISPR system (tables S2 and S3). Overall, the sets of essential genes in the four cell lines showed a high degree of overlap (Fig. 4A). Out of these four cell lines, the KBM7 CRISPR results showed the highest correlation with the KBM7 gene-trap data set, suggesting that the few differences observed are likely to be biologically meaningful (fig. S4A).

Fig. 4 Comparisons of gene essentiality across four cell lines.

(A) Heatmap of CSs across four cell lines sorted by average CS. (B) CSs of genes residing in the non-pseudoautosomal region of chromosome Y for male cell lines (Raji, Jiyoye, and KBM7). (C) Sanger sequencing of DDX3X in Raji cells reveals mutations in the 5′ splice site of intron 8. (D) Splice-site mutations in DDX3X result in a 69-bp truncation of the mRNA. Reverse transcription polymerase chain reaction (RT-PCR) primers spanning exons 8 and 9 of DDX3X (denoted by green arrows) were used to amplify cDNA from each line. (E) Essentiality of DDX3X and DDX3Y is determined through the expression and mutation status of their paralogs. (F) Proliferation of GFP- and DDX3X-expressing Raji cells transduced with sgRNAs targeting DDX3Y or an AAVS1-targeting control. Error bars denote SD (n = 4 experiments per group). (G) Analysis of genes on chromosome 22q reveals apparent “essentiality” of 61 contiguous genes in K562 residing in a region of high-copy tandem amplification. (H) Proliferation of K562 and KBM7 cells transduced with sgRNAs targeting a nongenic region within the BCR-ABL amplicon in K562 cells or a nontargeting control. Error bars denote SD (n = 4 experiments per group). (I) γH2AX (phospho-S139 H2AX) immunoblot analysis of K562 transduced with sgRNAs as in (H). S6K1 was used as a loading control. (J) Cleavage within amplified region induced erythroid differentiation in K562 cells as assessed by means of 3,3′-dimethoxybenzidine hemoglobin staining. (K) Comparison of gene essentiality between the two cancer types reveals oncogenic drivers and lineage specifiers. Genes were ranked by the difference between the average CS of each cancer type.

We focused first on genes found to be essential in only one of the four cell lines. The Raji, Jiyoye, and KBM7 cell lines had 6, 7, and 19 such genes, respectively (fig. S4B and table S5). One example was DDX3Y, which resides in the non-pseudoautosomal region of the Y chromosome and was required only in Raji cells (Fig. 4B). Its X-linked paralog, DDX3X, was essential in KBM7 and K562 cells (Fig. 4E). Both genes encode DEAD-box helicases that likely have similar cellular functions (24). Thus, the dependence on one paralog might reflect functional absence of the other paralog. Indeed, DNA sequencing of DDX3X in Raji cells revealed hemizygous mutations in the 5′-splice site of intron 8 that resulted in the production of a truncated mRNA transcript (Fig. 4, C and D). Conversely, DDX3Y was not expressed in KBM7 cells and was not present in K562 cells, which are of female origin (Fig. 4E). Introduction of wild-type DDX3X cDNA into Raji cells fully rescued the proliferation defect resulting from DDX3Y loss, indicating that the paralogous genes are essential and functionally overlapping (Fig. 4F). Essential paralogous gene pairs, involved in glucose metabolism (HK1/2 and SLC2A1/3) and cell-cycle regulation (CDK4/6), were also observed in the Jiyoye line (fig. S4C and supplementary text S4). Vulnerabilities due to the loss of a paralogous partner may serve as targets for highly personalized antitumor therapies (25).

In some cases, cell line–specific essentiality of paralogous genes did not reflect differential expression. For example, the transcription factors GATA1 and GATA2 are expressed in both K562 and KBM7 cells, but the first is specifically essential in K562 cells and the second in KBM7 cells (fig. S4D). These master regulators are known to promote proliferation and survival during distinct developmental stages in the hematopoietic lineage; GATA1 is required for the survival of erythroid progenitors, and GATA2 is required for the maintenance and proliferation of immature hematopoietic progenitors (26). These two cell types likely correspond to the cells of origin of the two CML lines (27, 28). We also identified similar instances of genes required for cell line–specific functions such as nuclear factor κB (NF-κB) pathway regulation and homology-directed DNA repair in Raji and KBM7 cells, respectively (fig. S4, E and F, and supplementary text S5).

Whereas the other three cell lines showed only a few cell line–specific essential genes, the K562 cell line was an outlier, with 63 such genes (fig. S4B). Oddly, these genes showed no discernible common functions, and many were not even expressed in the K562 cell line. (Additionally, a few encoded secreted factors whose loss would not be expected to be lethal in a pooled screen.) This mystery was resolved when we examined the chromosomal location of the genes: The majority (39 of the 63 genes) reside near 9q34 and 22q11. These two regions are translocated to produce the BCR-ABL oncogene and are present in a high-copy tandem amplification in K562 cells (fig. S5, A and B) (29). All 61 contiguous genes within the amplicon on 22q11 scored as essential, suggesting that sgRNA-mediated cleavage in this repeated region induces cytotoxicity in a manner unrelated to the function of the target genes themselves (Fig. 4G and fig. S5F). Indeed, two sgRNAs targeting nongenic sites within the amplicon were toxic to K562 but not KBM7 cells. They increased the abundance of phosphorylated histone H2AX (γH2AX), a marker of DNA damage, and induced erythroid differentiation, which occurs upon DNA damage in this cell line (Fig. 4, H to J, and fig. S5C) (30). We obtained similar results in another cell line, HEL, which contains a highly amplified region surrounding the JAK2 tyrosine kinase (fig. S5, D and E). Together, these findings indicate that lethality upon Cas9-mediated cutting may also reflect chromosome structure and therefore should be evaluated in light of copy-number information.
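One simple way to flag this kind of artifact is to look for long runs of adjacent genes that all score as essential, which is unlikely to arise from gene function alone. The sketch below assumes genes are supplied in genomic order for a single chromosome; the cutoff, the minimum run length, and the gene labels are hypothetical, and in practice copy-number data would be the more direct check.

```python
def essential_runs(ordered_genes, scores, cutoff, min_run=5):
    """Find runs of adjacent genes whose scores all fall below the cutoff.

    ordered_genes: gene names in genomic order along one chromosome.
    scores: dict of gene -> CRISPR score; genes without a score break a run.
    Returns a list of runs, each a list of at least min_run gene names.
    """
    runs, current = [], []
    for gene in ordered_genes:
        if gene in scores and scores[gene] < cutoff:
            current.append(gene)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = []
    if len(current) >= min_run:  # a run may extend to the end of the chromosome
        runs.append(current)
    return runs

# Hypothetical scores for ten adjacent genes.
order = [f"g{i}" for i in range(10)]
cs = dict(zip(order, [0.1, -2.0, -2.0, -2.0, 0.0, -2.0, -1.5, -3.0, -2.0, 0.2]))
suspect = essential_runs(order, cs, cutoff=-1.0, min_run=4)
```

Here only the four-gene stretch g5 to g8 is reported; the three-gene stretch g1 to g3 is shorter than the chosen minimum. Runs flagged this way are candidates for comparison against copy-number information rather than proof of an amplification.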

Last, we looked for consistent differences in essential genes between the two CML and two Burkitt’s lymphoma lines. Such genes might represent attractive targets for antineoplastic therapies because their inhibition is less likely to be broadly cytotoxic. Overall, we identified 33 genes that were specifically essential in the CML lines and 15 genes in the Burkitt’s lymphoma lines (Fig. 4K and table S6). As a control, permuted comparisons (that is, a set consisting of one CML line and one Burkitt’s lymphoma line versus the complementary set) showed roughly half as many “set-specific” essential genes (fig. S6, A to C).
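The comparison between cancer types, and the permuted control, can be sketched as a ranking of genes by the difference in average CS between two groups of cell lines, in the spirit of the Fig. 4K legend. The scores below are invented for illustration, and the criteria actually used to call a gene "specifically essential" are not given in the text, so this sketch stops at the ranking.

```python
def mean_cs_difference(scores, group_a, group_b):
    """Rank genes by mean CS in group_a minus mean CS in group_b.

    scores: dict of cell line -> dict of gene -> CRISPR score.
    Only genes scored in every listed cell line are compared.
    Returns (gene, difference) pairs, most negative difference first.
    """
    lines = list(group_a) + list(group_b)
    shared = set(scores[lines[0]])
    for line in lines[1:]:
        shared &= set(scores[line])
    diffs = {}
    for gene in shared:
        mean_a = sum(scores[line][gene] for line in group_a) / len(group_a)
        mean_b = sum(scores[line][gene] for line in group_b) / len(group_b)
        diffs[gene] = mean_a - mean_b
    return sorted(diffs.items(), key=lambda item: item[1])

# Hypothetical scores for three genes in the four cell lines.
cs = {
    "KBM7":   {"ABL1": -3.0, "PAX5": 0.0,  "RPL14": -4.0},
    "K562":   {"ABL1": -3.0, "PAX5": 0.0,  "RPL14": -4.0},
    "Raji":   {"ABL1": 0.0,  "PAX5": -2.0, "RPL14": -4.0},
    "Jiyoye": {"ABL1": 0.0,  "PAX5": -2.0, "RPL14": -4.0},
}
by_type = mean_cs_difference(cs, ["KBM7", "K562"], ["Raji", "Jiyoye"])
permuted = mean_cs_difference(cs, ["KBM7", "Raji"], ["K562", "Jiyoye"])
```

With these toy values the true grouping places ABL1 at one end of the ranking and PAX5 at the other, while the permuted grouping shows no difference for any gene. A broadly essential gene such as RPL14 sits in the middle in both cases.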

In the CML lines, the top two genes were BCR and ABL1, which is consistent with the known essentiality of the BCR-ABL translocation product and the therapeutic effect of BCR-ABL inhibitors such as imatinib (31). Additional members of the BCR-ABL signaling pathway—SOS1, GRB2, and GAB2—scored strongly as well (ranked 3, 4, and 7, respectively). Network analysis of the other top hits also uncovered several genes encoding assembly factors for the electron transport chain, as well as enzymes involved in folate-mediated one-carbon metabolism. These results suggest additional potential targets for CML therapy (table S6).

In the B cell–derived Burkitt’s lymphoma cell lines, the top genes included three B cell–lineage transcription factors: EBF1, POU2AF1, and PAX5 (ranked 3, 6, and 8, respectively). Each of these genes is the target of recurrent translocations in lymphoma (32–34). Enhancers of the corresponding three gene loci all show a high level of bromodomain containing 4 (BRD4) occupancy in Ly1 cells, a related diffuse large B cell lymphoma cell line, suggesting bromodomain inhibitors such as JQ1 as potential treatments (35). Other selectively essential genes included MEF2B (a transcriptional activator of BCL6) and CCND3, both of which are frequently mutated and implicated in the pathogenesis of various lymphomas (36). Intriguingly, the top two hits, CHM and RPP25L, do not appear to have specific roles in B cells; rather, their differential essentiality is likely explained by the lack of expression of their paralogs, CHML and RPP25, in both of the Burkitt’s lymphoma cell lines studied (fig. S6D).

We used two complementary and concordant approaches, CRISPR and gene trap, to define the cell-essential genes in the human genome. Although the gene-trap method is suitable only for loss-of-function screening in rare haploid cell lines, the CRISPR method is broadly applicable. Extending our analysis across different cell lines and tumor types, we developed a framework to assess differential gene essentiality and identify potential drivers of the malignant state. The method can be readily applied to more cell lines per cancer type so as to eliminate idiosyncrasies particular to a given cell line and to more cancer types so as to systematically uncover tumor-specific liabilities that might be exploited for targeted therapies.

Supplementary Materials

www.sciencemag.org/content/350/6264/1096/suppl/DC1

Materials and Methods

Supplementary Text S1 to S5

Figs. S1 to S6

Tables S1 to S6

References (37–50)

References and Notes

  1. Acknowledgments: We thank T. Mikkelsen for assistance with oligonucleotide synthesis; Z. Tsun for assistance with figures; C. Hartigan, G. Guzman, M. Schenone, and S. Carr for mass spectrometric analysis; and J. Down and J. Chen for reagents for hemoglobin staining. This work was supported by the National Institutes of Health (CA103866) (D.M.S.), the National Human Genome Research Institute (2U54HG003067-10) (E.S.L.), an award from the National Science Foundation (T.W.), and an award from the Massachusetts Institute of Technology Whitaker Health Sciences Fund (T.W.). D.M.S. is an investigator of the Howard Hughes Medical Institute. T.W., D.M.S., and E.S.L. are inventors on a U.S. patent application (PCT/US2014/062558) for functional genomics using the CRISPR-Cas system, and T.W. and D.M.S. are in the process of forming a company using this technology. The sgRNA plasmid library and other plasmids described here have been deposited in Addgene.