Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing

See allHide authors and affiliations

Science  22 May 2015:
Vol. 348, Issue 6237, pp. 910-914
DOI: 10.1126/science.aab1601

Chromatin state and the single cell

Identifying the chromatin state of any single cell, which may or may not have a different function or represent different stages relative to others collected within any single culture, experiment, or tissue, has been challenging. Cusanovitch et al. skirted previously identified technological limitations to identify regions of accessible chromatin at single-cell resolution. Combinatorial cellular indexing, a strategy for multiplex barcoding of thousands of single cells per experiment, was successfully used to investigate the genome-wide chromatin accessibility landscape in each of over 15,000 single cells.

Science, this issue p. 910


Technical advances have enabled the collection of genome and transcriptome data sets with single-cell resolution. However, single-cell characterization of the epigenome has remained challenging. Furthermore, because cells must be physically separated before biochemical processing, conventional single-cell preparatory methods scale linearly. We applied combinatorial cellular indexing to measure chromatin accessibility in thousands of single cells per assay, circumventing the need for compartmentalization of individual cells. We report chromatin accessibility profiles from more than 15,000 single cells and use these data to cluster cells on the basis of chromatin accessibility landscapes. We identify modules of coordinately regulated chromatin accessibility at the level of single cells both between and within cell types, with a scalable method that may accelerate progress toward a human cell atlas.

Chromatin state is dynamically regulated in a cell type–specific manner (1, 2). To identify active regulatory regions, sequencing of deoxyribonuclease I (DNase I) digestion products [DNase-seq (3)] and assay for transposase-accessible chromatin using sequencing [ATAC-seq (4)] measure the degree to which specific regions of chromatin are accessible to regulatory factors. However, these assays measure an average of the chromatin states within a population of cells, masking heterogeneity between and within cell types.

Single-cell methods for genome sequence (5), transcriptomes (610), DNA methylation (11), and chromosome conformation (12) have been reported. However, we presently lack technologies for genome-wide, single-cell characterization of chromatin state. Furthermore, a limitation of most such methods is that single cells are individually compartmentalized, and the nucleic acid content of each cell is biochemically processed within its own reaction volume (1316). Processing of large numbers of cells in this way can be expensive and labor intensive, and it is difficult to work with single cells, small volumes, and low nucleic acid inputs.

We recently used combinatorial indexing of genomic DNA fragments for haplotype resolution or de novo genome assembly (17, 18). Here, we adapt the concept of combinatorial indexing to intact nuclei to acquire data from thousands of single cells without requiring their individualized processing (Fig. 1A). First, we molecularly barcode populations of nuclei in each of many wells. We then pool, dilute, and redistribute intact nuclei to a second set of wells, introduce a second barcode, and complete library construction. Because the overwhelming majority of nuclei pass through a unique combination of wells, they are “compartmentalized” by the unique barcode combination that they receive. The rate of “collisions”—i.e., nuclei coincidentally receiving the same combination of indexes—can be tuned by adjusting how many nuclei are distributed to the second set of wells (fig. S1) (19).

Fig. 1 Schematic of combinatorial cellular indexing and validation for measuring single-cell chromatin accessibility.

(A) Nuclei are isolated and molecularly tagged in bulk with barcoded Tn5 transposases in wells (different barcodes are represented by the different colors outlining the nuclei). Nuclei are then pooled and a limited number redistributed into a second set of wells. A second barcode (represented by the color filling each nucleus) is introduced during PCR. (B) Scatterplot of number of reads mapping uniquely to human or mouse genome for individual barcode combinations. (C) Fragment size distribution for single-cell ATAC-seq versus published bulk ATAC-seq (4). (D) Box plot of the fraction of reads mapping to ENCODE-defined DHSs for individual Patski and GM12878 cells.

We sought to integrate combinatorial cellular indexing and ATAC-seq to measure chromatin accessibility in large numbers of single cells. In ATAC-seq, permeabilized nuclei are exposed to transposase loaded with sequencing adapters [“tagmentation” (4, 20)]. In the context of chromatin, the transposase preferentially inserts adapters into nucleosome-free regions. These “open” regions are generally sites of regulatory activity and correlate with DNase I hypersensitive sites (DHSs).

In the integrated method, we molecularly tag nuclei in 96 wells with barcoded transposase complexes (Fig. 1A) (1719). We then pool, dilute, and redistribute 15 to 25 nuclei to each of 96 wells of a second plate, using a cell sorter. After lysing nuclei, a second barcode is introduced during polymerase chain reaction (PCR) with indexed primers complementary to the transposase-introduced adapters. Finally, all PCR products are pooled and sequenced, with the expectation that most sequence reads bearing the same combination of barcodes will be derived from a single cell (estimated collision rate of ~11% for experiments described here) (fig. S1).

As an initial test, we mixed equal numbers of nuclei from human (GM12878) and mouse [Patski (21)] cell lines, performed combinatorial cellular indexing, and sequenced the resulting library. Although at least one mappable read was observed for most of the 9216 (96 × 96) possible barcode combinations, most barcodes were associated with very few reads. We used a conservative cutoff of 500 reads per cell (19), retaining 533 barcode combinations for further analysis (fig. S2A) (range: 502 to 69,847 reads per barcode combination; median: 2503). A high PCR duplication rate (~73% of mappable, nonmitochondrial reads) confirmed that the library had been sequenced to saturation. We estimate that we recovered 13 to 55% of the molecular complexity that we could expect to recover based on complexity estimates for bulk, 500-cell ATAC-seq experiments (4, 19).

If each barcode combination represents either a mouse or human nucleus, then its corresponding reads should map overwhelmingly to either the mouse or human genome. Indeed, we observe that ~93% of 533 barcode combinations had >90% of their reads mapping to mouse (n = 290) or human (n = 207) (Fig. 1B). In addition, these data retain signals of chromatin accessibility in relation to nucleosome hindrance of insertion events (Fig. 1C). Furthermore, 52% of reads from mouse and 50% of reads from human single cells overlapped reference DHS maps [ENCODE (19, 22)] for these cell lines (20-fold and 34-fold enrichments, respectively) (Fig. 1D and table S1).

We next sought to distinguish single cells from the same species. We mixed pairs of cell lines (HEK293T or HL-60 versus GM12878), performed combinatorial cellular indexing, and sequenced the resulting libraries to saturation (65% duplicate rate). For the mixture of HEK293T and GM12878, we recovered 748 cells with ≥500 reads (fig. S2B) (range: 502 to 28,712 reads; median: 1685 reads). Focusing on reads mapping to previously defined cell-type exclusive DHS sites (fig. S3A) (19, 22), we observe a bimodal distribution, with nearly all cells assignable to one of the two cell types (~95% of 748; defined by ≥70% of reads mapping to cell type–specific DHSs corresponding to one cell type or the other) (Fig. 2A). The fraction of reads mapping to reference DHSs in single cells was again strongly enriched [41% (14-fold enrichment) for HEK293T and 52% (18-fold enrichment) for GM12878)] (Fig. 2B and table S1). About 57% of 181,379 distinct sites from the reference DHS maps were observed as accessible in at least one cell. Some fraction of these may be spurious overlaps, but this provides an upper bound on the number of DHSs for which we recovered accessibility information. Individual cells ranged in coverage of this DHS map from 29 to 5890 sites (fig. S4) (median: 429 sites).

Fig. 2 Single-cell ATAC-seq deconvolutes human cell-type mixtures.

(A to C) GM12878/HEK293T nuclei. (D to F) GM12878/HL-60 nuclei. [(A) and (D)] Histograms of proportions of reads mapping to cell type–specific DHSs that correspond to one cell type or the other. [(B) and (E)] Box plots of the overall fraction of reads mapping to ENCODE-defined DHSs for individual cells. [(C) and (F)] Multidimensional scaling of single-cell ATAC-seq data using pairwise Jaccard distances between cells based on DHS usage. Cell-type assignments based on proportions shown in (A) and (D).

For the mixture of HL-60 and GM12878, we recovered 700 cells (fig. S2C) (range: 500 to 21,887 reads; median: 1390 reads; 64% duplicate rate). Although both are representative of the hematopoietic lineage, 94% of cells were assignable based on the same criteria used for HEK293T/GM12878 (Fig. 2D and fig. S3B). The fraction of reads mapping to reference DHSs was again strongly enriched [55% (16-fold enrichment) for HL-60 and 59% (18-fold enrichment) for GM12878] (Fig. 2E and table S1). About 46% of 230,632 distinct sites from the reference DHS maps were observed as accessible in at least one cell, with individual cells ranging in coverage from 72 to 4687 sites (fig. S4) (median: 442 sites).

We next examined whether single cells within a heterogeneous mixture could be clustered in an unsupervised manner. Importantly, at the level of single cells, chromatin accessibility is a nearly binary phenomenon (~2 genome equivalents per cell), in contrast with the dynamic range of mRNA transcripts within single cells. Thus, we reasoned that we would require observations across each of many single cells to generate quantitative estimates for accessibility of a particular site in a particular cell type, within a heterogeneous population.

For each cell-type mixture, we defined the union of ENCODE DHSs [analogous to how RNA-seq transcript quantification relies on a catalog of transcript models (19)] and created a binary matrix where DHS sites were scored as “used” or “unused” in each cell. We then calculated Jaccard distances between pairs of cells on the basis of the degree of shared DHS usage. Applying multidimensional scaling to these distances, the first dimension was strongly correlated with the read depth of each cell (fig. S5) (Spearman’s rho of ~0.95), whereas the second dimension separated cells consistently with our crude cell-type assignments (Fig. 2, C and F). The extent of discrimination between cell types is proportional to read depth, but even with relatively few reads, individual cells can be clustered on the basis of shared DHS usage alone. To evaluate whether our data provided reproducible and quantitative estimates of the accessibility of DHSs, we used GM12878-assigned cells from all three experiments described above as biological replicates. For each experiment, we summed the number of cells “using” each site and compared these counts between replicates (Spearman’s rho’s of 0.64 to 0.69, or 0.54 to 0.62 when restricted to sites observed in ≥5 cells in each replicate) and also compared them with bulk ATAC-seq measurements from 500 GM12878 cells (fig. S6) [Spearman’s rho’s of 0.61 to 0.7 (4)]. This positive correlation shows that sites that are more sensitive in bulk experiments are also more commonly observed in single cells. Furthermore, these correlations are not far from the range of 0.64 to 0.72 for replicate bulk measurements from the 500-cell ATAC-seq libraries.

To identify individual DHSs with significant differences in accessibility between different cell types (based on single-cell data from the GM12878/HL-60 mixture), we performed likelihood ratio tests within the framework of a generalized linear model. We identified 1666 sites [out of 52,479 DHSs tested (19)] that were differentially accessible at a false discovery rate (FDR) of 0.05. Interestingly, only about half of these sites are cell-type exclusive in the reference DHS maps (381 GM12878-exclusive and 472 HL-60-exclusive); differentially accessible DHSs are marginally enriched for GM12878-specific sites (hypergeometric P = 0.04) and strikingly enriched for HL-60 sites (P = 2.2 × 10−15). They are also larger [1184 base pairs (bp) versus 580 bp median; Wilcoxon rank sum P = 3.4 × 10−247], observed in more cells (10 cells versus 3 cells median; Wilcoxon rank sum P ≈ 0), and enriched for “enhancer” (hypergeometric P = 4.3 × 10−12), “repressed” (P = 1.5 × 10−57), “transcribed” (P = 7.4 × 10−25), and “transcription start site” (P = 5.1 × 10−3) annotations in GM12878, relative to sites not identified as differentially accessible (Fig. 3A) (19).

Fig. 3 Single-cell ATAC-seq identifies functionally relevant differences in accessibility between cell types.

(A) Bar plot for relative fraction of DHSs overlapping each chromatin state (HL-60 versus GM12878). Gray bars show frequencies for all sites tested. Blue bars show frequencies for differentially accessible sites. CTCF, CTCF-enriched element; E, predicted enhancer; PF, predicted promoter flanking region; R, predicted repressed; T, predicted transcribed; TSS, predicted promoter region; WE, predicted weak enhancer. *, significant difference in proportions. Values do not add to 1 because sites can overlap multiple chromatin states. (B) Multidimensional scaling of chromatin accessibility data for 14,533 cells (GM12878/HL-60 mixtures from 13 experiments on four dates). (C) Heat map of hypersensitive site usage for 10,241 cells (columns) at 21,378 DHSs (rows) (GM12878/HL-60 mixtures). Colors indicate accessibility of sites after latent semantic indexing. Top color bar is coded by cell-type assignments (green, HL-60; blue, GM12878; black, unassigned). Left color bar indicates modules formed by clustering DHSs.

We next linked differentially accessible sites defined from single cells to the genes they potentially regulate (2) and compared these to genes differentially expressed between GM12878 and HL-60 (19). Of 8268 genes linked to ≥1 DHS and expressed in both cell types, 4095 were differentially expressed and 2211 were linked to ≥1 differentially accessible DHS (FDR 0.05). Although the DHS-gene linkages are imperfect, we observe a significant overlap of differentially expressed and differentially accessible genes (1162 genes overlap; hypergeometric P = 4.8 × 10−4). The genes linked to DHSs identified as differentially accessible are enriched for lymphoid and myeloid lineage annotations—e.g., “cytokine signaling” and “antigen processing” (figs. S7 and S8).

To optimize combinatorial cellular indexing, we tested 12 conditions on 3 days, always with GM12878/HL-60 mixtures. We collected as many as nearly 1500 cells in a single experiment, and we improved the median read depth to >3000 per cell in some experiments (figs. S9 to S11). We merged chromatin accessibility maps for 14,533 single cells (all GM12878/HL-60) and conducted multidimensional scaling. Although the actual mixture proportion varied between experiments, the clustering of the two cell types was highly robust to experimental condition (Fig. 3B). With this full complement of cells, ~96% of 230,632 potential sites in our DHS reference map are observed as accessible in at least one cell (individual cells covering between 4 and 12,333 sites (median: 664 sites) (fig. S4).

We used latent semantic indexing to reduce the dimensionality of this matrix [after filtering out low coverage cells and rarely used sites (19)], yielding a heat map of chromatin accessibility for 10,241 cells at 21,378 DHSs (Fig. 3C and fig. S12). This resulted in two large clades corresponding to the two cell types, while also identifying the subset of sites underlying that separation. Additionally, we observe a number of smaller modules of DHSs that exhibit coordinately regulated chromatin accessibility. Linking these sites again to the genes they potentially regulate (2), the major modules are enriched for gene ontology terms consistent with the two cell types (e.g., “osteoclast differentiation” for a module more open in HL-60) (Fig. 3C and figs. S13 and S14).

To evaluate cell-to-cell variation within a cell type, we took the subset of cells classified as GM12878 and repeated latent semantic indexing (19), yielding a heat map of chromatin accessibility for 4118 cells at 22,755 DHSs. Hierarchical clustering identified four major subgroups of single cells and seven modules of coordinately regulated chromatin accessibility (Fig. 4A). These modules of DHSs are enriched for binding by particular transcription factors (hypergeometric FDR 0.10) (fig. S15), in some cases quite strongly, and are linked to genes associated with immune response, cell cycle regulation, and other processes (figs. S16 and S17). Importantly, although we included samples from experiments conducted on different days, the cell subtypes do not cluster by experiment (figs. S18 and S19), and the enrichments for transcription factor binding within subtype-defining modules are apparent even with subsets of the data (figs. S20 and S21). Sites in modules 1 and 2 are highly enriched for binding by transcription factors such as nuclear factor κB (NF-κB) and other factors downstream of the B cell receptor (19). The four GM12878 subtypes appear principally defined by the activation status of these two modules, suggesting that variability across the cells is driven by NF-κB activity. These results indicate that even within an apparently homogeneous cell type, we are able to identify subsets of cells with differences in their regulatory landscape related to cell cycle and possibly environmental signals. Focusing on individual loci within GM12878, we observe sets of regulatory sites that exhibit patterns of coordinated regulation (e.g., LYN, encoding a tyrosine kinase involved in B cell signaling) (Fig. 4B), although reproducibility of these patterns across biological replicates was modest (fig. S22). Given the sparsity of the data, identifying pairs of coaccessible DNA elements within individual loci is statistically challenging and merits further development.

Fig. 4 Single-cell ATAC-seq identifies GM12878 subtypes.

(A) Heat map of chromatin accessibility measures after latent semantic indexing of DHS usage shows that GM12878 cells cluster into subpopulations. Modules of coordinately accessible chromatin accessibility are significantly enriched for binding of selected transcription factors (TFs) (examples on right). (B) Detailed depiction of LYN locus. The top shows coaccessibility scores between the transcription start sites and four putative enhancers in the region, which are Pearson correlation values of latent semantic indexing–based accessibility scores between cells, for six DHSs present in this region. Height and thickness of each loop indicates the strength of correlation (red, positive; blue, negative). Middle shows in which subtypes [defined in top bar of (A)] these elements are most often accessible. Bottom shows ENCODE data for this region from the University of California–Santa Cruz browser, including transcript model, DHS peaks, chromatin immunoprecipitation sequencing (ChIP-seq) binding profiles for several TFs, and predicted chromatin state.

We report chromatin accessibility maps for >15,000 single cells. Our combinatorial cellular indexing scheme could feasibly be scaled to collect data from ~17,280 cells per experiment by using 384-by-384 barcoding and sorting 100 nuclei per well (assuming similar cell recovery and collision rates) (fig. S1) (19). Particularly as large-scale efforts to build a human cell atlas are contemplated (23), it is worth noting that because DNA is at uniform copy number, single-cell chromatin accessibility mapping may require far fewer reads per single cell to define cell types, relative to single-cell RNA-seq. As such, this method’s simplicity and scalability may accelerate the characterization of complex tissues containing myriad cell types, as well as dynamic processes such as differentiation.

Supplementary Materials

Materials and Methods

Figs. S1 to S22

Tables S1 and S2

References (2439)

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: We thank the Trapnell and Shendure laboratories, particularly R. Hause, C. Lee, V. Ramani, R. Qiu, Z. Duan, and J. Kitzman, for helpful discussions; D. Prunkard, J. Fredrickson, and L. Gitari in the Rabinovitch laboratory for their exceptional assistance in flow sorting; and C. Disteche for the Patski cell line. This work was funded by an NIH Director’s Pioneer Award (1DP1HG007811 to J.S.) and a grant from the Paul G. Allen Family Foundation (J.S.). C.T. is supported in part by the Damon Runyon Cancer Research Foundation (DFS-#10-14). All sequencing data are available from the NIH National Center for Biotechnology Information Gene Expression Omnibus (accession number GSE68103). L.C., K.L.G., and F.J.S. declare competing financial interests in the form of stock ownership and paid employment by Illumina, Inc. One or more embodiments of one or more patents and patent applications filed by Illumina may encompass the methods, reagents, and data disclosed in this manuscript. All methods for making the transposase complexes are described in (18); however, Illumina will provide transposase complexes in response to reasonable requests from the scientific community subject to a material transfer agreement. Some work in this study is related to technology described in patent applications WO2014142850, 2014/0194324, 2010/0120098, 2011/0287435, 2013/0196860, and 2012/0208705.
View Abstract

Navigate This Article