Research Article

# Combinatorial labeling of single cells for gene expression cytometry

See allHide authors and affiliations

Science  06 Feb 2015:
Vol. 347, Issue 6222, 1258367
DOI: 10.1126/science.1258367

## Single-cell expression analysis on a large scale

To understand why cells differ from each other, we need to understand which genes are transcribed at a single-cell level. Several methods measure messenger RNA (mRNA) expression in single cells, but most are limited to relatively low numbers of cells or genes. Fan et al. labeled each mRNA molecule in a cell with both a cellular barcode and a molecular barcode. Further analysis did not then require single-cell technologies. Instead, the labeled mRNA from all cells was pooled, amplified, and sequenced, and the gene expression profile of individual cells was reconstructed based on the barcodes. The technique successfully revealed heterogeneity across several thousand blood cells.

Science, this issue 10.1126/science.1258367

## Structured Abstract

### INTRODUCTION

The measurement of specific proteins and transcripts in individual cells is critical for understanding the role of cellular diversity in development, health, and disease. Flow cytometry has become a standard technology for high-throughput detection of protein markers on single cells and has been widely adopted in basic research and clinical diagnostics. In contrast, nucleic acid measurements such as mRNA expression are typically conducted on bulk samples, obscuring the contributions from individual cells. Ideally, in order to characterize the complexity of cellular systems, it is desirable to have an affordable approach to examine the expression of a large number of genes across many thousands of cells.

### RATIONALE

Here, we have developed a scalable approach that enables routine, digital gene expression profiling of thousands of single cells across an arbitrary number of genes, without using robotics or automation. The approach, termed “CytoSeq,#x201D; employs a recursive Poisson strategy. First, single cells are randomly deposited into an array of picoliter wells. A combinatorial library of beads bearing cell- and transcript-barcoding capture probes is then added so that each cell is partitioned alongside a bead. The bead library has a diversity of ~106 so that each cell is paired with a unique cell barcode, whereas the transcript barcode diversity is ~105 so that each mRNA molecule within a cell becomes specifically labeled. After cell lysis, mRNAs hybridize to beads, which are pooled for reverse transcription, amplification, and sequencing. Because cDNAs from all polyadenylated transcripts of each cell are covalently archived on the bead surface, any selection of genes can be analyzed. The digital gene expression profile for each cell is reconstructed when barcoded transcripts are assigned to the cell of origin and counted.

### RESULTS

We applied CytoSeq to characterize complex heterogeneous samples in the human hematopoietic system by examining thousands of cells per experiment. In addition to surface proteins that are traditional cell type markers, we examined genes coding for cytokines, transcription factors, and intracellular proteins of various cellular functions that may not be readily accessible by flow cytometry. We demonstrated the ability to identify major subsets within human peripheral blood mononuclear cells (PBMCs). We compared cellular heterogeneity in resting CD3+ T cells versus those stimulated with antibodies to CD3 and CD28, as well as resting CD8+ T cells versus those stimulated with CMV peptides, and identified the rare cells that were specific to the antigen. Highlighting the specificity of large-scale single-cell analysis compared with bulk sample measurements, we found that the up-regulation of a number of genes in the stimulated samples originated from only a few cells (<0.1% of the population).

### CONCLUSION

The routine availability of gene expression cytometry will help transform our understanding of cellular diversity in complex biological systems and drive novel research and clinical applications. The massively parallel single-cell barcoding strategy described here may be applied to assay other biological molecules, including other RNAs, genomic DNA, and the genome and the transcriptome together.

## Abstract

We present a technically simple approach for gene expression cytometry combining next-generation sequencing with stochastic barcoding of single cells. A combinatorial library of beads bearing cell- and molecular-barcoding capture probes is used to uniquely label transcripts and reconstruct the digital gene expression profile of thousands of individual cells in a single experiment without the need for robotics or automation. We applied the technology to dissect the human hematopoietic system and to characterize heterogeneous response to in vitro stimulation. High sensitivity is demonstrated by detection of low-abundance transcripts and rare cells. Under current implementation, the technique can analyze a few thousand cells simultaneously and can readily scale to 10,000s or 100,000s of cells.

Understanding cellular diversity in large collections of cells requires the measurement of specific proteins or transcripts expressed by individual cells. Flow cytometry is well established as an important tool for detecting proteins in single cells. In contrast, RNA expression measurements are typically conducted on bulk samples, obscuring individual cell information. Although single-cell RNA expression measurements using microtiter plates or commercial microfluidic chips have recently been reported (19), these approaches require elaborate instrumentation and are expensive to implement for routine transcriptome analysis of large numbers of cells. Other methods based on nanoliter or picoliter partitioning of single cells followed by reverse transcription polymerase chain reaction (RT-PCR) (1012) allow for the analysis of large numbers of cells but are practical for only a few genes at a time. To characterize the contributions of individual cells to a complex biological system, it is desirable to have an affordable and accessible method to couple large numbers of cells with large numbers of genes.

Here, we have developed a scalable approach that enables routine, digital gene expression profiling of thousands of single cells across an arbitrary number of genes. Microscale engineering and combinatorial techniques were used to label all mRNA molecules in a cell with a unique cellular barcode in a single massively parallel step. In addition, each transcript copy within a cell was indexed with a molecular barcode (7, 1317), allowing PCR bias correction and absolute digital gene expression measurements. Labeled mRNA molecules from all cells were pooled, amplified, and sequenced. The digital gene expression profile of each cell was reconstructed using the cell and molecular barcodes on each sequence. This technology, which we term CytoSeq, enables the equivalent of protein flow cytometry for gene expression. We have applied the technique to multiparameter genetic classification of the hematopoietic system and demonstrated its use for characterizing cellular heterogeneity in immune response and for identifying rare cells in a population.

## Results

### Cell capture, labeling, and sequencing

The CytoSeq procedure is outlined in Fig. 1A. First, a cell suspension is loaded onto a microfabricated surface with up to 100,000 microwells. Each 30-μm-diameter microwell contains a volume of ~20 picoliters. The number of cells in the suspension is adjusted so that only about 1 out of 10 wells receives a cell. Cells simply settle into wells by gravity.

Next, the bead library is loaded onto the microwell array to saturation, so that most wells become filled. The dimensions of the beads and wells have been optimized to prevent double occupancy of beads. Each 20-μm bead has been functionalized with tens to hundreds of millions of oligonucleotide primers of the structure outlined in Fig. 1B. Oligonucleotides consist of a universal PCR priming site, followed by a combinatorial cell label, a molecular index, and an mRNA capture sequence of oligo(deoxythymidine) [oligo(dT)]. All primers on each bead share the same cell label but incorporate a diversity of molecular indices. A combinatorial split-pool synthesis method was devised to generate the bead library attaining a cell labeling diversity of close to 1 million. Because only ~1% of the total available cell label diversity is used, the probability of having two single cells tagged with the same label is low (on the order of 10−4). Similarly, the diversity of the molecular labels on a single bead is on the order of 105, so that the likelihood of two transcript molecules of the same gene from the same cell tagged with the same molecular index is low. Once single cells and beads are colocalized in the microwells, a lysis buffer is applied onto the surface and allowed to diffuse inside. The high local concentration of released mRNAs (10s of nanomolar) effectively drives their hybridization onto the beads.

After mRNA capture, all beads are magnetically retrieved from the microwell array. From this point on, all reactions are carried out in a single tube. After reverse transcription, cDNA molecules synthesized on each bead become encoded with cell and molecular barcodes and serve as amplification templates (fig. S1).

Sequencing of amplification products reveals the cell label, the molecular index, and the gene identity (fig. S1). Computational analysis groups the reads based on the cell label and collapses the reads with the same molecular index and gene sequence into a single entry to correct for amplification bias, allowing the determination of absolute transcript numbers for each gene in each cell.

### Identifying cell types in cell mixtures

To demonstrate the ability of CytoSeq to identify individual cells among a population of two cell types, a ~1:1 mixture of K562 (myelogenous leukemia) and Ramos (Burkitt’s lymphoma) cells was loaded onto a partial array of ~25,000 microwells. A panel of 12 genes was selected and amplified from the cDNA beads and sequenced. The panel consisted of five genes specific for K562 cells, six genes specific for Ramos cells, and the common housekeeping gene GAPDH (table S1A). The majority of the sequencing reads (~78%) were associated with 765 unique cell labels (fig. S2).

The gene expression profile of each cell was clustered using principal component analysis (PCA) (Fig. 2A). The first principal component (PC) separated the cells into two major clusters based on cell type. The genes that contribute to the positive side of the first PC were specific to Ramos, whereas the genes that contributed to the negative side of the same PC were specific to K562. The second PC highlighted the high degree of variability in fetal hemoglobin (HBG1) expression within the K562 cells, which has been observed previously (15, 18).

Next, we spiked in a small number of Ramos cells into primary B cells from a healthy individual. A panel of 111 genes (table S1B) known to be involved in B cell function (1921) was analyzed across 1198 cells. Eighteen cells (~1.5% of the population) were found to have a distinct gene expression pattern (Fig. 2B). Genes preferentially expressed by this group are known to be associated with Burkitt’s lymphoma and include MYC and immunoglobulin M (IgM) and markers (CD10, CD20, CD22, and BCL6) associated with follicular B cells from which Burkitt’s lymphoma originates (22) (Fig. 2C and fig. S3). In addition, this group of cells contained higher levels of CCND3 and GAPDH, as well as an overall higher mRNA content (Fig. 2B). This finding is consistent with the fact that lymphoma cells are physically larger than primary B cells in normal individuals and that they are rapidly proliferating and transcriptionally more active.

Several methods were used to estimate the RNA capture efficiency of CytoSeq. First, GAPDH in 10 pg of Ramos total RNA was measured to be ~343 copies using digital RT-PCR. Additionally, we tested individual Ramos cells using a sensitive molecular indexing technique (15) and determined an average of 214 ± 36 (SE) GAPDH transcripts per cell. These results were comparable to the 152 ± 10 (SE) copies yielded by CytoSeq. Further, we compared CytoSeq measurements with bulk RNA-seq data of ~20 million Ramos cells obtained from Sultan et al. (23). We observed a high correlation of gene expression levels (R2 = 0.946) across three orders of magnitude (fig. S4).

### Dissecting human PBMCs

Most biological samples, such as blood, contain diverse populations of numerous cell types and states with subtle differences in gene expression profiles. We applied CytoSeq to identify all the major cell types in human peripheral blood mononuclear cells (PBMCs), including monocytes, natural killer (NK) cells, and the different T and B cell subsets. Unlike traditional immunophenotyping that is limited mostly to surface protein markers, we included 98 genes for cytokines, transcription factors, and intracellular proteins of various cellular functions in addition to surface proteins (table S1C). We analyzed the digital gene expression profile of 632 PBMCs (Fig. 3 and fig. S5). The first PC separated monocytes and lymphocytes into two orthogonal clusters by the expression of CD14, CD16a, S100A12, and CCR2 in one cluster and lymphocyte-associated genes in the other. Subtypes of lymphocytes were located in a continuum along the second PC, with B cells [expressing immunoglobulins M and D (IgM, IgD), TCL1A, CD20, CD24, and PAX5) at one end, naïve T cells (expressing CD4, CCR7, and CD62L]) in the middle, and cytotoxic T cells (expressing CD8A, CD8B, EOMES, and PRF1) at the other end. Natural killer cells expressing killer-like immunoglobulin receptor (KIR), CD16a, and perforin (PRF1) were localized in the space between monocytes and cytotoxic T cells. We also observed that GAPDH, an enzyme involved in cellular metabolism, was expressed at highest levels in monocytes and lowest in B cells. Correlation analysis of gene expression profiles reiterated observations with PCA and revealed additional smaller subsets of cells within each major cell type (fig. S6). A replicate experiment with PBMCs from the same individual using 731 cells yielded similar segregation and cell type frequencies (fig. S7).

### Measuring heterogeneity in activated T cells

When measurements are performed on a bulk sample, the observed gene expression levels cannot be assigned to individual cells, and large expression changes from a small number of cells cannot be distinguished from small changes in a large number of cells. The overall gene expression pattern is therefore dependent on the cell composition and the expression level of each gene in each cell type or subtype. To illustrate, we studied the variability of response of human T cells to an in vitro stimulus.

CD3+ T cells, isolated by negative selection from a blood donor, were stimulated with antibodies to CD28 and CD3 for 6 hours and analyzed by CytoSeq. A separate sample of unstimulated cells was also tested. Expression of a panel of 93 genes (table S1D) that included surface proteins, cytokines, and effector molecules expressed by different T cell subsets was measured (2426). A total of 3517 and 1478 cells were analyzed for the stimulated and unstimulated samples, respectively.

In the unstimulated sample, PCA analysis revealed two major subsets of cells. Examination of the genes enriched showed that one subset represented cytotoxic T cells with expression of CD8A, CD8B, NKG2D, GZMA, GZMH, GZMK, and EOMES, and the other subset represented helper T cells with expression of CD4, CCR7, and SELL (Fig. 4A and fig. S8).

In the stimulated sample, two branches of cells were observed on the PCA plot (Fig. 4B and fig. S9). The first principal component represented cellular response in terms of IRF4, gamma interferon (IFNG), CD69, tumor necrosis factor (TNF), and GAPDH expression. CCL3, CCL4, and GZMB, cytokines and effector molecules expressed in cytotoxic T cells, and LAG3, a marker associated with exhausted cells, were localized to cells in the upper branch. Expression of interleukin 2 (IL2), LTA, CCL20, and CD40LG, cytokines and surface protein associated with T helper cells (TH1), were localized to cells in the lower branch. Additional genes, including IL4R, PRDM1, TBX21, ZBED2, MYC, FOSL1, CSF2, TNFRSF9, and BCL2, were expressed to various degrees in a smaller number of cells (fig. S9). Most of these cytokines, effector molecules, and transcription factors were either not expressed or were expressed at very low levels by cells in the unstimulated sample. Although most of the cells that responded within this short period of stimulation were presumably memory cells, we observed a small population of cells that produced lower levels of IL2 and no other cytokines or effector molecules and might represent naïve cells (Fig. 4B, arrow).

To fully appreciate the heterogeneity in response, cells were clustered based on pairwise correlation coefficients. Although the two main groups of helper and cytotoxic cells were observed, there were additional smaller subsets of cells as well as considerable diversity within each subset in terms of the combination and level of activated genes expressed (fig. S10, A and B).

A few cytokines, namely IL4, IL5, IL13, IL17F, IL22, LIF, IL3, and IL21, were highly up-regulated in the stimulated sample as a whole as compared with the unstimulated sample, but were contributed only by a few cells in the sample (Fig. 4C). Subsets of these cytokines were expressed by the same cells (fig. S11). For instance, just one cell coexpressed IL17F and IL22, a signature for TH17 cells. Another seven cells expressed various combinations of IL4, IL5, and IL13, signatures of TH2 cells. The observed frequencies of TH2 and TH17 cells, which primarily reside in secondary lymphoid tissues and are rare in peripheral blood, agreed with those previously measured by flow cytometry (27, 28). The low frequency of specialized cells contributing to large changes in overall gene expression highlights the importance of sampling large number of single cells.

The large up-regulation of a set of genes in very few cells served as an important control to measure system cross-talk. Although any uncaptured RNA should be diluted into the vast volume of lysis buffer above the microwells, it is possible that a highly expressed mRNA could contaminate beads in neighboring microwells. However, high transcript counts for IL17F and IL22 (~230 copies) were restricted to just one cell, with no indication of significant contamination across other beads, suggesting a low level of any cross-talk that might have occurred.

Analysis of T cells from a second blood donor (669 and 595 stimulated and unstimulated cells, respectively) also demonstrated similar partitioning of cells into the two main T cell branches (figs. S8, S9, and S10, C and D).

### Identifying rare antigen-specific T cells

Fresh blood from the previous two individuals, who were seropositive for cytomegalovirus (CMV), was exposed to a CMV pp65 peptide pool. An untreated blood sample from each donor served as a control. Using negatively selected CD8+ T cells, we analyzed 581 CMV-exposed cells and 253 unexposed cells for donor 1, and 2274 exposed and 2337 unexposed cells for donor 2.

Donor 1’s negative control did not yield a sufficient number of cells to form well-separated clusters. The rest of the samples showed two main groups of cells (Fig. 5 and figs. S12 and 13). Cells in one group expressed naïve and central memory-associated genes (SELL, CCR7, and CD27), whereas cells in the other group expressed effector memory (CCL4, CX3CR1, and CXCR3) and effector-associated genes (EOMES, GZMA, GZMB, GZMH, TBX21, and ZNF683) (25, 26, 29). There was a small but distinct cluster of cells expressing granzyme K (GZMK) and another cluster expressing HLA class II histocompatibility antigen DR alpha chain (HLA-DRA). The differential expression of the different types of granzymes has been reported previously (30). Our results here recapitulated those observed in CyTOF experiments with CD8+ T cells (31).

Although a considerable proportion of cells responded to antigen exposure by expression of CD69 and MYC (fig. S13A), only a few cells expressed IFNG, a signature cytokine for activated antigen-specific cells. The IFNG-expressing cells were among the most transcriptionally active and belong to the effector memory and effector cell cluster (Fig. 6 and fig. S14). We identified 5 out of 581 (0.86%) and 2 out of 2274 (0.09%) cells in donors 1 and 2, respectively, that were likely CMV-specific based on IFNG expression and overall transcription levels. This frequency of occurrence agrees with that observed by cytokine flow cytometry, enzyme-linked immunospot assay (ELISPOT), and tetramer assay (32). Among those cells, there was a substantial amount of heterogeneity in terms of combinations and levels of effector molecules (e.g., granzymes) and cytokines (e.g., IFNG, IL2, CCL3, CCL4, TNF, CSF2, and IL4) expressed (fig. S15). Interestingly, the most transcriptionally active cell in donor 2 did not express IFNG but expressed IL6 and IL1B, inflammatory cytokines that are usually produced by monocytes and macrophages and might represent a unique rare subtype of CD8+ T cell.

## Discussion

We present here highly scalable mRNA cytometry that uses a recursive Poisson strategy to isolate single cells, uniquely barcode cellular content, and index individual molecules for quantitative analysis. We have shown the simultaneous identification and counting of transcript molecules belonging to each cell in a sample of thousands of cells. Further, this technique can be used to characterize individual cells based on their expression profiles in naturally occurring heterogeneous systems and to detect rare cells in a large background population.

CytoSeq offers advantages over existing single-cell approaches using microtiter plates or commercial microfluidic chips. Because the experimental procedure does not require expensive instrumentation, and reagent consumption per cell is low (in the nanoliter range), one can readily carry out single-cell analysis on large numbers of cells across multiple conditions. In this study, we performed single-cell gene expression measurements of ~15,000 cells across 12 experiments, which would be costly and time-consuming if attempted by existing methods. The number of cells measured by CytoSeq can be readily scaled to 10,000s or 100,000s by increasing the size of the microwell array and the library size of the barcoded beads. We calculate that the consumable cost (i.e., barcoded beads, microwells, enzymes, and primers) required for a 10,000-cell experiment is at least two to three orders of magnitude lower (less than $1 per cell) than current commercial microfluidics approaches. In addition, the method used for isolating single cells in CytoSeq does not impose a restriction on the uniformity of cell sizes, thus allowing direct analysis of complex samples of heterogeneous cell size and shape, such as PBMCs. Although our experiments focused on the hematopoietic system, solid samples can also be analyzed using mechanical or enzymatic tissue-dissociation methods well established for flow cytometry. Single-cell transcriptome analysis is a powerful approach for characterizing and understanding cellular diversity. In addition, CytoSeq complements and expands the capabilities of fluorescence or mass spectrometry–based cytometry (33, 34). It increases versatility in terms of the numbers and types of gene products studied. Unlike flow cytometry, which is largely restricted to surface proteins with high-affinity antibodies, CytoSeq detects any transcribed mRNA without the limitations of antibody availability. In addition, the high sensitivity and multiparametric features of CytoSeq enable rare cell characterization on small samples with insufficient cells for traditional flow cytometry. Finally, sequencing in CytoSeq provides nucleotide precision along with quantitative gene expression information, permitting qualitative examinations of genetic structure across individual cells, such as mutations or variants, T cell receptors (35), or immunoglobulins. Because all polyadenylated RNAs are captured and archived as covalently attached cDNAs on the beads, one can elect to study any arbitrary set of genes or repeat the analysis with other sets of genes. Whole transcriptome sampling is also possible by employing universal cDNA amplification techniques (3639) or amplification-free single molecule DNA sequencing (4042). However, it is important to consider the sequencing depth required for routine characterization of large number of cells across the whole transcriptome. Assuming ~200,000 transcript molecules per cell (considering only the polyadenylated portion of the transcriptome), each cell would require ~2 million reads for a 10-fold coverage. Thus, to examine 1000 cells per sample, upwards of ~2 billion reads are required—equivalent to the current output of the highest-capacity next-generation sequencer (HiSeq. 2500), of which a single run can cost more than$15,000 and requires 6 days to complete. In contrast, the same number of cells assayed across ~100 genes requires only a small sequencing run (1 to 10 million reads, less than \$1000, and a 1-day runtime). Shallow whole transcriptome sequencing (20,000 to 500,000 reads per cell), surveying several hundred to slightly over a thousand of the highly abundant genes is an alternative, as it has been shown to be sufficient to differentiate cell types in some samples (5, 43). The trade-offs of between depth of sequencing and differential gene expression of single cells have recently been discussed (44). As sequencing costs continue to fall, whole-transcriptome implementation of CytoSeq will become more affordable to a larger number of researchers to explore larger numbers of cells and conditions.

The routine availability of gene expression cytometry will help transform our understanding of cellular diversity in complex biological systems and drive novel clinical applications, such as circulating tumor cell analysis, diagnostics of immune disorders and infectious diseases, monitoring of immunotherapy and vaccinations, and therapeutic development. The massively parallel single-cell barcoding strategy described here may be applied to assay other biological molecules, including other RNAs, genomic DNA, and the genome and the transcriptome together.

## Materials and methods

### Fabrication of microwell array

Microwell arrays were fabricated using standard photolithography. An array (~35 by 15 mm) containing ~150,000 micropillars were patterned with SU-8 on a silicon wafer. Polydimethylsiloxane (PDMS) was poured onto the wafer to create arrays of microwells. Replicas of the wafer were made with NOA63 optical adhesive using the PDMS microwell array as template. Agarose (5%, type IX-A, Sigma) microwell arrays were cast from the NOA63 replica before each experiment. For experiments described here, a subsection of the full array was cut and used, ranging from ~25,000 to 100,000 wells (table S3). The size of the microwell array can be increased simply by increasing the total area of the microwell array pattern on the lithography mask and fabricated with the same steps above. For instance, a 3′′ by 2′′ microscope slide can hold ~1.4 million wells, for capturing ~140,000 single cells.

Beads were manufactured by Cellular Research, Inc., using a split-pool combinatorial approach. Briefly, 20-μm-diameter magnetic beads functionalized with carboxyl groups (Spherotech) were distributed into 96 tubes. Using carbodiimide chemistry, a 5′ amine modified oligonucleotide bearing a universal PCR priming sequence, a unique eight-nucleotide cell label, and a common linker were coupled to the beads in each tube. Conjugated beads from all tubes were then pooled and split into a second set of 96 tubes for annealing to template oligonucleotides bearing the complement to the common linker, another eight-nucleotide cell label, and a new common linker. After enzymatic polymerization, the beads were again pooled and split into a third set of 96 tubes for annealing to oligonucleotides bearing oligo(dA)17 on the 5′ end, followed by a randomly synthesized eight-nucleotide sequence that serves as the molecular index, a third eight-nucleotide cell label, and a complementary sequence to the second linker. After enzymatic polymerization, the beads were pooled to derive the final library. Each resulting bead is coated with tens to hundreds of millions of oligo-(dT) oligonucleotides of the same clonally represented cell label (884,736 or 96 × 96 × 96 possible barcodes) and a molecular indexing diversity of 65,536 (48). The library size is increased exponentially by linearly increasing the diversity at each step of synthesis.

### Sample preparation

K562 and Ramos cells were cultured in RPMI-1640 with 10% fetal bovine serum (FBS) and 1x antibiotic-antimycotic. Primary B cells from a healthy donor were purchased from Sanguine Biosciences. PBMCs from a healthy donor were isolated from fresh whole blood in sodium heparin tubes acquired from the Stanford Blood Center using Lymphoprep solution (StemCell).

### T cell stimulation

Heparinized whole blood of two CMV seropositive blood donors was obtained from the Stanford Blood Center. For CMV stimulation, 1 ml of whole blood was stimulated with CMV pp65 peptide pool diluted in phosphate-buffered saline (PBS) (Miltenyi Biotec) at a final concentration of 1.81 μg/ml for 6 hours at 37°C. A separate sample of whole blood from each donor was incubated with PBS as negative controls. CD8+ T cells were isolated using RosetteSep cocktail (StemCell) and subsequently deposited onto microwell arrays. For stimulation with antibodies to CD3 and CD28, T cells from the same two donors were isolated from whole blood using RosetteSep T cell enrichment cocktail and resuspended in RPMI-1640 with 10% FBS and 1x antibiotic-antimycotic. One sample of cells from each donor was incubated with Dynabeads Human T-Activator CD3/CD28 (Life Technologies) at ~1:1 bead-to-cell ratio at 37°C for 6 hours. A separate sample of cells from each donor was placed in the incubator without stimulation to serve as negative controls.

### Single-cell capture

Cell density was measured by hemocytometer counting (table S3) and adjusted to achieve ~1 captured cell per 10 or more microwells. The cell suspension was pipetted onto the microwell array and allowed to settle by gravity. Cell filling of microwells was confirmed by microscopy. Cells that settled on the surface in-between wells (~77% of the total surface area under current design) were removed and could be saved for future use. The bead library was then loaded at a density of ~5 beads per well to saturate all wells. Excess beads were washed away and cold lysis buffer (0.1 M Tris-HCl pH 7.5, 0.5 M LiCl, 1% LiSDS, 10 mM EDTA, and 5 mM dithiothreitol) was pipetted over the surface of the microwell array. After 10 min of incubation on a slide magnet, the lysis buffer covering the array was removed and replaced by fresh lysis buffer. Beads with captured mRNAs were retrieved by placing the magnet on top of the microwell array. Beads in solution were collected into a microcentrifuge tube by pipetting and washed twice in the tube with wash A buffer (0.1 M Tris-HCl, 0.5 M LiCl, and 1 mM EDTA) and once with wash B buffer (20 mM Tris-HCl pH 7.5, 50 mM KCl, and 3 mM MgCl2). Under current implementation without the use of automation, this process takes up to 1.5 hours with two samples processed in parallel.

### cDNA synthesis

Washed beads were resuspended in 40 μl RT mix (Life Technologies, 1x First Strand buffer, 20 units SuperaseIN RNase Inhibitor, 200 units SuperScript II or SuperScript III, 3 mM additional MgCl2, 1 mM dNTP, and 0.2 μg/μl BSA) in a microcentrifuge tube rotated at 16 revolutions per minute in an oven at 50°C for 50 min (when using SuperScript III for the early experiment with K562 and Ramos cells) or 42°C for 90 min (when using Superscript II for all other experiments). After cDNA synthesis, excess oligonucleotides on the beads were removed by treatment with 20 units of ExoI (NEB) in 40 μl of 1x ExoI buffer at 37°C for 30 min and then inactivated at 80°C for 15 min.

### Multiplex PCR and sequencing

Gene sequences were retrieved from RefSeq. Each marker panel consists of two sets of gene-specific primers designed using Primer3. MATLAB was used to select PCR primers with minimal 3′ end complementarity within each set (table S1). The amplification scheme is shown in fig. S1. PCR was performed on the beads with the KAPA Fast Multiplex Kit, using 50 nM of each gene-specific primer in the first primer set and 400 nM universal primer in 50 μl (for K562 and Ramos cell mixture), 100 μl (for PBMC and B cell experiments) or 200 μl (for T cell experiments), with the following cycling protocol: 3 min at 95°C; 15 cycles of 15 s at 95°C, 60 s at 60°C, 90 s at 72°C; and 5 min at 72°C. The increase in PCR volume was to mitigate the inhibitory effect of the iron on the magnetic beads. The beads were recovered using a magnet, and PCR products were purified with 0.7x Ampure XP (Beckman Coulter). Half of the purified products were used for the next round of nested PCR, with the second primer set using the same KAPA kit and cycling protocol. After clean-up with 0.7x Ampure XP, one-tenth of the product was input into a final PCR reaction whereby the full-length Illumina adaptors were appended (1x KAPA HiFi Ready Mix, 200 nM of primer P5, 200 nM of primer P7; 95°C 5 min; eight cycles of 98°C 15 s, 60°C 30 s, 72°C 30 s; 72°C 5 min). Sequencing was performed on the Illumina MiSeq instrument with 150x2 base pair chemistry at a median depth of 1.6 million reads per sample (table S2).

### Data analysis

The cell label, the molecular index, and the gene identity were detected on each sequenced read (fig. S1). Gene assignment for the second paired read (read 2) was performed using the alignment software “bowtie2” (45) with default settings. Cell labels and molecular indices on the first paired read (read 1) were analyzed using custom MATLAB scripts. Only reads perfectly matching the combinatorial cell barcodes were retained, but this requirement may be further relaxed since the cell barcodes were designed to enable error correction (46). Reads were grouped first by cell label, then by gene identity and molecular index. To calculate the number of unique molecules per gene per cell, the molecular indices of reads of the same gene transcript from the same cell were clustered. Edit distance greater than 1 nucleotide was considered as a unique cluster, and thus a unique transcript molecule. A table containing the digital gene expression profile of each cell was constructed for each sample; each row in the table represents a unique cell, each column represents a gene, and each entry in the table represents the count of unique transcript molecules for that gene in any given cell. The table was filtered to remove unique molecules that were sequenced only once (i.e., redundancy = 1). For the experiment with mixture of K562 and Ramos cells, cells with 30 or more total unique molecules were retained for clustering. For the rest of the experiments, cells with a sum of 10 or more unique molecules or with coexpression of four or more genes in the panel were retained for clustering. The filtered table was then used for clustering analysis. PCA and hierarchical clustering were performed on natural log-transformed transcript count (with pseudo-count of 1 added) with built-in functions in MATLAB.

### Measurement of GAPDH copy number in single Ramos cell using alternative methods

In the first method, total RNA from Ramos cells was extracted by RNeasy Mini Kit (Qiagen) and quantified by Nanodrop. Serial dilutions down to 7 pg were prepared and loaded onto the 12.765 Digital Array (Fluidigm) with 1x EXPRESS SuperScript quantitative PCR mix (Life Technologies), 1x EXPRESS SuperScript enzyme mix (Life Technologies), 1x GAPDH FAM assay (ABI), 1x Loading Reagent (Fluidigm), and 1x ROX dye. The array was analyzed on the BioMark (Fluidigm) with the following protocol: 50°C for 15 min, 95°C for 2 min, and 35 cycles of 95°C for 15 s and 60°C for 60 min. GAPDH was measured to be 34 copies per pg of total RNA. In the second method, a Ramos cell suspension was diluted in PBS to about 1 cell per 10 microliters. A microliter of suspension was pipetted into multiple 0.2-ml tubes. The presence of a single cell in a tube was confirmed by microscopy. GAPDH counts in the single cells were determined using the method outlined in Fu et al. (15). Measurements were obtained from eight cells. An average of 214 ± 36 (SE) copies (range 113 to 433 copies) was obtained per cell. GAPDH counts per spiked Ramos cell (18 single cells) were compared to these two methods to evaluate RNA detection efficiency.

## Supplementary Materials

www.sciencemag.org/content/347/6222/1258367/suppl/DC1

Figs. S1 to S15

Tables S1 to S3

References

Supporting data

## References and Notes

1. Acknowledgments: We thank M. Simbirsky for primer design; F. Lee, J. Wilhelmy, and K. Cordes-Metzler for technical assistance; and M. Davis for reviewing the manuscript. G.K.F. and S.P.A.F. have been issued US Patent 8,835,358 relating to molecular counting and G.K.F., S.P.A.F., and C.F. have submitted patent applications (US61/952,036, US62/012,237, US61/871,232, US14/472,363, and PCT/US2014/053301) relating to the work described. Patents are held by Cellular Research, Inc. CytoSeq data are available in the supplementary materials. CytoSeq reagents (barcoded beads and microwell arrays) are available from Cellular Research, Inc. Raw sequence data are available at www.cellular-research.com.
View Abstract