Research Article

Molecular recordings by directed CRISPR spacer acquisition


Science  29 Jul 2016:
Vol. 353, Issue 6298, aaf1175
DOI: 10.1126/science.aaf1175

Structured Abstract

INTRODUCTION

Although recent advances in DNA synthesis and sequencing technologies have made practical the writing and readout of arbitrary data in the form of synthetic DNA, still lacking are the robust tools necessary to generate a dynamic record of such information within the genomes of living cells. An in vivo system, built out of biological parts with large storage capacity, would enable the recording of defined biological events into stable genetic memory and facilitate the tracking of long molecular and cellular histories.

RATIONALE

The CRISPR (clustered regularly interspaced short palindromic repeats)–Cas system is a prokaryotic type of immunological memory. Foreign DNA sequences originating from viral infections are stored within genome-based arrays in the form of short sequences—called spacers—that confer sequence-specific resistance to the invading nucleic acids. These arrays not only preserve the spacer sequences but also record the order in which the sequences are acquired, generating a temporal record of acquisition events. We harnessed this system to record arbitrary DNA sequences into a genomic CRISPR array in the form of spacers acquired from synthetic oligonucleotides electroporated into a population of cells overexpressing the CRISPR adaptation proteins Cas1 and Cas2. This enabled the recording of defined molecular events into a stable genomic locus over time and the storage of arbitrary information across a population of cells.

RESULTS

We show that the Cas1-Cas2 complex can be used in vivo to integrate synthetic DNA of a defined sequence into the Escherichia coli genome. We used this feature to examine the type I-E CRISPR-Cas spacer acquisition process and optimized the synthetic spacer design to achieve higher acquisition efficiency and specific integration orientation through the addition of an AAG protospacer adjacent motif (PAM). We then generated stable genomic recordings of multiple molecular events by electroporating sets of oligonucleotides over several days. These molecular records were read out with high-throughput sequencing and then decoded with a program that identified and faithfully reconstructed the temporal event order.

Last, we used directed evolution to generate many Cas1-Cas2 mutants with modified PAM specificity (PAMNC). By modulating expression of these mutant and wild-type Cas1-Cas2 complexes, we could dynamically control the orientation of spacer integration. This enabled us to record acquisition events in multiple modes. That is, information was encoded in both the temporal order of the spacers and the orientation in which they were integrated.

CONCLUSION

Our results establish a recording system that uses the nucleotide content, temporal ordering, and orientation of defined DNA sequences within a CRISPR array in order to encode arbitrary information within the genomes of a population of cells. Because information can be encoded in spacer nucleotide space (up to two bits per base) and in alternate modes, the system has the potential to record and permanently store higher capacities of information than any other synthetic biological system to date. This lays the foundation for an in vivo recording device that could be coupled with diverse molecular phenomena and used for applications that require tracing of long molecular histories. We also demonstrate that delivery of synthetic DNA substrates to a CRISPR-Cas adaptation system in vivo is a practical method to probe and adapt the system.

Two modes of encoding information into the CRISPR locus.

(A) Oligonucleotides containing an AAG PAM and 32 variable bases were electroporated into cells overexpressing Cas1-Cas2 and inserted into the genomic CRISPR array. Delivery of oligos with distinct sequence over time generates a molecular record. (B) Cas1-Cas2 mutants identified through directed evolution alter the orientation of acquisition. Varying expression ratios of wild-type and mutant Cas1-Cas2 over time generates a record encoded in spacer orientation.

Abstract

The ability to write a stable record of identified molecular events into a specific genomic locus would enable the examination of long cellular histories and have many applications, ranging from developmental biology to synthetic devices. We show that the type I-E CRISPR (clustered regularly interspaced short palindromic repeats)–Cas system of Escherichia coli can mediate acquisition of defined pieces of synthetic DNA. We harnessed this feature to generate records of specific DNA sequences into a population of bacterial genomes. We then applied directed evolution so as to alter the recognition of a protospacer adjacent motif by the Cas1-Cas2 complex, which enabled recording in two modes simultaneously. We used this system to reveal aspects of spacer acquisition, fundamental to the CRISPR-Cas adaptation process. These results lay the foundations of a multimodal intracellular recording device.

DNA has the potential to encode, preserve, and propagate information (1). The precipitous drop in DNA sequencing cost has now made it practical to read out this information with high throughput (2). However, the ability to write arbitrary information into DNA, in particular within the genomes of living cells, has been restrained by a lack of biologically compatible recording systems that can exploit anything close to the full encoding capacity of nucleic acid space.

A number of approaches aimed at recording information within cells have been explored (3). These systems can be broadly divided into those that alter transcription through feedback loops and toggles (4–14) and those that encode information permanently into the genome, most often using recombinases to store information via the orientation of DNA segments (15–19). Although the majority of these systems are effectively binary, efforts have also been made toward analog recording systems (20) and digital counters (21). Despite these efforts, the recording and genetic storage of much more than a single byte of information (18) has remained out of reach.

Immunological memory is essential to an organism’s adaptive immune response and hence must be an efficient and robust form of recording molecular events in living cells. The CRISPR-Cas system is a recently understood form of adaptive immunity used by bacteria and archaea (22). This system records past infections by storing short sequences of viral DNA within a genomic array. These acquired sequences are referred to as protospacers in their native viral context and as spacers once they are inserted into the CRISPR (clustered regularly interspaced short palindromic repeats) array. New spacers are integrated into the CRISPR array ahead of older spacers (23). Over time, a long record of spacer sequences can be stored in the genomic array, arranged in the order in which they were acquired. Thus, the CRISPR array functions as a high-capacity temporal memory bank of invading nucleic acids.

We harnessed the CRISPR-Cas system to record specific and arbitrary DNA sequences into a bacterial genome. We could generate a record of defined sequences, recorded over many days and in multiple modalities. In exploring this system, we also elucidated fundamental aspects of native CRISPR-Cas spacer acquisition and leveraged this knowledge to enhance the recording system.

A type I-E CRISPR-Cas system accepts synthetic spacers in vivo

Overexpression of the Escherichia coli type I-E CRISPR-Cas proteins Cas1 and Cas2 is sufficient to drive acquisition of new spacers in a strain containing two genomic CRISPR arrays but lacking endogenous Cas proteins (BL21-AI) (23). We replicated this result (Fig. 1A) and similarly found that new spacers were consistently integrated into the first position of array I directly adjacent to the leader with a consistent size of 33 bases (fig. S1, A and B). These spacers were drawn in roughly equal number from the cell’s own genome and from the plasmid used to overexpress Cas1 and Cas2 (Fig. 1B). Considering the overall DNA content of the cell, this ratio of genome-to-plasmid–derived spacers represents a substantial bias toward the plasmid as a protospacer source (24). Despite this bias, new spacers were drawn from a diverse range of sites around the genome and plasmid (Fig. 1C) and, besides the overrepresentation of a 5′ AAG protospacer adjacent motif (PAM), there was no way to predict a priori the full sequence of a new spacer without sequencing the expanded array.

Fig. 1 Acquisition of synthetic spacers.

(A) Schematic of the minimal elements of the type I-E CRISPR acquisition system used, including Cas1, Cas2, and array with leader (L), repeat (R), and spacer (S) along with PCR detection of an expanded array after the overnight induction of Cas1-Cas2. (B) Origin of new spacers (plasmid or genome), mean ± SEM. (C) Genome- and plasmid-derived spacers after overnight induction are mapped back to the approximate location of their protospacer (marked in red). (D) Array expansion (top) and specific acquisition of synthetic oligo protospacer (bottom) after electroporation. Top schematic shows the experimental outline. Schematics under each gel show specific PCR strategy. (E) Sequence-specific acquisition in either the forward (top) or reverse (bottom) orientation after electroporation with various single- and double-stranded oligos. 5′PT indicates phosphorothioate modifications to the oligos at the 5′ ends. (F) Time course of expansion after electroporation, mean ± SEM. (G) Percent of arrays expanded by spacer source as a function of electroporated oligo concentration, mean ± SEM. (H) Position of new spacers relative to the leader, mean ± SEM. (I) Size of new spacers in base pairs, mean ± SEM. All gels are representative of ≥3 biological replicates; * P < 0.05. Additional statistical details are provided in table S1.

To extend the function of the CRISPR acquisition system into a synthetic device for recording molecular events, it is necessary to direct the system to capture spacers of specific, defined sequence. In vitro, Cas1 and Cas2 can mediate integration of synthetic 33–base pair (bp) DNA oligos into plasmid-based arrays (25). We reasoned that similarly supplying an exogenous source of protospacers to the system within a cell might direct sequence-specific spacer acquisition in vivo. We therefore passaged an overnight culture of E. coli BL21-AI containing arabinose- and isopropyl β-D-1-thiogalactopyranoside (IPTG)–inducible Cas1 and Cas2 genes with or without arabinose and IPTG for 2 hours. We then electroporated the cells with a complementary pair of 33-base oligos (protospacer ps33), which matched the sequence of the most abundant M13-derived spacer found after phage infection of a native type I-E system (26). After incubating the cells for another 2 hours after transformation, we checked the genomic array for expansion and specific integration of the synthetic protospacer into the array by means of polymerase chain reaction (PCR) (Fig. 1D). By using the reverse sequence of the supplied oligo as the reverse primer, we also observed amplification of specifically sized PCR products that confirmed acquisition of the oligo-supplied sequence when Cas1 and Cas2 were induced or (more weakly) uninduced, but never for the case in which the oligos were not supplied. We confirmed that the specific ps33 nucleotide sequence was present within a fraction of the expanded arrays by means of Sanger sequencing. These results demonstrate that the CRISPR-Cas system acquired a sequence-specific spacer.

To better understand both the properties of this synthetic system as well as the fundamental properties of Cas1-Cas2–mediated spacer acquisition, we altered the oligos that we provided via electroporation. The system required both complementary strands for acquisition, and the double-stranded protospacer could insert in either direction (Fig. 1E). We modified the 5′ ends of the oligos with phosphorothioate bonds to help resist degradation by cellular nucleases but found no differences in acquisition efficiency (Fig. 1E). We tested whether RNA could serve as a protospacer by supplying either one or both of the oligo strands as RNA but detected no sequence-specific integration of RNA oligos (fig. S1D).

To investigate these results more quantitatively, we performed a PCR across the array (as in Fig. 1D) and subjected the resulting amplicon to high-throughput sequencing on an Illumina MiSeq platform. We quantified the percentage of all arrays that were expanded at the completion of an experiment, as well as the spacer source. Coupled with quantitative PCR, we generated a time course of spacer acquisition (Fig. 1F). Sequence-specific acquisitions occurred as early as 20 min after electroporation, reaching ~4% of all arrays by 2 hours. The oligo concentration required to achieve spacer acquisition was determined by testing a twofold dilution series (Fig. 1G and fig. S1E). Whether oligos were delivered or acquired as spacers had no effect on the acquisition frequency of genome- or plasmid-derived spacers. Thus, protospacer availability in the cell may be a limiting factor in spacer acquisition. On the other hand, the presence of an additional CRISPR array on the expression plasmid had little to no effect on the acquisition frequency of new spacers into the endogenous genomic array (Fig. 1G). Like genome- and plasmid-derived spacers, the synthetic spacers were inserted into the first (or occasionally first and second) positions of the array, and the great majority were of 33 bases (Fig. 1, H and I). Loss of previously acquired spacers has been reported both in the presence (27, 28) and absence (29, 30) of selective pressure. Although our analysis was restricted to the leader-proximal spacers, we did find rare instances in which the previous first spacer was deleted (0.096% of arrays sequenced ±0.012 SEM).
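The quantification described above (splitting each sequenced array into spacers and asking whether a new spacer sits next to the leader) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the toy repeat sequence, and the assumption that each read runs from the leader into the array are all hypothetical and chosen only for illustration.

```python
def split_spacers(read, repeat):
    """Return the complete spacers in a read, leader-proximal first.

    Assumes the read starts in the leader, so the text before the first
    repeat is leader sequence and the text after the last repeat may be
    a truncated spacer; both are discarded.
    """
    pieces = read.split(repeat)
    return pieces[1:-1]


def percent_expanded(reads, repeat, old_first_spacer):
    """Percent of arrays whose leader-proximal spacer is not the pre-existing one."""
    expanded = 0
    total = 0
    for read in reads:
        spacers = split_spacers(read, repeat)
        if not spacers:
            continue  # read does not span a complete spacer
        total += 1
        if spacers[0] != old_first_spacer:
            expanded += 1
    return 100.0 * expanded / total if total else 0.0


# Toy data: a made-up 6-base "repeat" and 6-base "spacers" (not real sequences).
TOY_REPEAT = "GGGCCC"
OLD = "ATATAT"
NEW = "TTTTTT"
toy_reads = [
    "AA" + TOY_REPEAT + OLD + TOY_REPEAT,                      # unexpanded
    "AA" + TOY_REPEAT + NEW + TOY_REPEAT + OLD + TOY_REPEAT,   # expanded
]
print(percent_expanded(toy_reads, TOY_REPEAT, OLD))
```

Real reads would also need quality filtering and tolerance for sequencing errors in the repeat, which this sketch leaves out.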

PAMs modify the efficiency and directionality of spacer acquisition

Data from sequencing millions of expanded arrays showed that genome- and plasmid-derived protospacers were drawn in equivalent numbers from the forward and reverse strands overall, with the only apparent bias being toward the genomic origin of replication (Fig. 2A). Similarly, oligo-derived protospacers were found in equal proportions in the forward and reverse orientation in the array (Fig. 2B). When we further examined the context of the genomic- and plasmid-derived protospacers, we found strong evidence for a PAM on the 5′ end of the protospacer consisting of two adenines at positions –2 and –1 from the spacer and a strong bias for a guanine as the first spacer base (Fig. 2C). This is largely consistent with previous characterizations of the E. coli type I-E system (31, 32). An interior sequence motif at the 3′ end of the spacer termed the acquisition-affecting motif (AAM) has also been reported for this system (31). We found spacer sequences that are consistent with the presence of this interior motif, but the frequency of its occurrence is minor compared with the 5′ PAM.

Fig. 2 PAMs modify the efficiency and orientation of spacer acquisition.

(A) Genome-derived (count/10 kb) and plasmid-derived (coverage/base) spacers mapped to their protospacer location on the forward (purple) or reverse (green) strands. (B) Direction of oligo-derived spacers in the forward (purple) or reverse (green) orientation, mean ± SEM. (C) Representative sequence pLOGO (46) generated based on 896 distinct genome- and plasmid-derived protospacers. Five bases of the protospacer are included at each end of the spacer. (D) Plot of the summed spacer coverage mapped to the plasmid among three replicates at each nucleotide for a 553-nucleotide stretch. Carets demarcate canonical PAMs on the forward (purple) or reverse (green) strand. Scale bar, 33 bases. Individual replicates are shown below. (E) Percent of arrays expanded by spacer source for different oligo protospacers, mean ± SEM. (F) Ratio of oligo-derived spacers acquired in the forward versus reverse orientation for different oligo protospacers, mean ± SEM. (G to J) Normalized representation of oligo-derived spacers by base acquired in the forward and reverse direction for each oligo. Bars in (I) and (J) are 33 bases long to show dominant and minority spacers drawn from the oligo protospacers. For all panels, * P < 0.05. Additional statistical details are provided in table S1.

Although there is no bias in forward- or reverse-strand–derived protospacers from the genome or plasmid on the whole, a sharper picture emerged at the level of individual nucleotides. For example, examining one small stretch of the plasmid (~550 bases), asymmetric peaks of spacer coverage—that is, the cumulative count of each time a given nucleotide was observed within an acquired spacer—emerged (Fig. 2D). Plotting the forward and reverse PAMs along the same stretch of plasmid revealed that in addition to biasing toward specific sequences for acquisition, the PAM also specified the orientation of integration into the array. Although nearly every protospacer that contained a PAM was acquired as a spacer, not all were acquired at the same frequency (Fig. 2D).
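As a rough illustration of how forward and reverse PAMs can be located along a stretch of sequence, the following Python sketch scans both strands for AAG and reports the 33-base candidate spacer that begins at the G of each PAM. The function names are hypothetical, and the sketch assumes (from the PAM description above) that the final PAM base is the first base of the spacer. It predicts candidates only and says nothing about how often each would actually be acquired.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pam_protospacers(seq, pam="AAG", spacer_len=33):
    """List (strand, spacer) for every PAM followed by a full-length spacer.

    The spacer is taken to start at the last base of the PAM, so an AAG PAM
    yields spacers whose first base is G.
    """
    hits = []
    for strand, s in (("forward", seq), ("reverse", reverse_complement(seq))):
        i = s.find(pam)
        while i != -1:
            start = i + len(pam) - 1
            spacer = s[start:start + spacer_len]
            if len(spacer) == spacer_len:  # skip PAMs too close to the end
                hits.append((strand, spacer))
            i = s.find(pam, i + 1)  # step by one to catch overlapping PAMs
    return hits


# Toy sequence with a single forward-strand AAG and no reverse-strand AAG.
toy = "CC" + "AAG" + "C" * 32 + "CC"
print(pam_protospacers(toy))
```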

The presence of Chi sites—an eight-base motif in which double-strand break repair is more likely to occur—within a genome or plasmid biases the frequency of protospacer acquisitions (24). However, we wondered whether the sequence of the protospacer itself might also bias acquisition frequency. We ranked every PAM (AAG)–containing potential protospacer in the plasmid according to the frequency at which it was acquired into the genomic array (fig. S2A). We searched for characteristics among protospacers, including GC percentage and free energy, that might explain the difference in acquisition frequency, but failed to identify a correlation (fig. S2, B and C). For a direct test, we selected and synthesized three protospacer sequences (including their 15-bp flanking regions): one each from the high (psH), middle (psM), and low (psL) end of the frequency spectrum (fig. S2A). We then electroporated each of these oligo protospacers into cells expressing Cas1-Cas2 from an alternate plasmid that did not include these particular sequences. psL was acquired much less frequently than psH or psM (fig. S2F). To determine whether this was caused by the sequence of the spacer itself or a flanking region, we swapped the 15-bp flanking regions of psH with those of psL and vice versa (psH/L and psL/H, respectively). Again, the psL/H spacer was acquired at a lower frequency than was psH/L, independent of the flanking regions. These results indicate that the sequence of the protospacer itself influences the efficiency of acquisition. We do not know, however, the mechanism of this effect, whether by a direct effect on the acquisition process itself or by indirect effects such as sequence-dependent interactions with endogenous nucleotides, competing proteins, or degradation.

Given that spacers are selected from the genome and plasmid according to an adjacent sequence, we wondered whether the inclusion of a PAM in our synthetic protospacer ps33 would alter acquisition frequency. We designed three additional oligo protospacers: psAA33, in which two adenines were included at the 5′ end of ps33 to create the entire canonical AAG PAM; ps10AA33, which includes an additional 10 5′ nucleotides; and ps10TC33, in which the AA of the PAM was mutated to TC to create a noncanonical PAM (PAMNC). Using these oligos, we found that the inclusion of a PAM greatly increased the efficiency of sequence-specific acquisition (Fig. 2E). Whether preceded by 10 extra nucleotides or not, oligos with the AAG PAM (psAA33 and ps10AA33) were acquired at greater than five times the frequency of those that did not include a PAM (ps33). Conversely, including the TCG PAMNC did not change acquisition frequency relative to ps33 (Fig. 2E).

In line with previous observations that the PAM in CRISPR adaptation is consistently localized to the leading rather than trailing end of the integrated spacer (24, 31, 33–36), the inclusion of a PAM also altered the orientation frequency of oligo-derived spacer acquisition. Whereas ps33 and ps10TC33 were acquired equally in both orientations, psAA33 and ps10AA33 were acquired almost exclusively in the forward orientation (Fig. 2, F to J, and fig. S3A). Consistent with the type I-E preference for an AAG PAM, psAA33 and ps10AA33 were consistently inserted with nucleotide G1 as the first base of the spacer (Fig. 2, H and I). In contrast, ps10TC33 lacked a single dominant spacer product and was inserted at several different PAMsNC (Fig. 2J). We verified that both Cas1 and Cas2 were necessary for synthetic spacer integration, whereas Cas2 nuclease activity was not required (fig. S3, B and C) (25). Therefore, the inclusion of a PAM in synthetic protospacers dictates both the efficiency and orientation of the spacer that is acquired by the Cas1-Cas2 complex.

A molecular recording over time

We tested whether we could harness the acquisition of specific spacer sequences to record a series of synthetic spacers into a population of cells over time. As an initial test, we recorded three unique elements (1 × 3) into a single culture of E. coli by sequentially electroporating a series of three different oligo protospacer sequences into the culture, over a period of 3 days (one protospacer each day) (fig. S4A). After sequencing a population of the arrays on day 3, we could reconstruct the order in which the spacers were delivered (fig. S4, B and C, and materials and methods). To further probe the limits of this system, we recorded 15 distinct elements (3 × 5): three sets of five protospacers, electroporated three at a time over 5 days (Fig. 3A). The analyses of the 1 × 3 and 3 × 5 recordings are conceptually similar, so we will discuss the latter in detail (fig. S4B and Fig. 3B, respectively).

Fig. 3 A molecular recording over time.

(A) Experimental outline of the 3 × 5 recording. Over 5 days, three sets of five oligo protospacers (15 elements) were electroporated (one protospacer from each of the three sets each day) into cells expressing Cas1-Cas2. Time points at which cells were sampled for sequencing are numbered 1 to 6. (B) Schematic illustrating all possible pairwise ordering of new spacers. G/P denotes a spacer derived from the genome or plasmid. Ordering rules are shown below. In the case of y = z, asterisk indicates a tolerance within ±20% of the mean of both values. (C) At each of the six sample points [marked in (A)], percent of all arrays expanded with synthetic spacers from each of the indicated rounds, mean ± SEM. (D) Single, double, and triple expansions for each round, mean ± SEM. (E) Percent of all expansions at sample point six, broken down by electroporation round and set. Open circles are individual replicates; filled bars are mean ± SEM. (F) Results of ordering rule analysis for one replicate across each set. For all 120 permutations, results of the tested rule are shown (green indicates pass, red indicates fail). For all sets, only one permutation passed all rules and in every case that permutation matched the actual order in which the oligos were electroporated (as indicated by check mark). Additional statistical details are provided in table S1.

For the 3 × 5 recording, all oligo protospacers consisted of 35 nucleotides, beginning with a 5′ AAG PAM followed by a five-base barcode (specific to each of the three sets) and 27 more bases (specific to each of the 15 protospacers). At the end of the 3 × 5 recording, nearly a quarter of all arrays in the cell population contained at least one oligo-derived spacer, with spacers from each round of electroporation represented in roughly equivalent proportions (Fig. 3, C and D). Individual variations among the spacer acquisition frequency were more heavily driven by spacer nucleotide sequence than by the round in which they were acquired (Fig. 3E), whereas loss of recorded spacers after acquisition was rare (0.076% ± 0.182 SEM).
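The 35-nucleotide oligo layout described above (AAG PAM, five-base set barcode, 27 element-specific bases) can be written down as a short Python sketch. The function names and the example barcode are hypothetical. The parser assumes that the acquired spacer is the 33 bases starting at the G of the PAM, which follows the earlier observation that PAM-containing oligos insert with G as the first spacer base.

```python
PAM = "AAG"
BARCODE_LEN = 5    # set-specific barcode
PAYLOAD_LEN = 27   # element-specific bases
SPACER_LEN = 1 + BARCODE_LEN + PAYLOAD_LEN  # G of the PAM + barcode + payload = 33


def build_protospacer(barcode, payload):
    """Assemble a 35-nt oligo: AAG PAM, then barcode, then payload."""
    if len(barcode) != BARCODE_LEN or len(payload) != PAYLOAD_LEN:
        raise ValueError("barcode must be 5 bases and payload 27 bases")
    if set(barcode + payload) - set("ACGT"):
        raise ValueError("only A, C, G, T are allowed")
    return PAM + barcode + payload


def parse_spacer(spacer):
    """Split a 33-base spacer into (barcode, payload), or None if it does not fit.

    Assumes the spacer begins with the G contributed by the AAG PAM.
    """
    if len(spacer) != SPACER_LEN or spacer[0] != "G":
        return None
    return spacer[1:1 + BARCODE_LEN], spacer[1 + BARCODE_LEN:]


oligo = build_protospacer("ACGTA", "C" * PAYLOAD_LEN)  # made-up barcode and payload
expected_spacer = oligo[2:]  # drop the two adenines that precede the spacer
print(len(oligo), parse_spacer(expected_spacer))
```

At two bits per base, the 27 element-specific bases in this layout could in principle carry up to 54 bits, although the sequence-dependent acquisition efficiency noted earlier suggests not every sequence would be recorded equally well.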

Because of the low probability of acquiring spacers from every round in any single array (Fig. 3D), successful readout of the recording required analysis of a population of arrays. Therefore, we sequenced the first three spacers of each array (moving in from the leader) and considered only the order of pairs of newly acquired spacers (Fig. 3B). For any given synthetic spacer pair within the same set, the order should follow a predictable rule: Among all arrays that contain any two new spacers, a spacer electroporated in an earlier round will always be found further from the leader than a spacer introduced at a later round. We also gained information by considering the arrangement of oligo-derived spacers in relation to newly acquired genome- and plasmid-derived spacers. Because the endogenous spacers will accumulate over time, synthetic spacers from an earlier round will be paired more often with a new genome/plasmid spacer in one direction (toward the leader) than in the other (relative to the synthetic spacer), and vice versa for oligo-derived spacers from a later round. With five possible spacers (in each set), we considered all possible pairwise comparisons and generated 15 ordering rules from which we can reconstruct the order of the entire set (Fig. 3B). We took the sequences of arrays after the completion of the 3 × 5 recording and passed them through an algorithm that, with the only sequence-based input being the sequence of the CRISPR repeat, would predict all oligo-derived spacer sequences, assign them to a set according to the barcodes, and then test all possible permutations of the sequence against the 15 ordering rules. For each set, only one permutation satisfied all 15 ordering rules, and in every case, that permutation matched the actual order of electroporated oligos (Fig. 3F). Although we analyzed ~2 million reads for each replicate, we found that order could be correctly reconstructed in most cases with 20,000 reads or fewer. Thus, we could reliably record and read out the 15-element recording.
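The pairwise ordering idea can be sketched in Python as follows. This is a simplified illustration and not the authors' algorithm: it implements only the rule that compares two synthetic spacers from the same set (the later-round spacer should sit closer to the leader) and omits the rules involving genome- and plasmid-derived spacers and the ±20% tolerance. The function names and the single-letter spacer labels are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def pair_counts(arrays):
    """Count, for each ordered pair (x, y), the arrays in which x is closer to the leader than y.

    Each array is a list of spacer labels, leader-proximal first.
    """
    counts = Counter()
    for spacers in arrays:
        for i, j in combinations(range(len(spacers)), 2):
            counts[(spacers[i], spacers[j])] += 1
    return counts


def consistent_orders(arrays, elements):
    """Return every delivery order (earliest first) that passes all pairwise rules.

    Rule for a pair (earlier, later): arrays with `later` closer to the leader
    must outnumber arrays with `earlier` closer to the leader.
    """
    counts = pair_counts(arrays)
    passing = []
    for order in permutations(elements):
        if all(counts[(later, earlier)] > counts[(earlier, later)]
               for earlier, later in combinations(order, 2)):
            passing.append(order)
    return passing


# Toy population: A was delivered first, then B, then C.
toy_arrays = [["C", "B"], ["B", "A"], ["C", "A"], ["C", "B", "A"]]
print(consistent_orders(toy_arrays, "ABC"))
```

With five elements per set the loop tests 120 permutations, which is small enough that brute-force enumeration is a reasonable choice here.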

Cas1-Cas2 PAM recognition can be modified

The ability to control not only the sequence of new spacers but also the orientation of new spacer integration would enable recording of information in multiple modalities simultaneously. Because the addition of a 5′ AAG PAM on our synthetic spacers controlled the orientation of new acquisitions (Fig. 2F), we sought to modify integration orientation by altering PAM recognition of Cas1-Cas2. To do this, we performed the directed evolution approach shown in Fig. 4A. First, we generated a large library of random Cas1-Cas2 mutants by means of error-prone PCR (fig. S5, A and B) and inserted this library into a plasmid upstream of a minimal CRISPR array. After cloning the plasmid library into BL21-AI, we induced and transformed mutants with a protospacer bearing the canonical 5′ AAG PAM on the forward strand and a noncanonical 5′ TCG PAMNC on the reverse strand. After outgrowth, we selected mutants using a forward primer ahead of the Cas1-Cas2 mutant genes and a reverse primer matching the PAMNC spacer sequence in order to yield specific amplification of only those mutants that had acquired the spacer in the (reverse) PAMNC orientation. A subset of these selected mutants were then tested for PAM specificity, and a separate subset were subjected to another round of selection for refinement before testing. For testing, individually selected mutant clones were induced overnight, and their expanded arrays were analyzed by means of sequencing. Specifically, we analyzed the PAMs of all the genome- and plasmid-derived spacers to determine what, if any, PAM specificity remained. Wild-type Cas1-Cas2 acquires spacers from AAG PAM protospacers at nearly the same frequency as from all other (non-AAG) PAM protospacers combined (Fig. 4B). In contrast, the majority of mutants we selected acquired non-AAG protospacers at a greater frequency than that of AAG protospacers (Fig. 4B). There was no gain in non-AAG acquisition frequency from the extra step of refinement (fig. S5C), so mutants from both subsets are shown together (Fig. 4B and fig. S5D).
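The comparison of AAG and non-AAG acquisitions amounts to a normalized tally, sketched below in Python. The function name and inputs are hypothetical: the sketch assumes the three-base PAM of each new spacer's protospacer has already been looked up, and it scales counts per 100,000 sequenced arrays.

```python
def pam_counts_per_100k(pams, total_sequences, canonical="AAG"):
    """Return (canonical, noncanonical) PAM counts scaled per 100,000 sequences.

    `pams` holds one three-base PAM per newly acquired spacer;
    `total_sequences` is the number of arrays sequenced for that clone.
    """
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    canonical_count = sum(1 for pam in pams if pam == canonical)
    noncanonical_count = len(pams) - canonical_count
    scale = 100_000 / total_sequences
    return canonical_count * scale, noncanonical_count * scale


# Toy example: four new spacers observed among 200,000 sequenced arrays.
print(pam_counts_per_100k(["AAG", "AAG", "TCG", "AAT"], 200_000))
```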

Fig. 4 Directed evolution of PAM recognition.

(A) Schematic of the directed evolution. (B) Testing of selected mutants, plotting 5′ AAG versus non-AAG PAM protospacers normalized to count per 100,000 sequences. Scatter plot shows 65 induced mutants (open black circles), three induced wild-type replicates (open green circles), an uninduced wild type (open red circle), the average of the induced mutants (filled black circle), and the average of the induced wild types (filled green circle) ± SEM. Scatter plot to the right is an inset of the larger plot. (C) Heatmap of protospacer PAM frequency over the entire sequence space for wild-type Cas1-Cas2 (wt), mutants that increase or maintain AAG PAM specificity (m-27 and m-24), and mutants that lose AAG PAM specificity (m-74, m-80, and m-89). Numbers at top right correlate to numbers in (B). (D) A subset of selected mutants reassayed in triplicate as well as a subset of single-point mutants chosen from the original selection. All points are the average of three replicates ± SEM. (E) Crystal structure of Cas1-Cas2 complex bound to a protospacer (38). Inset highlights, in purple, residues in the Cas1 active site that (when mutated) decrease PAM specificity. The protospacer PAM complementary sequence (T30 T29 C28, numbering as in PDB ID 5DQZ) is also noted. Additional statistical details are provided in table S1.

To visualize shifts in PAM specificity, we plotted a heat map showing the normalized frequency of observed PAMs among all potential PAMs for wild-type Cas1-Cas2 and several selected mutants (Fig. 4C). Wild-type Cas1-Cas2 had strong selectivity for the canonical AAG PAM. A minority of mutants also retained (m-24) or even increased (m-27) this preference. However, many more mutants showed reduced or, in the case of the three mutants shown (m-74, m-80, and m-89), nearly no specificity for the canonical PAM. From the sequence of these selected mutants, we chose a subset of single-point mutations for follow-up analysis on the basis of repeated observations in the data set or location in the crystal structure of the Cas1-Cas2 complex (Fig. 4E and table S3) (37–39). Most of the single-point mutants tested in isolation also reduced the PAM specificity compared with that of wild type (Fig. 4D and fig. S5D). These results demonstrate that PAM recognition by the Cas1-Cas2 complex can be modified by many different mutations without drastically reducing spacer acquisition efficiency.

Recording in a second modality

As a proof of concept, we selected a PAMNC Cas1-Cas2 mutant (m-89) (Fig. 4C and fig. S5D) to add an extra modality to the 1 × 3 recording (fig. S4). We subjected bacteria to three sequential rounds of electroporation, with each oligo protospacer containing a 5′ AAG PAM on the forward strand and a 5′ TCG PAMNC on the reverse (Fig. 5A). We controlled expression of wild-type Cas1-Cas2 and m-89 using different inducible promoters (pT7lac and pLTetO, respectively) on the same plasmid (Fig. 5B). We split the bacteria between two conditions, each alternating between T7lac and tet induction from round to round. We found that cells of both conditions acquired spacers from each round at similar frequencies, indicating that transcription and integration activity of the wild-type and m-89 Cas1-Cas2 were both adequate (Fig. 5C). At the completion of the recording, we compared the orientation of each spacer between the two conditions. The ratio of forward- to reverse-oriented spacers shifted toward PAMNC (reverse) during tet induction (Fig. 5, D to F). After normalization for the total spacer orientation ratio for each spacer, we could clearly discriminate which cultures had been exposed to each inducer at each time point on the basis of only the direction of integration (Fig. 5G). Thus, this system can simultaneously record in two modalities.
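One way to picture the orientation readout is the Python sketch below, which compares the forward fraction of each round's spacer between the two cultures and assigns tet induction to whichever culture shows the lower forward fraction. This is only one plausible reading of the normalization described above, not the authors' analysis. The function names and the toy counts are hypothetical.

```python
def forward_fraction(forward, reverse):
    """Fraction of integrations in the forward orientation."""
    return forward / (forward + reverse)


def call_inducers(condition_a, condition_b):
    """Guess the inducer each culture saw in each round from spacer orientation.

    Each condition is a list of (forward_count, reverse_count), one per round.
    The culture with the lower forward fraction for a round is called "tet",
    following the observation that tet induction shifts spacers toward reverse.
    Returns one (call_for_a, call_for_b) tuple per round.
    """
    calls = []
    for (fa, ra), (fb, rb) in zip(condition_a, condition_b):
        a = forward_fraction(fa, ra)
        b = forward_fraction(fb, rb)
        calls.append(("tet", "T7lac") if a < b else ("T7lac", "tet"))
    return calls


# Toy counts for three rounds in two cultures with alternating induction.
culture_1 = [(90, 10), (60, 40), (90, 10)]
culture_2 = [(60, 40), (90, 10), (60, 40)]
print(call_inducers(culture_1, culture_2))
```

Comparing the two cultures spacer by spacer, instead of using one fixed threshold, sidesteps sequence-specific differences in baseline orientation between the three protospacers.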

Fig. 5 Recording in an additional mode.

(A) Outline of the recording process. Three different synthetic protospacers (each containing a 5′ AAG PAM on the forward strand and a 5′ TCG PAM^NC on the reverse) were electroporated over 3 days (one protospacer each day) into two bacterial cultures under different induction conditions (shown below timeline). Sampling time points are numbered 1 to 3. (B) Schematic of the plasmid construct used, showing wild-type and PAM^NC mutant (m-89) Cas1-Cas2 driven by independently inducible promoters (T7lac and pLtetO, respectively). The heat map shows 5′ PAM specificity for wild type (boxed in yellow) and mutant m-89 (boxed in red). (C) At each of the three sample points [marked in (A)], percent of expanded arrays with spacers from each of the indicated rounds for the two conditions, mean ± SEM. (D to F) Ratio of synthetic spacers acquired in the forward versus reverse orientation for each round under each condition, mean ± SEM. (G) Ratio of forward to reverse integrations normalized to the sum of both possible orientations for each of the two conditions, mean ± SEM. For all panels, *P < 0.05. Additional statistical details are provided in table S1.
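
The orientation readout can be illustrated with a short sketch. It assumes one possible reading of the normalization, namely the forward count divided by the sum of forward and reverse counts, and it infers the tet-induced (m-89) culture for a given round as the one with the lower forward fraction, consistent with the shift toward reverse integration described above. The function names and the counts are hypothetical, and the authors' exact normalization may differ.

```python
def forward_fraction(forward, reverse):
    """Forward integrations as a fraction of all oriented integrations.

    Assumption: this is one reading of "normalized to the sum of both
    possible orientations"; it is not confirmed as the authors' formula.
    """
    total = forward + reverse
    if total == 0:
        raise ValueError("no oriented spacers observed")
    return forward / total


def infer_tet_condition(counts_by_condition):
    """Return the condition whose spacers lean most toward reverse orientation.

    counts_by_condition maps a condition name to (forward, reverse) counts
    for the spacer acquired in one round.
    """
    return min(
        counts_by_condition,
        key=lambda name: forward_fraction(*counts_by_condition[name]),
    )


# Hypothetical counts for one round of electroporation.
round_counts = {"condition_1": (900, 100), "condition_2": (600, 400)}
inferred = infer_tet_condition(round_counts)
print(inferred)
```

Repeating the comparison for each round gives one inducer call per round and culture, which is the second recorded modality.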

Discussion

We developed a CRISPR-Cas–based system to record molecular events into a genome in the form of essentially arbitrary synthetic DNA sequences. Although the information is only partially encoded within any given cell, the complete record remains distributed across a population of cells. To read out the recordings, we used high-throughput sequencing and only considered the pairwise order of any two new spacer sequences within single CRISPR arrays. From these many binary comparisons, a complete record of events could then be assembled, faithfully decoding the distributed memory fully preserved within the cell population. An important consideration of this system is that despite the necessary destruction of cells for readout at the end of the recording, the encoding process is not destructive. Thus, as opposed to sequential sampling of a population to generate a record of events, the current approach does not require that cells be destroyed while the experiment is ongoing. Moreover, because the recording is distributed across a population, only a fraction of the population needs to be sampled to retrieve the recording.

We uncovered details of the native CRISPR-Cas adaptation system. Integration of synthetic oligo sequences in vivo by the Cas1-Cas2 protein complex enabled us to directly assess detailed aspects of protospacer acquisition. Because the frequency of spacers acquired from the genome and plasmid is largely unaltered in the presence of oligo-derived acquisition (Figs. 1G and 2E), we conclude that the availability of adequate protospacers is likely one limiting aspect of the adaptation system. The presence of a 5′ AAG PAM modulated both the frequency and orientation of spacer acquisition, and the interior sequence of the protospacer influenced acquisition efficiency.

Directed evolution allowed us to experimentally modify PAM recognition of the Cas1-Cas2 complex, which enabled us to generate a record in multiple modalities simultaneously. This directed evolution method required no structural information and should be generally applicable to evolving other activities of CRISPR-Cas proteins by coupling them to the spacer acquisition process (for example, modifying target site specificity).

There are challenges to directly comparing between different cellular recording approaches. For instance, some are rewritable (4–7, 9–14, 17, 20, 21), whereas others, similar to our system, create permanent records (15, 17–21). To date, the highest permanent storage capacity of a synthetic in vivo recording device was achieved by using 11 orthogonal recombinases, capable of 2^11 (2,048) distinct states, capturing 1.375 bytes of information within a single cell (18). In our 3 × 5 recording, we encoded 15 individual elements within a population of cells. However, because this system can record arbitrary defined sequences, the number of possible states is expanded dramatically. With an invariable G at the beginning of the spacer and a five-base set identifier, 27 bases remain that could encode information, yielding 4^27 possible distinct sequences per spacer. It was possible to encode the order within each set to at least five elements, resulting in a specific state capacity for each set based on the permutation P(4^27,5) = 1.9 × 10^81, or 5.7 × 10^81 combining the three sets and assuming set independence. If we include interdependence between each set, total distinct states would rise to (4^27)^15 or ~7 × 10^243. As a point of comparison, the number of atoms in the observable universe is estimated at 1 × 10^80.
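
The state counts quoted above can be checked with exact integer arithmetic. The sketch below assumes only what the paragraph states (27 free bases per spacer, ordered sets of five, three sets); the variable names are illustrative, and `math.perm` requires Python 3.8 or later.

```python
from math import perm

# 27 freely chosen bases per spacer, four possible bases at each position.
sequences_per_spacer = 4 ** 27

# Ordered selections of 5 distinct spacer sequences within one set.
states_per_set = perm(sequences_per_spacer, 5)

# Three independent sets: the per-set state counts add.
states_three_sets = 3 * states_per_set

# Treating all 15 spacers as one interdependent record.
states_interdependent = sequences_per_spacer ** 15

print(f"{states_per_set:.1e}")
print(f"{states_three_sets:.1e}")
print(f"{states_interdependent:.0e}")
```

Python integers are arbitrary precision, so the counts are exact and only the printed scientific notation is rounded.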

Moving from theoretical to practical considerations, the information capacity of a given recording in our system depends on the degree to which the sequence of the protospacer is constrained. If there are no sequence constraints on the protospacer, and thus any arbitrary sequence is available, then the 15 recorded spacers (in the 3 × 5 recording paradigm) each contain 27 bases of recording potential at four bases per byte, yielding 101.25 bytes per recording. Throughout our experiments, we were able to vary the nucleotide identity at every one of these 27 positions in our oligo protospacers. However, we have not explicitly tested, nor is it practical to test, all possible protospacers for viability. Moreover, we have shown that the sequence of the protospacer can influence acquisition frequency, so it is reasonable to assume that not all possible sequences will be suitable protospacers.

We can set an absolute lower limit on the information capacity of the 3 × 5 recording presented here by assuming that the particular sequences that we used in the recording are the only possible sequences that could be used. In that case, we can encode information only in the order of the sequences recorded in three sets of five possible spacers, disallowing repetition. Under this assumption, the number of bits per set is given by log2[P(5,5)] = ~6.9 bits, or ~2.59 bytes summing all three sets.

However, to assume that no other sequences are allowable is conservative. For instance, considering just the new spacers that were observed in this work, there were 48,773 genome-derived, 186 plasmid-derived, and 23 oligo-derived spacers of 33 bases that included an AAG PAM in their protospacer context. Using this pool of validated sequences in our recording paradigm would yield log2[P(48982,5)] = ~77.9 bits per set, or ~29.21 bytes of potential encoding capacity for all three sets. Again, this estimation is certainly overconstrained because these sequences are drawn from an incredibly small subset of all possible sequences. Nonetheless, in the interest of being cautious, we can say that the recording capacity of the 3 × 5 paradigm is not less than 2.59 bytes nor more than 101.25 bytes and likely falls somewhere between 29.21 and 101.25 bytes. By also considering the ability to control spacer orientation (an extra modality), we could potentially encode an additional 5 bits per set. Of course, this only reflects the information of our current recordings, which we arbitrarily limited to 15 spacers. Native species have been found with as many as 458 spacers in a single cell (S. tokodaii) (40). This illustrates the potential space to encode complex biological phenomena, such as the transcriptional time course of many genes in a cell by means of reverse transcription of mRNA protospacers (41). We anticipate that such a recording system will be valuable in applications that require tracing long histories of in vivo cellular activity, including development, lineage, and activity in the brain (42, 43).
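
The three capacity estimates discussed above (lower bound, validated-pool estimate, and unconstrained upper bound) follow from a few lines of arithmetic. This is a sketch under the assumptions stated in the text: three sets of five ordered spacers, 8 bits per byte, and 2 bits per base; the variable names are illustrative.

```python
from math import log2, perm

SETS = 3
SPACERS_PER_SET = 5

# Lower bound: only the 5 sequences actually used in each set are allowed,
# so information lives solely in their order.
lower_bits_per_set = log2(perm(5, SPACERS_PER_SET))
lower_bytes = SETS * lower_bits_per_set / 8

# Intermediate estimate: draw from the pool of spacers observed in this work.
pool = 48_773 + 186 + 23  # genome-, plasmid-, and oligo-derived spacers
pool_bits_per_set = log2(perm(pool, SPACERS_PER_SET))
pool_bytes = SETS * pool_bits_per_set / 8

# Upper bound: 27 unconstrained bases per spacer, four bases per byte.
upper_bytes = SETS * SPACERS_PER_SET * 27 / 4

print(round(lower_bytes, 2), round(pool_bytes, 2), upper_bytes)
```

The extra orientation modality would add one bit per spacer, which is 5 bits per set on top of any of these figures.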

Materials and methods

Bacterial strains and culturing conditions

Expression and new spacer acquisition were carried out in BL21-AI cells. Unless otherwise specified, cells were grown in Luria Broth (LB) shaking (240 rpm) at 37°C. Genes expressed from the T7lac promoter were induced using L-arabinose (Sigma-Aldrich) at a final concentration of 0.2% (w/w) from a 20% stock solution in water and isopropyl-beta-D-thiogalactopyranoside (IPTG; Sigma-Aldrich) at a final concentration of 1mM from a 100mM stock solution in water. Cas mutants expressed from the pLtetO promoter were induced via anhydrotetracycline (aTc; Clontech) at a final concentration of 214nM from a 214μM stock in 50% ethanol. While expressing from the pLtetO promoter, 0.2% glucose was added to reduce unintended background expression from the T7lac promoter. For new spacer acquisition experiments not involving oligo-derived spacers, cells were induced and grown overnight (16h). All cloning was performed using NEB5α cells.

Cloning and library construction

A plasmid containing Cas1 and Cas2 under the control of a T7lac promoter (pWUR 1+2) was a generous gift of Udi Qimron (23). A variant of this plasmid was created harboring an additional CRISPR array based on an array found in the K12 strain. This additional array was synthesized and cloned into pWUR 1+2 to generate pWUKI 1+2. Cas1+2 were cloned into pRSF-DUET for a different plasmid context (pRSF-DUET 1/2). Cas1 and Cas2 were extracted from pWUR 1+2 by PCR and re-cloned into the same plasmid separately. In the case of Cas1, the selection was also changed in this step from spectinomycin to ampicillin to create pWURA Cas1 and pWUR Cas2. The point mutation E9Q was introduced into Cas2 by PCR to generate pWUKI Cas1+Cas2 E9Q. Similarly, point mutants of Cas1+2 based on mutants from the directed evolution experiment were created by PCR. Mutant 89 from the directed evolution experiment was cloned into pWUR 1+2 along with a terminator, pLtetO, and the tetR repressor from pJKR-H-tetR (44) to create pWUR 1+2 tetO mut89. The mutant library was created via error-prone PCR using the GeneMorph II Random Mutagenesis Kit (Agilent) and cloned into ElectroTen-Blue ultracompetent cells (Agilent) before being transferred to the expression strain (BL21-AI). For additional details, see the plasmid table (table S2).

Oligo protospacer electroporation

For spacer acquisition experiments involving oligo-derived spacers, cells were first grown overnight from individual plated clones. In the morning, 100μl of the overnight culture was diluted into 3ml of LB, with induction components as dictated by the experiment. Cells were grown with inducers for 2h. For an individual experimental condition, 1ml of this culture was pelleted and re-suspended in water. Cells were further washed by two additional pelleting and re-suspension steps, then pelleted a final time and re-suspended in 50μl of a 3.125μM solution of double-stranded oligonucleotides (unless otherwise noted) synthesized by IDT (Integrated DNA Technologies). All pelleting steps were via centrifugation at 13,000 × g for 1 min, and the entire process from the first pelleting to the final re-suspension was carried out at 4°C. Finally, the cell-oligo mixture was transferred to a 1mm gap cuvette and electroporated using a Bio-Rad gene pulser set to 1.8 kV and 25 μF with pulse controller at 200 Ω. Only those conditions with an electroporation time constant > 4.0 ms were carried through to analysis. Immediately after electroporation, cells were transferred into a culture tube containing 3ml of LB and grown for 2h (unless otherwise noted). At this time, 50μl of the culture was lysed by heating to 95°C for 5 min, cooled, then either used directly for analysis or saved for later analysis at -20°C. For multi-day recordings, 50μl of the culture was used to inoculate an overnight culture (in the absence of inducers) to restart the process the next day.

Analysis of spacer acquisition

Qualitative assessment of new spacer acquisition was achieved by PCR across the array (for all expansions) or PCR from either side of the array with the opposite primer matching the oligo that was electroporated (for sequence-specific acquisition). New spacer sequences were assigned to their origin in initial experiments by TOPO cloning (ThermoFisher) the expanded amplicons, followed by Sanger sequencing of the resulting colonies. For the majority of experiments, however, acquisition events were assessed by sequencing a library of all expanded and unexpanded arrays for a given condition using an Illumina MiSeq sequencer. Libraries were created from an initial PCR across the genomic array, then single- or dual-indexed using NEBNext Multiplex Oligos (NEB). Up to 96 conditions were run per flow cell. A list of oligo protospacers used can be found in table S4.

Processing and analysis of MiSeq data

Sequences were analyzed using custom-written software (Python). Briefly, spacer sequences were extracted from reads based on their arrangement between identifiable repeat sequences (four mismatches permitted in the repeat to allow for errors in sequencing), then compared against the sequences of spacers that populated the array prior to the experiment (five mismatches allowed against old spacers) to identify new spacers. At this time, metrics were collected as to the number of expanded versus unexpanded arrays, the number of expansions in each array, the position of new expansions, and the length of new spacers. The sequences of new spacers were then searched with blastn (NCBI) against a database containing the genome, plasmid, and any electroporated oligo sequences. From this, origin and orientation were determined, as was the protospacer flanking sequence for PAM analysis. To analyze the recordings over time, all reads containing double and triple expansions were analyzed. Oligo-derived sequences were identified based on their frequency among all new spacers; then, if applicable, set identifiers were extracted based on their known location in the sequences, and sets of oligo-derived sequences were assembled. The order of all oligo-derived spacers relative to each other and to genome- or plasmid-derived spacers was assessed in pairwise comparisons in all double and triple expanded arrays. Then, those values were used to test all ordered permutations of the oligo-derived spacers across each of the ordering rules. Sets were analyzed independently. An estimate of the time course of spacer acquisition was inferred by relative qPCR Ct values at all time points, referenced to a quantitative analysis of expansions by MiSeq at the two-hour time point. Library sizes for various mutant libraries were estimated by sequencing of fragmented mutant amplicons on a MiSeq sequencer.
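
The permutation test over pairwise comparisons can be sketched in simplified form. The code below is not the authors' software and does not implement their ordering rules; it assumes that each expanded array lists its new spacers leader-proximal (most recently acquired) first, and it scores every candidate order by the number of pairwise observations that agree with it. The function names and example arrays are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def pairwise_counts(arrays):
    """Count (newer, older) spacer pairs over all expanded arrays.

    Assumption: each array lists its new spacers leader-proximal first,
    so an earlier list position means a more recent acquisition.
    """
    counts = Counter()
    for array in arrays:
        for newer, older in combinations(array, 2):
            counts[(newer, older)] += 1
    return counts


def best_order(arrays, labels):
    """Return the labels, oldest first, in the order best supported by the data."""
    counts = pairwise_counts(arrays)

    def score(order):
        # For a candidate order (oldest first), a supporting observation is
        # an array in which the later label sits closer to the leader.
        return sum(
            counts[(later, earlier)]
            for earlier, later in combinations(order, 2)
        )

    return max(permutations(labels), key=score)


# Hypothetical double and triple expansions; true order is A, then B, then C.
arrays = [["B", "A"], ["C", "A"], ["C", "B"], ["C", "B", "A"], ["A", "B"]]
order = best_order(arrays, "ABC")
print(order)
```

The last array contradicts the true order, yet the majority of pairwise observations still recover it, which is the sense in which the record is distributed across the population. Exhaustive permutation is practical for sets of five spacers (120 orders) but would not scale to much longer records.
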
Sequence diversity was estimated as S_obs + F1^2/(2 × F2), where S_obs is the number of observed unique sequences in the sample, F1 is the number of sequences with a single occurrence, and F2 is the number of sequences with exactly two occurrences (45).
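
A diversity estimate built from these three quantities can be sketched as follows. The code assumes the standard Chao1 form, S_obs + F1^2/(2 × F2), which matches the definitions of S_obs, F1, and F2 given here; the fallback used when F2 is zero is the common bias-corrected variant and is an addition for robustness, not something stated in the text. The function name and example reads are hypothetical.

```python
from collections import Counter


def chao1(sequences):
    """Estimate total sequence diversity from a sample (Chao1 form, assumed)."""
    counts = Counter(sequences)
    s_obs = len(counts)                                  # unique sequences seen
    f1 = sum(1 for c in counts.values() if c == 1)       # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)       # seen exactly twice
    if f2 == 0:
        # Bias-corrected variant avoids division by zero (assumption).
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 ** 2 / (2 * f2)


# Hypothetical sample: 5 unique sequences, 3 singletons, 1 doubleton.
reads = ["a", "a", "a", "b", "b", "c", "d", "e"]
estimate = chao1(reads)
print(estimate)
```

Many singletons relative to doubletons indicate that much of the library remains unsampled, so the estimate rises well above the observed count.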

Statistics

See table S1.

Supplementary Materials

www.sciencemag.org/content/353/6298/aaf1175/suppl/DC1

Supplementary Text

Figs. S1 to S5

Tables S1 to S4

References and Notes

Acknowledgments: S.L.S., J.N., J.D.M., and G.M.C. are inventors on a provisional patent (62/296,812) filed by the President and Fellows of Harvard College that covers the work in this manuscript. S.L.S. is a Shurl and Kay Curci Foundation Fellow of the Life Sciences Research Foundation and received additional support from the National Institute on Aging (grant 5T32AG000222). The project was supported by grants from the National Institute of Mental Health (grant 5R01MH103910) and National Human Genome Research Institute (grant 5P50HG005550) to G.M.C., the National Institute of Neurological Disorders and Stroke (grant 5R01NS045523) to J.D.M., and an Allen Distinguished Investigator Award from the Paul G. Allen Family Foundation to J.D.M. Sequence data will be deposited into the National Center for Biotechnology Information Sequence Read Archive database as appropriate, and plasmids will be available under a materials transfer agreement with Addgene.