Research Article

The biochemical basis of microRNA targeting efficacy

See allHide authors and affiliations

Science  20 Dec 2019:
Vol. 366, Issue 6472, eaav1741
DOI: 10.1126/science.aav1741

Biochemical prediction of miRNA targeting

MicroRNAs (miRNAs) regulate most human messenger RNAs and play essential roles in diverse developmental and physiological processes. Correctly predicting the function of each miRNA requires a better understanding of miRNA targeting efficacy. McGeary et al. measured binding affinities between six miRNAs and synthetic targets, built a biochemical model of miRNA-mediated repression, and expanded it to all miRNAs using a convolutional neural network. This approach offers insights into miRNA targeting and enables more accurate prediction of intracellular miRNA repression efficacy than previous algorithms.

Science, this issue p. eaav1741

Structured Abstract

INTRODUCTION

MicroRNAs (miRNAs) are short RNAs that guide repression of mRNA targets. Each miRNA associates with an Argonaute (AGO) protein to form a complex in which the miRNA recognizes mRNA targets, primarily through pairing to sites that match its extended seed region (miRNA nucleotides 1 to 8) while the AGO protein recruits factors that promote destabilization and translational repression of bound targets. The miRNA targetome is vast, involving most mammalian mRNAs, and miRNA regulatory effects are consequential, with severe developmental or physiological defects often observed after deleting a broadly conserved miRNA (or set of paralogous miRNAs). Deeper understanding of these regulatory roles would be facilitated by a better understanding of miRNA targeting efficacy.

RATIONALE

In principle, targeting efficacy should be a function of the affinity between AGO-miRNA complexes and their target sites, in that greater affinity for a target site would cause increased occupancy at that site and thus increased repression of the target mRNA. However, the set of measured miRNA-target binding affinities has been sparse, and standard thermodynamic models of RNA-RNA pairing poorly predict affinities that have been measured. These limitations have prevented construction of an informative biochemical model of targeting efficacy, such that the best predictive performances have instead relied on indirect, correlative approaches. Here, we adapted RNA bind-n-seq (RBNS) and a convolutional neural network (CNN) to study miRNA-target interactions, thereby obtaining the quantity and diversity of affinity values needed to better understand and predict miRNA targeting efficacy.

RESULTS

Analysis of motifs enriched in RNA sequences bound to the AGO2–miR-1 complex provided unbiased identification of all miR-1 binding sites ≤12 nucleotides (nt) in length, and a newly developed computational procedure simultaneously inferred the relative dissociation constants (Kd values) of all of these sites. Repeating this procedure with AGO2 loaded with five other miRNAs (let-7a, miR-7, miR-124, miR-155, and lsy-6) revealed pronounced miRNA-specific differences in the relative affinities of canonical site types (defined as sites with ≥6-nt contiguous matches to the seed region). The analyses also revealed that each miRNA has a distinct repertoire of noncanonical site types and that dinucleotides flanking both sides of each site influence affinity by as much as 100-fold, primarily because of their impact on site accessibility. Most of the noncanonical sites paired to the seed region but did so with imperfections that reduced affinity to levels below those of the top four canonical sites. Nonetheless, for miR-124 and miR-155, noncanonical sites were identified with affinities approaching that of the top canonical site. These high-affinity noncanonical sites were larger and correspondingly rarer in mRNA sequences, which showed that canonical seed pairing is the most efficient way to achieve high-affinity binding.

The miRNA-specific differences in site repertoire and relative binding affinities corresponded to differential repression in cells, thereby enabling construction of a biochemical model of miRNA-mediated repression. This biochemical model predicts the occupancy at each site as a function of the Kd measured for the 12-nt sequence encompassing the site. The model outperformed the best correlative model, explaining ~60% of the relevant variation observed after transfecting a miRNA into cells. Although partly attributable to inclusion of noncanonical sites, the improved performance was primarily due to more accurate representation of the effects of canonical sites. Improved performance was extended to miRNAs without RBNS data by building a CNN that was trained with both RBNS-derived Kd values and mRNA-transfection fold-change measurements to predict binding affinity between any miRNA and any 12-nt sequence.

CONCLUSION

We replaced correlative models of targeting efficacy with a principled, biochemical model that explains and predicts about half of the variability attributable to the direct effects of miRNAs on their targets. The success of the model shows that site binding affinity is the major determinant of miRNA-mediated repression. It also shows that although active AGO-miRNA complexes are occupied primarily by canonical sites, noncanonical sites measurably contribute to repression in the cell. Repression efficacy predicted by this model will be available on the TargetScan website to provide improved guidance for placing miRNAs into gene-regulatory networks.

Biochemical modeling of targeting efficacy.

RBNS generates relative Kd values for an AGO-miRNA and 262,144 different 12-nt sequences with at least a weak match to the miRNA (left). Values for sites found within an mRNA (colored 12-nt sequences) are used to estimate site occupancy, thereby enabling prediction of mRNA repression. Either a shorter match to the seed region (upper right) or suboptimal flanking nucleotides that promote occlusive mRNA structure (upper middle) can reduce occupancy. Rel. Kd, relative Kd.

IMAGE: A. GODFREY/WHITEHEAD INSTITUTE

Abstract

MicroRNAs (miRNAs) act within Argonaute proteins to guide repression of messenger RNA targets. Although various approaches have provided insight into target recognition, the sparsity of miRNA-target affinity measurements has limited understanding and prediction of targeting efficacy. Here, we adapted RNA bind-n-seq to enable measurement of relative binding affinities between Argonaute-miRNA complexes and all sequences ≤12 nucleotides in length. This approach revealed noncanonical target sites specific to each miRNA, miRNA-specific differences in canonical target-site affinities, and a 100-fold impact of dinucleotides flanking each site. These data enabled construction of a biochemical model of miRNA-mediated repression, which was extended to all miRNA sequences using a convolutional neural network. This model substantially improved prediction of cellular repression, thereby providing a biochemical basis for quantitatively integrating miRNAs into gene-regulatory networks.

MicroRNAs (miRNAs) are ~22–nucleotide (nt) regulatory RNAs that derive from hairpin regions of precursor transcripts (1). Each miRNA associates with an Argonaute (AGO) protein to form a silencing complex, in which the miRNA pairs to sites within target transcripts and the AGO protein promotes destabilization and/or translational repression of bound transcripts (2). miRNAs are grouped into families on the basis of the sequence of their extended seed (nucleotides 2 to 8 of the miRNA), which is the region of the miRNA most important for target recognition (3). The 90 most broadly conserved miRNA families of mammals each have an average of >400 preferentially conserved targets, such that mRNAs from most human genes are conserved targets of at least one miRNA (4). Most of these 90 broadly conserved families are required for normal development or physiology, as shown by knockout studies in mice (1).

Deeper understanding of these numerous biological functions would be facilitated by a better understanding of miRNA targeting efficacy, with the ultimate goal of correctly predicting the effects of each miRNA on the output of each expressed gene. In principle, targeting efficacy should be a function of the affinity between AGO-miRNA complexes and their target sites, in that greater affinity to a target site would cause increased occupancy at that site and thus increased repression of the target mRNA. Until very recently, binding affinities have been known for only a few target sequences of only three miRNAs (511). In a recent study, high-throughput imaging and cleavage analyses provide extensive binding and slicing data for two of these three miRNAs: let-7a and miR-21 (12). Although these measurements provide insight and enable a quantitative model that predicts the efficiency of miR-21–directed slicing in cells (12), the sparsity of binding-affinity data still limits insight into how targeting might differ between different miRNAs and prevents construction of an informative biochemical model of targeting efficacy relevant to the vastly more prevalent, nonslicing mode of miRNA-mediated repression.

With insufficient affinity measurements, the most informative models of targeting efficacy rely instead on indirect, correlative approaches. These models focus on mRNAs with canonical 6- to 8-nt sites matching the miRNA seed region (Fig. 1A) and train on features known to correlate with targeting efficacy (including the type of site as well as various features of site context, mRNAs, and miRNAs), by using datasets that monitor mRNA changes that occur after introducing a miRNA (1316). Although the correlative model implemented in TargetScan7 performs as well as the best in vivo cross-linking approaches at predicting mRNAs most responsive to miRNA perturbation, it nonetheless explains only a small fraction of the mRNA changes observed upon introducing a miRNA [coefficient of determination (r2) = 0.14] (14). This low value indicates that prediction of targeting efficacy has room for improvement, even when accounting for the fact that experimental noise and secondary effects of inhibiting direct targets place a ceiling on the variability attributable to direct targeting. Therefore, we adapted RNA bind-n-seq (RBNS) (17) and a convolutional neural network (CNN) to the study of miRNA-target interactions, with the goal of obtaining the quantity and diversity of affinity measurements needed to better understand and predict miRNA targeting efficacy.

Fig. 1 AGO-RBNS reveals binding affinities of canonical and previously uncharacterized miR-1 target sites.

(A) Canonical sites of miR-1. These sites have contiguous pairing (blue) to the miRNA seed (red), and some include an additional match to miRNA nucleotide 8 or an A opposite miRNA nucleotide 1 (B represents C, G, or U; D represents A, G, or U). (B) AGO-RBNS. Purified AGO2–miR-1 is incubated with excess RNA library molecules that each have a central block of 37 random-sequence positions (N37). After reaching binding equilibrium, the reaction is applied to a nitrocellulose membrane and washed under vacuum to separate library molecules bound to AGO2–miR-1 from those that are unbound. Molecules retained on the filter are purified, reverse transcribed, amplified, and sequenced. These sequences are compared with those generated directly from the input RNA library. (C) Enrichment of reads containing canonical miR-1 sites in the 7.3 pM AGO2–miR-1 library. Shown is the abundance of reads containing the indicated site (key) in the bound library plotted as a function of the respective abundance in the input library. Dashed vertical lines depict the enrichment in the bound library; dashed diagonal line shows y = x. Reads containing multiple sites were assigned to the site with greatest enrichment. (D) AGO-RBNS profile of the canonical miR-1 sites. Plotted is the enrichment of reads with the indicated canonical site (key) observed at each of the five AGO2–miR-1 concentrations of the AGO-RBNS experiment, determined as in (C). Points show the observed values, and lines show the enrichment predicted from the mathematical model fit simultaneously to all of the data. Also shown for each site are Kd values obtained from fitting the model, listing the geometric mean ± the 95% confidence interval determined by resampling the read data, removing data for one AGO–miR-1 concentration and fitting the model to the remaining data, and repeating this procedure 200 times (40 times for each concentration omitted). (E) AGO-RBNS profile of the canonical and the newly identified noncanonical miR-1 sites (key). Sites are listed in the order of their Kd values and named and colored based on the most similar canonical site, indicating differences from this site with b (bulge), w (G-U wobble), or x (mismatch) followed by the nucleotide and its position. For example, the 8mer-bU(4.6) resembles a canonical 8mer site but has a bulged U at positions that would normally pair to miRNA nucleotides 4, 5, or 6. Everything else is the same as in (D). (F) Relative Kd values for the canonical and the newly identified noncanonical miR-1 sites determined in (E). Sites are classified as either 7- to 8-nt canonical sites (purple), 6-nt canonical sites (cyan), noncanonical sites (pink), or a sequence motif with no clear complementarity to miR-1 (gray). The solid vertical line marks the reference Kd value of 1.0 assigned to reads lacking an annotated site. Error bars indicate 95% confidence interval on the geometric mean, as in (D). (G) The proportion of AGO2–miR-1 bound to each site type. Shown are proportions inferred by the mathematical model over a range of AGO2–miR-1 concentrations spanning the five experimental samples, plotted in the order of site affinity (top to bottom), using the same colors as in (E). On the right is the pairing of each noncanonical site, diagrammed as in (A), indicating Watson-Crick pairing (blue), wobble pairing (cyan), mismatched pairing (red), bulged nucleotides (compressed rendering), and terminal noncomplementarity (gray; B represents C, G, or U; D represents A, G, or U; H represents A, C, or U; V represents A, C, or G). The GCUUCCGC motif is omitted because it did not match miR-1 and did not mediate repression by miR-1 (fig. S5B).

The site-affinity profile of miR-1

As previously implemented, RBNS provides qualitative relative binding measurements for an RNA-binding protein to a virtually exhaustive list of binding sites (17, 18). A purified RNA-binding protein is incubated with a large library of RNA molecules that each contain a central random-sequence region flanked by constant primer-binding regions. After reaching binding equilibrium, the protein is pulled down and any copurifying RNA molecules are reverse transcribed, amplified, and sequenced. To extend RBNS to AGO-miRNA complexes (Fig. 1B), we purified human AGO2 loaded with miR-1 (19) (fig. S1A) and set up five binding reactions, each with a different concentration of AGO2–miR-1 (range of 7.3 to 730 pM, logarithmically spaced) and a constant concentration of an RNA library with a 37-nt random-sequence region (100 nM). We also modified the protein-isolation step of the RBNS protocol, replacing protein pull down with nitrocellulose filter binding, reasoning that the rapid wash step of filter binding would improve retention of low-affinity molecules that would otherwise be lost during the wash steps of a pull down. This modified method was highly reproducible, with high correspondence observed between enrichments for the same 9-nt k-mers (where k-mer is any sequence of length k) in two independent experiments using different preparations of both AGO2–miR-1 and the RNA library (fig. S1B; r2 = 0.86).

When analyzing our AGO-RBNS results, we first examined enrichment of the canonical miR-1 sites, comparing the frequency of these sites in RNA bound in the 7.3 pM AGO2–miR-1 sample with that of the input library. As expected from the site hierarchy observed in meta-analyses of site conservation and endogenous site efficacy (3), the 8mer site (perfect match to miR-1 nucleotides 2 to 8 followed by an A) was most enriched (38-fold), followed by the 7mer-m8 site, then the 7mer-A1 site, and the 6mer site (Fig. 1, A and C). Little if any enrichment was observed for either the 6mer-A1 site or the 6mer-m8 site at this lowest concentration of 7.3 pM AGO2–miR-1 (Fig. 1, A and C), consistent with their weak signal in previous analyses of conservation and efficacy (4, 14, 20). Enrichment of sites was quite uniform across the random-sequence region, which indicated minimal influence from either the primer-binding sequences or supplementary pairing to the 3′ region of the miRNA (fig. S1D). Although sites with supplementary pairing can have enhanced efficacy and affinity (3, 5, 21), the minimal influence of supplementary pairing reflected the rarity of such sites in our library.

Analysis of enrichment of the six canonical sites across all five AGO2–miR-1 concentrations illustrated two hallmarks of this experimental platform (17). First, as the concentration increased from 7.3 to 73 pM, enrichment for each of the six site types increased (Fig. 1D), which was attributable to an increase in signal over a constant low background of library molecules isolated even in the absence of AGO2–miR-1. Second, as the AGO2–miR-1 concentration increased beyond 73 pM, 8mer enrichment decreased, and at the highest AGO2–miR-1 concentration, enrichment of the 7mer-m8 and 7mer-A1 sites decreased (Fig. 1D). These waning enrichments indicated the onset of saturation for these high-affinity sites (17). These two features, driven by AGO-miRNA–independent background and partial saturation of the higher-affinity sites, respectively, caused differences in enrichment values for different site types to be highly dependent on the AGO2–miR-1 concentration; the lower AGO2–miR-1 concentrations provided greater discrimination between the higher-affinity site types, the higher AGO2–miR-1 concentrations provided greater discrimination between the lower-affinity site types, and no single concentration provided results that quantitatively reflected differences in relative binding affinities.

To account for background binding and ligand saturation, we developed a computational strategy that simultaneously incorporated information from all concentrations of an RBNS experiment to calculate relative Kd values. Underlying this strategy was an equilibrium-binding model that predicts the observed enrichment of each site type across the concentration series as a function of the Kd values for each miRNA site type (including the “no-site” type), as well as the stock concentration of purified AGO2–miR-1 and a constant amount of library recovered as background in all samples. Using this model, we performed maximum likelihood estimation (MLE) to fit the relative Kd values, which explained the observed data well (Fig. 1D). Moreover, these relative Kd values were robustly estimated, as indicated by comparing values obtained using results from only four of the five AGO2–miR-1 concentrations (r2 ≥ 0.994 for each of the 10 pairwise comparisons; fig. S1, F and G). These quantitative binding affinities followed the same hierarchy as observed for site enrichment, but the differences in affinities were of greater magnitude (Fig. 1D and fig. S1C).

Up to this point, our analysis was informed by the wealth of previous computational and experimental data showing the importance of a perfect 6- to 8-nt match to the seed region (3). However, the ability to calculate the relative Kd of any k-mer of length ≤12 nt (the 12-nt limit imposed by the sparsity of reads with longer k-mers) provided the opportunity for a de novo search for sites, without bias from any previous knowledge. In this search, we (i) calculated the enrichment of all 10-nt k-mers in the bound RNA in the 730 pM AGO2–miR-1 sample, which was the sample with the most sensitivity for detecting low-affinity sites; (ii) determined the extent of complementarity between the 10 most enriched k-mers and the miR-1 sequence; (iii) assigned a site most consistent with the observed k-mers; and (iv) removed all reads containing this newly identified site from both the bound and input libraries. These four steps were iterated until no 10-nt k-mer remained that was enriched ≥10-fold, thereby generating 14 sites for AGO2–miR-1. We then applied our MLE procedure to calculate relative Kd values for this expanded list of sites (Fig. 1, E and F).

This unbiased approach demonstrated that the 8mer, 7mer-m8, 7mer-A1, and 6mer sites to miR-1 were the highest-affinity site types of lengths ≤10 nt. It also identified eight previously uncharacterized sites with binding affinities resembling those of the 6mer-m8 and the 6mer-A1 (Fig. 1F). Comparison of these sites to the sequence of miR-1 revealed that miR-1 can tolerate either a wobble G at position 6 or a bulged U somewhere between positions 4 and 6 and achieve affinity at least 7- to 11-fold above that of the remaining no-site reads and that it can tolerate either a mismatched C at position 5 or a mismatched U at position 6 and achieve affinity four- to fivefold above that of the no-site reads. The GCUUCCGC motif also passed our cutoffs, which was more difficult to explain, because it had contiguous complementarity to positions 2 to 5 of miR-1 flanked by noncomplementary GC dinucleotides on both sides. Nonetheless, among the 1,398,100 possible motifs ≤10 nt, this was the only one that satisfied our criteria yet was difficult to attribute to miRNA pairing.

Our analytical approach and its underlying biochemical model also allowed us to infer the proportion of AGO2–miR-1 bound to each site (Fig. 1G). The 8mer site occupied 3.8 to 17% of the silencing complex over the concentration course, whereas the 7mer-m8, by virtue of its greater abundance, occupied a somewhat greater fraction of the complex. In aggregate, the marginal sites—including the 6mer-A1, 6mer-m8, and seven noncanonical sites—occupied 6.1 to 9.8% of the AGO2–miR-1 complex. Moreover, because of their very high abundance, library molecules with no identified site occupied 32 to 53% of the complex (Fig. 1G). These results support the inference that the summed contributions of background binding and low-affinity sites to intracellular AGO occupancy are of the same order of magnitude as those of canonical sites, suggesting that an individual AGO-miRNA complex spends about half its time associated with a vast repertoire of background and low-affinity sites (22, 23). This phenomenon would help explain why sequences without recognizable sites often cross-link to AGO in cells.

Our results confirmed that AGO2–miR-1 binds the 8mer, 7mer-m8, 7mer-A1, and 6mer sites most effectively and revealed the relative binding affinities and occupancies of these sites. In addition, our results uncovered weak yet specific affinity to the 6mer-A1 and 6mer-m8 sites plus seven noncanonical sites, all with affinities outside the dynamic range of recent high-throughput imaging experiments (12). Although alternative binding sites for miRNAs have been proposed on the basis of high-throughput in vivo cross-linking studies (2428), our approach provided quantification of the relative strength of these sites without the confounding effects of differential cross-linking efficiencies, potentially enabling their incorporation into a quantitative framework of miRNA targeting.

Distinct canonical and noncanonical binding of different miRNAs

We extended our analysis to five additional miRNAs, including let-7a, miR-7, miR-124, and miR-155 of mammals, chosen for their sequence conservation as well as the availability of data examining their regulatory activities, intracellular binding sites, or in vitro binding affinities (1, 5, 6, 24, 25), and lsy-6 of nematodes, which is thought to bind unusually weakly to its canonical sites (29) (Fig. 2 and fig. S2, B and C). In the case of let-7a, previous biochemical analyses have determined the Kd values of some canonical sites (5, 6, 12), and our values agreed well, which further validated our high-throughput approach (fig. S1H).

Fig. 2 Distinct canonical and noncanonical binding of different miRNAs.

(A to C) Relative Kd values and proportional occupancy of established and newly identified sites of let-7a (A), miR-155 (B), and miR-124 (C). The two miR-124 sites that were present as a 5′-AA–extended form in addition to an unextended form are shown on the same line (C). Relative Kd values are plotted as in Fig. 1F but in some cases with additional categories, either for 3′-only sites (green) [(B) and (C)] or for 6-nt canonical sites enhanced by either additional wobble-pairing or additional Watson-Crick complementarity separated by a bulged nucleotide (blue) [(B) and (C)]. The proportion of AGO2-miRNA bound to each site type is estimated and shown as in Fig. 1G. These analyses also detected a GCACUUUA motif for let-7a and AACGAGGA motif for miR-155, which were assigned relative Kd values of 7.1 ± 0.8 × 10−2 and 6 ± 1 × 10−2, respectively. These motifs are excluded because each did not match its respective miRNA and did not mediate repression by its respective miRNA (fig. S5B).

The site-affinity profile of let-7a resembled that of miR-1, except the 6mer-m8 and 6mer-A1 sites for let-7a had greater binding affinity than essentially all of the noncanonical sites (Fig. 2A). As with miR-1, the noncanonical sites each paired to the seed region but did so imperfectly, typically with a single wobble, single mismatch, or single-nucleotide bulge, but these imperfections differed from those observed for miR-1 (Figs. 1F and 2A).

The site-affinity profiles of miR-124, miR-155, lsy-6, and miR-7 resembled those of miR-1 and let-7a. All but one included the six canonical sites (with miR-7 missing the 6mer-m8 site), and all contained noncanonical sites with extensive yet imperfect pairing to the miRNA seeds, the imperfections tending to occur at different positions and with different mismatched- or bulged-nucleotide identities for different miRNAs (Fig. 2, B and C, and fig. S2, B and C). In contrast to the noncanonical sites of miR-1 and let-7a, more of the noncanonical sites of the other four miRNAs had affinities interspersed with those of the top four canonical sites. Moreover, the profiles for miR-155, miR-124, and lsy-6 also included sites with extended (9- to 11-nt) complementarity to the miRNA 3′ region. These sites had estimated Kd values that were derived from reads with little more than chance complementarity to the miRNA seed, and they had uniform enrichment across the length of the random-sequence region (fig. S1E), which indicated that these sites represented an alternative binding mode dominated by extensive pairing to the 3′ region without involvement of the seed region (Fig. 2, B and C, and fig. S2B). We named them “3′-only sites.”

In some respects, the 3′-only sites resembled noncanonical sites known as centered sites, which are reported to function in mammalian cells (30). Like 3′-only sites, centered sites have extensive perfect pairing to the miRNA, but for centered sites, this pairing begins at miRNA positions 3 or 4 and extends 11 to 12 nt through the center of the miRNA (30). Our unbiased search for sites did not identify centered sites for any of the six miRNAs. We therefore directly queried the region of each miRNA to which extensive noncanonical pairing was favored, determining the affinity of sequences with 11-nt segments of perfect complementarity to the miRNA sequence, scanning from miRNA position 3 to the 3′ end of the miRNA (Fig. 3A). For miR-155, miR-124, and lsy-6, sequences with 11-nt sites that paired to the miRNA 3′ region bound with greater affinity than did those with a canonical 6mer site, whereas for let-7a, miR-1, and miR-7, none of the 11-nt sites conferred stronger binding than did the 6mer. Moreover, for all six miRNAs, the 11-nt sites that satisfied the criteria for annotation as centered sites conferred binding ≤2-fold stronger than that of the 6mer-m8 site, which also starts at position 3 but extends only 6 nt. These results called into question the function of centered sites, although we cannot rule out the possibility that centered sites are recognized by some miRNAs and not others. Indeed, the newly identified 3′-only sites functioned for only miR-155, miR-124, and lsy-6, and even among these, the optimal region of pairing differed, occurring at positions 13 to 23, 9 to 19, and 8 to 18, respectively (Fig. 3A).

Fig. 3 Additional analyses of binding affinities and the correspondence between binding affinity and repression efficacy.

(A) Diverse functionality and position dependence of 11-nt 3′-only sites. Relative Kd values for each potential 11-nt 3′-only site are plotted for the indicated miRNAs (key). For reference, values for the 8mer, 6mer, and 6mer-m8 sites are also plotted. The solid vertical line marks the reference Kd value of 1.0, as in Fig. 1F. The solid and dashed lines indicate geometric mean and 95% confidence interval, respectively, determined as in Fig. 1D. (B) The independent contributions of the A1 and m8 features. On the left, a double-mutant cycle depicts the affinity differences observed among the four top canonical sites for miR-1, as imparted by the independent contributions of the A1 and m8 features and their potential interaction. On the right, the apparent binding contributions of the A1 (∆∆GA1, blue and cyan) or m8 (∆∆Gm8, red and pink) features are plotted, determined from the ratio of relative Kd values of either the 7mer-A1 and the 6mer (blue), the 8mer and the 7mer-m8 (cyan), the 7mer-m8 and the 6mer (red), or the 8mer and the 7mer-A1 (pink) for the indicated AGO2-miRNA complexes. The r2 reports on the degree of ∆∆G similarity for both the m8 and A1 features using either of the relevant site-type pairs across all six complexes. (C) The relationship between the observed relative Kd values and predicted pairing stability of the 6mer (filled circles) and 7mer-m8 (open circles) sites of the indicated AGO-miRNA complex (key), under the assumption that the Kd value for library molecules without a site was 10 nM for all AGO-miRNA complexes. The two black lines are the best fit of the relationship observed for each of the site types (gray regions, 95% confidence interval). The gray line shows the expected relationship with the predicted stabilities given by Kd = e−∆G/RT. (D to I) The relationship between repression efficacy and relative Kd values for the indicated sites of miR-1 (D), let-7a (E), miR-155 (F), miR-124 (G), lsy-6 (H), and miR-7 (I). The number of sites of each type in the 3′UTRs is indicated (parentheses). To include information from mRNAs with multiple sites, multiple linear regression was applied to determine the log fold-change attributable to each site type (error bars, 95% confidence interval). The relative Kd values are those of Figs. 1 and 2 and fig. S2 (error bars, 95% confidence interval). Lines show the best fit to the data, determined by least-squares regression, weighting residuals using the 95% confidence intervals of the log fold-change estimates. The r2 values were calculated using similarly weighted Pearson correlations.

When evaluating other types of noncanonical sites proposed to confer widespread repression in mammalian cells (20, 24), we found that all but two bound with affinities difficult to distinguish from background. One of these two was the 5-nt site matching miRNA positions 2 to 6 (5mer-m2.6) (20), which was bound by miR-1, let-7a, and miR-7 but not by the other three miRNAs (fig. S3). The other was the pivot site (24), which was bound by miR-124 [e.g., 8mer-bG(6.7); Fig. 2C] and lsy-6 [e.g., 8mer-bA(6.7); fig. S2B] but not by the other four miRNAs (fig. S4). Thus, these two previously identified noncanonical site types resembled the newly identified noncanonical sites with extensive yet imperfect pairing to the seed region, in that they function for only a limited number of miRNAs.

In addition to the differences in noncanonical site types observed for each miRNA, we also observed pronounced miRNA-specific differences in the relative affinities of the canonical site types. For example, for miR-155, the affinity of the 7mer-A1 nearly matched that of the 7mer-m8, whereas for miR-124, the affinity of the 7mer-A1 was >9-fold lower than that of the 7mer-m8. These results implied that the relative contributions of the A at target position 1 and the match at target position 8 can substantially differ for different miRNAs. Although prior studies show that AGO proteins remodel the thermodynamic properties of their loaded RNA guides (5, 6), our results show that the sequence of the guide strongly influences the nature of this remodeling, leading to differences in relative affinities across canonical site types and a distinct repertoire of noncanonical site types for each miRNA.

The energetics of canonical binding

With the relative Kd values for the canonical binding sites of six miRNAs in hand, we examined the energetic relationship between the A at target position 1 (A1) and the match at miRNA position 8 (m8), within a framework analogous to a double-mutant cycle (Fig. 3B, left). The apparent binding-energy contributions of the m8 and A1 (∆∆Gm8 and ∆∆GA1, respectively) were largely independent, as inferred from the relative Kd values of the four site types. That is, for each miRNA, the ∆∆Gm8 inferred in the presence of the A1 (using the ratio of the 8mer and 7mer-A1 Kd values) resembled that inferred in the absence of the A1 (using the ratio of the 7mer-m8 and 6mer Kd values), and vice versa (Fig. 3B).

The relative Kd values for canonical sites of six miRNAs provided the opportunity to examine the relationship between the predicted free energy of site pairing and measured site affinities. We focused on the 6mer and 7mer-m8 sites because they lack the A1, which does not pair to the miRNA (Fig. 1A) (8, 31). Consistent with the importance of base pairing for site recognition and the known relationship between predicted seed-pairing stability and repression efficacy (29), affinity increased with increased predicted pairing stability, although this increase was statistically significant for only the 7mer-m8 site type (Fig. 3C; p = 0.09 and 0.005 for the 6mer and 7mer-m8 sites, respectively). However, for both site types, the slope of the relationship was significantly less than expected from Kd = e−∆G/RT, where ∆G is the change in free energy, R is the universal gas constant, and T is temperature (p = 0.008 and 8 × 10−5, respectively). When considered together with the previous analysis of a miRNA with enhanced seed-pairing stability, these results indicated that in remodeling the thermodynamic properties of the loaded miRNAs, AGO not only enhances the affinity of seed-matched interactions but also dampens the intrinsic differences in seed-pairing stabilities that would otherwise impose much greater inequities between the targeting efficacies of different miRNAs (6). Thus, although lsy-6, which has unusually poor predicted seed-pairing stability (29), did indeed have the weakest site-binding affinity of the six miRNAs, the difference between its binding affinity and that of the other miRNAs was less than might have been expected.

Correspondence with repression observed in the cell

To evaluate the relevance of our in vitro binding results to intracellular miRNA-mediated repression, we examined the relationship between the relative Kd measurements and the repression of endogenous mRNAs after miRNA transfection into HeLa cells. When examining intracellular repression attributable to 3′UTR (3′ untranslated region) sites of the transfected miRNA, we observed a pronounced relationship between AGO-RBNS–determined Kd values and mRNA fold changes (Fig. 3, D to I; r2 = 0.80 to 0.97). For instance, the different relative affinities of the 7mer-A1 and 7mer-m8 sites, most extremely observed for sites of miR-155 and miR-124, were nearly perfectly mirrored by the relative efficacy of these sites in mediating repression in the cell (Fig. 3, F and G). A similar correspondence between relative Kd values and repression was observed for the noncanonical sites that had both sufficient affinity and sufficient representation in the HeLa transcriptome to be evaluated using this analysis (Fig. 3, D to I). These included the pivot sites for miR-124 and lsy-6 and the bulge-G7–containing sites for miR-7 (Fig. 3, G to I).

Analysis of mRNA changes observed after miRNA transfection was not suitable for measuring efficacy of the highest-affinity noncanonical sites because these sites lacked sufficient representation in endogenous 3′UTRs. Therefore, we implemented a massively parallel reporter assay designed to examine the efficacy of every site type identified by AGO-RBNS, each in 184 different 3′UTR sequence contexts (fig. S5A). This assay showed that 3′-only sites and other high-affinity–but-rare noncanonical site types do mediate repression in cells and that their efficacies tend to track with their affinities (fig. S5B). In sum, we found a strong correspondence between intracellular repression and in vitro binding affinity, regardless of miRNA identity and regardless of whether the target site is canonical or noncanonical or within an endogenous or a reporter mRNA. This result supported a model in which repression is a function of miRNA occupancy, as dictated by site affinity, and thus miRNA- and site-specific differences in binding affinities explain substantial differences in repression.

The strong influence of flanking dinucleotide sequences

AU-rich nucleotide composition immediately flanking miRNA sites has long been associated with increased site conservation and efficacy in cells (13, 31, 32), but the mechanistic basis of this phenomenon has not been investigated, presumably because of the sparsity of affinity measurements. The AGO-RBNS data provided the means to overcome this limitation. We first separated the miR-1 8mer site into 256 different 12-nt sites, on the basis of the dinucleotide sequences immediately flanking each side of the 8mer, and determined relative Kd values for each (Fig. 4A). This analysis revealed a ~100-fold range in values, depending on the identities of the flanking dinucleotides, with binding affinity strongly tracking the AU content of the flanking dinucleotides. Extending this analysis across all miR-1 site types (Fig. 4B), as well as to sites to the other five miRNAs (fig. S6, A to E), yielded similar results. The effect of the flanking-dinucleotide context was of such magnitude that it often exceeded the affinity differences observed between miRNA-site types. Indeed, for each miRNA, at least one 6-nt canonical site in its most favorable context had greater affinity than that of the 8mer site in its least favorable context (Fig. 4B and fig. S6, A to E).

Fig. 4 The influence of flanking dinucleotide sequence context.

(A) AGO-RBNS profile of miR-1 sites, showing results for the 8mer separated into 256 different 12-nt sites on the basis of the identities of the two dinucleotides immediately flanking the 8mer. For each 12-nt site, the points and line are colored on the basis of the AU content of the flanking dinucleotides (key). For context, results of Fig. 1E are replotted in gray. Everything else is the same as in Fig. 1E. (B) Relative Kd values for each miR-1 site identified in Fig. 1F separated into 144 to 256 sites as in (A) on the basis of the identities of the flanking dinucleotides. The points are colored as in (A). Error bars indicate median 95% confidence interval across all Kd values. Everything else is the same as in Fig. 1F. (C) Consistency of flanking-dinucleotide effect across miRNA and site type. At the left is a comparison of observed relative Kd values and results of a mathematical model that used multiple linear regression to predict the influence of flanking dinucleotides. Plotted are results for all flanking dinucleotide contexts of all six canonical site types, for all six miRNAs, normalized to the average affinity of each canonical site. Predictions of the model are those observed in a sixfold cross-validation, training on the results for five miRNAs and reporting the predictions for the held-out miRNA. The points for five outliers are not shown. The r2 quantifies the agreement between the predicted and actual values, considering all points. On the right, the model coefficients (multiplied by −RT, where T = 310.15 K) corresponding to each of the four nucleotides of the 5′ (5p) and 3′ (3p) dinucleotides in the 5′-to-3′ direction are plotted (error bars, 95% confidence interval). (D) Relationship between the mean structural-accessibility score and the relative Kd for the 256 12-nt sites containing the miR-1 8mer flanked by each of the dinucleotide combinations. Points are colored as in (A). Linear regression (dashed line) and calculation of r2 were performed using log-transformed values. For an analysis of the relationship between 8mer flanking-dinucleotide Kd and structural accessibility over a range of window lengths and positions relative to the 8mer site, see fig. S6G.

To identify general features of the flanking-dinucleotide effect across miRNA sequences and site types, we trained a multiple linear regression model on the complete set of flanking-dinucleotide Kd values corresponding to all six canonical site types of each miRNA, fitting the effects at each of the four positions within the two flanking dinucleotides. The output of the model agreed well with the observed Kd values (Fig. 4C, left; r2 = 0.63), which indicated that the effects of the flanking dinucleotides were largely consistent between miRNAs and between site types of each miRNA. The output of the model also corresponded with the efficacy of intracellular repression, which indicated that these effects on Kd values were consequential in cells (fig. S6F). A and U nucleotides each enhanced affinity, whereas G nucleotides reduced affinity and C nucleotides were intermediate or neutral (Fig. 4C, right). Moreover, the identity of the 5′ flanking dinucleotide, which must come into close proximity with the central RNA-binding channel of AGO (7), contributed more to binding affinity than did the 3′ flanking sequence (Fig. 4C, right).

One explanation for this hierarchy of flanking nucleotide contributions, with A ≈ U > C > G, is that it inversely reflected the propensity of these nucleotides to stabilize RNA secondary structure that could occlude binding of the silencing complex. To investigate this potential role for structural accessibility in influencing binding, we compared the predicted structural accessibility of 8mer sites in the input and bound libraries of the AGO2–miR-1 experiment, using a score for predicted structural accessibility previously optimized on data examining miRNA-mediated repression (14, 33). This score is based on the predicted probability that the 14-nt segment at target positions 1 to 14 is unpaired. We found that predicted accessibilities of sites in the bound libraries were substantially greater than those for sites in the input library and that the difference was greatest for the samples with the lower AGO2–miR-1 concentrations (fig. S6G), as expected if the accessibility score was predictive of site accessibility and if the most accessible sites were the most preferentially bound.

To build on these results, we examined the relationship between predicted structural accessibility and binding affinity for each of the 256 flanking dinucleotide possibilities. For each input read with a miR-1 8mer site, the accessibility score of that site was calculated. The sites were then differentiated on the basis of their flanking dinucleotides into 256 12-nt sites, and the geometric mean of the structural-accessibility scores of each of these extended sites was compared with the AGO-RBNS–derived relative Kd value (Fig. 4D and fig. S6H). A notable correlation was observed (r2 = 0.82, p < 10−15), with all 16 sites containing a 5′-flanking GG dinucleotide having both unusually poor affinities and unusually low accessibility scores. Moreover, sampling reads from the input library to match the predicted accessibility of sites in the bound library recapitulated the flanking dinucleotide preferences observed in the bound library (fig. S6I, r2 = 0.79). Taken together, our results demonstrate that local sequence context has a large influence on miRNA-target binding affinity and indicate that this influence results predominantly from the differential propensities of flanking sequences to favor structures that occlude site accessibility.

A biochemical model predictive of miRNA-mediated repression

Inspired by the finding that measured affinities strongly corresponded to the repression observed in cells (Fig. 3, D to I), we set out to build a biochemical framework that predicts the degree to which a miRNA represses each mRNA. Biochemical principles have been used to model miR-21–directed mRNA slicing (12). However, previous efforts that used biochemical principles to model aspects of the predominant mode of miRNA-mediated repression, including competition between endogenous target sites (23, 34, 35) and the influence of miRNAs on reporter gene–expression noise (36), were severely limited by the sparsity of the data. Our ability to measure the relative binding affinity of a miRNA to any 12-nt sequence enabled modeling of the quantitative effects of the six miRNAs on each cellular mRNA.

We first reanalyzed all six AGO-RBNS experiments to calculate, for each miRNA, the relative Kd values for all 262,144 12-nt k-mers that contained at least four contiguous nucleotides of the canonical 8mer site (Fig. 5A). These potential binding sites included the canonical sites and most of the noncanonical sites that we had identified, each within a diversity of flanking sequence contexts (Figs. 1F and 2). For each mRNA m and transfected miRNA g, the steady-state occupancy Nm,g (i.e., average number of AGO-miRNA complexes loaded with miRNA g bound to mRNA m) was predicted as a function of the Kd values of the potential binding sites contained within the mRNA open reading frame (ORF) and 3′UTR, as well as the concentration of the unbound AGO-miRNAg complex ag, which was fit as a single value for each transfected miRNA (Fig. 5B, equation 1). This occupancy value enabled prediction of a biochemically informed expectation of repression, assuming that the added effect of the miRNA on the basal decay rate scaled with the basal rate and Nm,g (Fig. 5B, equation 2). To isolate the effects of a transfected miRNA over background, we further offset our prediction of repression by a background-binding term (Fig. 5B, Nm,g,background).

Fig. 5 AGO-RBNS Kd values enable a predictive model of miRNA-mediated repression in cells.

(A) The 262,144 12-nt k-mers with at least four contiguous matches to the extended seed region of miR-1, for which relative Kd values were determined. Relative Kd values were similarly determined for the analogous k-mers of the other five miRNAs. (B) Biochemical model for estimating miRNA-mediated repression of an mRNA using the relative Kd values of the 12-nt k-mers in the mRNA. (C) Performance of the biochemical model as evaluated using the combined results of five miRNAs. Plotted is the relationship between mRNA changes observed after transfecting a miRNA and those predicted by the model. Each point represents the mRNA from one gene after transfection of a miRNA and is colored according to the number of canonical sites in the mRNA 3′UTR (key). For easier visual comparison between mRNAs, y-axis points for the same mRNA are adjusted by the extrapolated expression level of the mRNA with no transfected miRNA. The Pearson’s r2 between measured and predicted values is for unadjusted values and is reported in the upper right. (D). Performance of the retrained TargetScan7 model. Everything else is the same as in (C). (E) Performance of the biochemical+ model. Everything else is the same as in (C). (F) Model performances and the contribution of cognate noncanonical sites to performance of the biochemical+ model. Results for each model (key) are plotted for individual miRNAs and for all five miRNAs combined (error bars, standard deviation). (G) Performances of models tested on mRNA changes observed after transfecting let-7c into HCT116 cells engineered to have reduced endogenous miRNA expression (37). This analysis used the average ag fit for the five miRNAs in (F). Everything else is the same as in (F).

The calculation of predicted repression required an estimate of how much a single bound AGO affected the mRNA decay rate (Fig. 5B, b), which was fit as a global value. Additionally, to account for the observation that sites in ORFs are less effective than those in 3′UTRs (3), our model included a penalty term for sites in ORFs, which was also fit as a global value (Fig. 5B). Because no appreciable repression was observed from sites in 5′UTRs, our model did not consider these sites.

Our biochemical model was fit against repression observed in HeLa cells transfected with one of five miRNAs with RBNS-derived measurements (let-7a was excluded because of its high endogenous expression in HeLa cells). A strong correspondence was observed when comparing mRNA changes measured upon miRNA transfection with those predicted by the model (fig. S7A, r2 = 0.30 to 0.37).

The overall performance of our biochemical model (Fig. 5C, r2 = 0.34) exceeded those of the 30 target-prediction algorithms (r2 ≤ 0.14) that were also tested on changes in mRNA levels observed in response to miRNA transfection (14). We reasoned that in addition to our biochemical framework and the use of experimentally measured affinity values, other aspects of our analysis might have contributed to this improvement. For example, the miRNAs chosen for RBNS have high efficacy in transfection experiments, and our RNA-sequencing (RNA-seq) datasets generally had stronger signal over background compared to microarray datasets used to train and test previous target-prediction algorithms. Indeed, when evaluated on the same five datasets, the performance of the latest TargetScan model (TargetScan7) improved from an r2 of 0.14 to an r2 of 0.25 (fig. S7B). To explore the possibility that TargetScan7 might also benefit from training on this type of improved data, we generated transfection datasets for 11 additional miRNAs and retrained TargetScan7 on the collection of 16 miRNA-transfection datasets (again omitting the let-7a dataset), putting aside one dataset each time in a 16-fold cross-validation. Training and testing TargetScan on improved datasets further increased the r2 to 0.28 for the five miRNAs with AGO-RBNS data (Fig. 5D). Nonetheless, the biochemical model still outperformed the retrained TargetScan by >20%, which showed that the use of measured affinity values in a biochemical framework substantially increased prediction performance.

Many features known to correlate with targeting efficacy were captured by our biochemical model. Indeed, the contribution of certain features, such as site type (3), predicted seed-pairing stability (29), and nucleotide identities at specific miRNA or site positions (14), are expected to be represented more accurately in the miRNA-specific Kd values of the 12-nt k-mers than when generalized across miRNAs. However, these Kd values did not fully capture other factors that influence the affinity between miRNAs and their target sites in cells, including the structural accessibility of sites within their larger mRNA contexts and the contribution of supplementary pairing to the miRNA 3′ region, which influences about 5% of sites (3). Without sufficient biochemical data quantifying these effects, we approximated their influence using scoring metrics known to correlate with miRNA targeting efficacy (13, 14) and allowed them to modify the Kd values additively in log space (i.e., linearly in free-energy space). Incorporating each of these metrics slightly improved the performance of the biochemical model, as did incorporating a score for the evolutionary conservation of the site (4), which helped account for additional unknown or imperfectly captured factors that influence targeting efficacy (fig. S7C). Simultaneously incorporating all three metrics to generate what we call the “biochemical+ model” improved the r2 by 9% to 0.37 (Fig. 5E).

To examine how well our models generalized to another cell type and to a miRNA family not used for fitting (let-7), we evaluated them on repression data collected after transfecting let-7c into HCT116 (human colon cancer) cells that had been engineered to not express endogenous miRNAs (37). Although these data had a considerably lower signal-to-noise ratio, which lowered all r2 values, our biochemical models substantially out-performed TargetScan7 (Fig. 5G). This improvement extended to predicting repression after transfecting miR-124 and miR-7 into human embryonic kidney (HEK) 293 cells (38) (fig. S8A). Additional analyses showed that the biochemical+ model performed at least as well as in vivo cross-linking immunoprecipitation sequencing (CLIP-seq) approaches in identifying the mRNAs most repressed upon miRNA transfection or most derepressed upon miRNA knockout (25, 38, 39) (fig. S8, B to D). Furthermore, for individual CLIP clusters enriched in wild type relative to miR-155 knockout, we observed a correlation between the occupancy predicted by our Kd values and the observed enrichment of the cluster [Spearman’s rank-order correlation (rs) = 0.46, p < 10−7; fig. S8E], supporting the conclusion that Kd values measured in vitro reflect intracellular AGO binding.

When provided with Kd values for only the 12-nt k-mers that contained one of the six canonical sites, the biochemical+ model captured somewhat less variance (Fig. 5F, green bars; r2 = 0.35), and conversely when provided with Kd values for only the 12-nt k-mers lacking a canonical site, the model still retained some predictive power (Fig. 5F, purple bars; r2 = 0.06, p < 10−15, likelihood-ratio test). As a control, we repeated the analysis after replacing the noncanonical sites (and their Kd values) of each miRNA with those of another miRNA, performing this shuffling and reanalysis for all 309 possible shuffle permutations. When using each of these shuffled controls, performance decreased, both when considering all sites (Fig. 5F, light blue bars) and when considering only the noncanonical sites (Fig. 5F, pink bars), as expected if the modest improvement conferred by including noncanonical sites were due, at least in part, to miRNA pairing to those sites. This advantage of cognate over shuffled noncanonical sites was largely maintained when evaluating the results for individual miRNAs (Fig. 5F). Together, our results showed that noncanonical sites can mediate intracellular repression but that their impact is dwarfed by that of canonical sites because high-affinity noncanonical sites are not highly abundant within transcript sequences. Thus, the improved performance over TargetScan achieved by the biochemical model was primarily from more accurate modeling of the effects of canonical sites.

CNN for predicting site Kd values from sequence

Our findings that binding preferences differ substantially between miRNAs and that these differences are not well predicted by existing models of RNA duplex stability in solution posed a major challenge for applying our biochemical framework to other miRNAs. Because performing AGO-RBNS for each of the known miRNAs would be impractical, we attempted to predict miRNA-target affinity from sequence using the six sets of relative Kd values and 16 miRNA-transfection datasets already in hand. Bolstered by recent successful applications of deep learning to predict complex aspects of nucleic acid biology from sequence (4043), we chose a CNN for this task.

The overall model had two components. The first was a CNN that predicted relative Kd values for the binding of miRNAs to 12-nt k-mers (fig. S9A), and the second was the previously described biochemical model that links intracellular repression with relative Kd values (Fig. 6A). The training process simultaneously tuned both the neural network weights and the parameters of the biochemical model to fit both the relative Kd values and the mRNA repression data, with the goal of building a CNN that accurately predicts the relative Kd values for all 12-nt k-mers of a miRNA of any sequence.

Fig. 6 A CNN for predicting binding affinity from sequence.

(A) Schematic of overall model architecture for training on RBNS data and transfection data simultaneously. “Loss” refers to squared loss. (B) The relationship between repression efficacy and CNN-predicted relative Kd values for the canonical sites for the 12 test miRNAs. Everything else is the same as in Fig. 3, D to I. (C) The relationship between repression efficacy and RNAduplex-predicted free-energy values (45) (top) or MIRZA scores (27) (bottom) for the canonical sites of the 12 test miRNAs. Everything else is the same as in (B). (D) Performance of the biochemical and biochemical+ models when provided the CNN-predicted relative Kd values and tested on the 12 datasets examining the effects of transfecting miRNAs into HEK293FT cells. On the left are results obtained when considering all mRNAs, and on the right are results obtained when considering mRNAs expressed in HEK293FT cells but not in HeLa cells. Everything else is the same as in Fig. 5F, except shuffling results were for 250 random permutations rather than all possible permutations. (E) Performance of the biochemical+ model on the HEK293FT test set while allowing the ag values to deviate from the optimal fitted values. (F) Relationship between fitted ag and estimated target-site abundance (29) for the guide strands of the 12 duplexes transfected into HEK293FT cells. Points are colored by the average relative Kd value of the 8mer site to each miRNA. The Spearman rs and p value for the relationship are shown.

For the CNN, we chose to include only the first 10 nucleotides of the miRNA sequence, which includes the position 1 nucleotide, the seed region, and the two downstream nucleotides that could pair to a 12-nt k-mer. Because the k-mers were not long enough to include sites with 3′ supplementary pairing, we excluded the 3′ region of the miRNA. Pairs of 10-nt truncated miRNA sequences and 12-nt k-mers were each parameterized as a 10-by-12-by-16 matrix, with the third dimension representing the 16 possible pairs of nucleotides that could be present at each pair of positions in the miRNA and target. The first layer of the CNN was designed to learn important single-nucleotide interactions, the second layer was designed to learn dinucleotide interactions, and the third layer was designed to learn position-specific information.

The training data for the CNN consisted of more than 1.5 million relative Kd values from six AGO-RBNS experiments and 68,112 mRNA expression estimates derived from 4257 transcripts in 16 miRNA transfection experiments. Five miRNAs had data in both sets. Because some repression was attributable to the passenger strands of the transfected duplexes (fig. S9B), the model considered both strands of each transfected duplex, which allowed the neural network to learn from another 16 AGO-loaded guide sequences.

To test how well the CNN-predicted relative Kd values enabled our approach to be generalized to other miRNAs and another cell type, we generated 12 miRNA-transfection datasets in HEK293FT cells, choosing miRNAs that were not appreciably expressed in HEK293 cells (44) and that had not been used in any training (fig. S10). For each miRNA duplex in the test set, the CNN was used to predict relative Kd values for 12-nt k-mers to both the miRNA and passenger strands. As observed with the experimentally derived relative Kd values (Fig. 3, D to I), substantial correspondence was observed between CNN-predicted relative Kd values for the six canonical site types of the transfected miRNAs and the mean repression that these site types conferred in cells (Fig. 6B and fig. S11). This correspondence (r2 = 0.76) substantially exceeded that observed for predictions of RNA-duplex stability in solution (45) and predictions derived from cross-linking results (27) (Fig. 6C; r2 = 0.21 and 0.56, respectively). Aside from accurately predicting the relative efficacy of sites to the same miRNA, the CNN was better able to stratify sites of the same type to different miRNAs (e.g., Fig. 6B, purple dots; r2 = 0.52, p = 0.02). Analysis of other site types suggested that the CNN had some ability to identify effective noncanonical sites for new miRNAs (fig. S11).

When the CNN-predicted Kd values and HeLa-derived global parameters were used as input for the biochemical and biochemical+ models to predict repression of individual mRNAs in HEK293FT cells, the results mirrored those observed when using relative Kd values derived from AGO-RBNS. Median (r2 = 0.21) and overall performance (r2 = 0.18) for the test set both exceeded those of TargetScan (r2 = 0.12 and 0.13, respectively); overall performance improved (r2 = 0.20) when using the biochemical+ model, implying a 50% improvement over TargetScan, and performance dropped slightly when either shuffling or omitting noncanonical sites (Fig. 6D and fig. S12A; the main exception being the results for miR-190a, for which the performance of the biochemical+ model resembled that of TargetScan when only considering the canonical sites but substantially dropped when also considering noncanonical sites). The overall improvement over TargetScan was maintained when focusing on mRNAs that were expressed in HEK293FT cells but not HeLa cells (Fig. 6D). The CNN-predicted relative Kd values also enabled the biochemical+ model to outperform TargetScan and cross-linking approaches in predicting the effects of deleting or adding a miRNA in other cellular contexts (4648) (fig. S12, B to D).

Although our models were improved over previous models, the highest r2 value achieved by our models for any of our datasets was 0.37 (Fig. 5F and fig. S12A), implying that they explained only a minority of the variability in mRNA fold changes occurring upon introducing a miRNA. However, even perfect prediction of the direct effects of miRNAs was not expected to explain all of the variability; some variability was due to the secondary effects of repressing the primary targets, and some was due to experimental noise. To estimate the maximal r2 that could be achieved by predicting the primary effects of miRNA targeting, we attempted to quantify and subtract the fraction of the fold-change variability attributable to the other two causes. For each dataset, the fraction attributable to experimental noise was estimated by examining the reproducibility between replicates in our transfection experiments, and the fraction attributable to secondary effects was inferred by assuming that primary miRNA effects only repress mRNAs, whereas secondary effects affect mRNAs in either direction (with effects distributed log normally). After accounting for these other sources of variability, the biochemical+ model provided with experimentally determined affinity values explained ~60% of the variability attributable to direct targeting (fig. S12E, median of five datasets), and when provided with CNN-predicted values, it explained ~50% of the variability attributable to direct targeting (fig. S12F, median of 12 datasets).

Insights into miRNA targeting

The observation that canonical sites are not necessarily those with the highest affinity raises the question of how canonical sites are distinguished from noncanonical ones and whether making such a distinction is useful. Our results show that two criteria readily distinguish canonical sites from noncanonical ones. First, with only one exception, all six canonical site types were identified for each of the six miRNAs (the exception being the 6mer-m8 site for miR-7), whereas the noncanonical site types were typically identified for only one miRNA and never for more than three. Second, the four highest-affinity canonical sites occupied most of the specifically bound AGO2, even for miR-124, which had the largest and highest-affinity repertoire of noncanonical sites (Figs. 1F and 2 and fig. S2, B and C). This greater role for canonical sites was presumably because perfect pairing to the seed region is the most efficient way to bind the silencing complex; to achieve equivalent affinity, the noncanonical sites must be longer and therefore less abundant. The ubiquitous function and more efficient binding of canonical sites explains why these site types have the greatest signal in meta-analyses of site conservation, thereby explaining why they were the first site types to be identified (31) and justifying the continued distinction between canonical and noncanonical site types.

The potential role of pairing to miRNA nucleotides 9 and 10 has been controversial. Although some target-prediction algorithms (such as TargetScan) do not reward pairing to these nucleotides, most algorithms assume that such pairing enhances site affinity. Likewise, although one biochemical study reports that pairing to position 9 reduces site affinity (6), another reports that it increases affinity (12). We found that extending pairing to nucleotide 9 or 10 neither enhanced nor diminished affinity in the context of seed-matched sites (Fig. 4), whereas extending pairing to nucleotide 9 or 10 enhanced affinity in the context of 3′-only sites (Fig. 2, C and D). These results support the idea that extensive pairing to the miRNA 3′ region unlocks productive pairing to nucleotides 9 to 12, which is otherwise inaccessible (1).

The biochemical parameters fit by our model provided additional insights into miRNA targeting. In the framework of our model, the fitted value of 1.8 observed for the parameter b suggested that a typical mRNA bound to an average of one silencing complex will experience a near tripling of its decay rate, which would lead to a ~60% reduction in its abundance. In the concentration regimes of our transfection experiments, this occupancy can be achieved with two to three median 7mer-m8 sites. In addition, our fitted value for the ORF-site penalty suggested that the translation machinery reduces site affinity by 5.5-fold.

Another parameter was ag, that is, the intracellular concentration of AGO loaded with the transfected miRNA and not bound to a target site. Whereas values of the other parameters could be fit globally in HeLa cells and then used for testing, ag was fit separately for each miRNA and passenger strand of each transfection experiment. Nonetheless, when ag values were allowed to deviate from the fitted values, the biochemical+ model still outperformed TargetScan in predicting test-set repression over a 100-fold range of values (Fig. 6E), which indicated that even with rough estimates of miRNA abundances, our modeling framework had an advantage over other predictive methods in new contexts. Information that might be used to more accurately estimate ag values should come with the determination of these values for more miRNAs in more cellular contexts, together with the observation that, as expected (29, 49), fitted ag values are higher for miRNAs with lower predicted target abundance and lower general affinity for their targets (Fig. 6F).

Our work replaced the correlative models of targeting efficacy with a principled biochemical model that explains and predicts about half of the variability attributable to the direct effects of miRNAs on their targets, raising the question of how the understanding and prediction of miRNA-mediated repression might be further improved. Acquiring site-affinity profiles for additional miRNAs with diverse sequences will improve the CNN-predicted miRNA-mRNA affinity landscape and further flesh out the two major sources of targeting variability revealed by our study, that is, the widespread differences in site preferences observed for different miRNAs and the substantial influence of local (12-nt) site context. We suspect additional improvement will come with increased ability to predict the other major cause of targeting variability, which is the variability imparted by mRNA features more distant from the site. This variability is captured only partially by the three features added to the biochemical model to generate the biochemical+ model. Perhaps the most promising strategy for accounting for these more distal features will be an unbiased machine-learning approach that uses entire mRNA sequences to predict repression, leveraging substantially expanded repression datasets as well as site-affinity values. In this way, the complete regulatory landscape, as specified by AGO within this essential biological pathway, might ultimately be computationally reconstructed.

Methods summary

AGO2-miRNA complexes were generated by adding synthetic miRNA duplexes to lysate from cells that overexpressed recombinant AGO2, and then these complexes were purified on the basis of affinity to the miRNA seed. RNA libraries were generated by in vitro transcription of synthetic DNA templates. For AGO-RBNS, purified AGO2-miRNA complex was incubated with a large excess of library molecules, and after reaching binding equilibrium, library molecules bound to AGO2-miRNA complex were isolated and prepared for high-throughput sequencing. Examination of k-mers enriched within the bound library sequences identified miRNA target sites, and relative Kd values for each of these sites were simultaneously determined by maximum likelihood estimation, fitting to AGO-RBNS results obtained over a 100-fold range in AGO2-miRNA concentration.

Intracellular miRNA-mediated repression was measured by performing RNA-seq on HeLa cells that had been transfected with a synthetic miRNA duplex. For sites that were sufficiently abundant in endogenous 3′UTRs, efficacy was measured on the basis of their influence on levels of endogenous mRNAs of HeLa cells. Site efficacy was also evaluated using massively parallel reporter assays, which provided information for the rare sites as well as the more abundant ones. The biochemical and biochemical+ models of miRNA-mediated repression were constructed and fit using the measured Kd values, and the repression of endogenous mRNAs was observed after transfecting miRNAs into HeLa cells. The CNN was built using TensorFlow, trained using the measured Kd values and the repression observed in the HeLa transfection experiments, and tested on the repression of endogenous mRNAs observed after transfecting miRNAs into HEK293T cells. Results were also tested on external datasets examining either intracellular binding of miRNAs by CLIP-seq or repression of endogenous mRNAs after miRNAs had been transfected, knocked down, or knocked out. The details of each of these methods are described in the supplementary materials.

Supplementary Materials

science.sciencemag.org/content/366/6472/eaav1741/suppl/DC1

Materials and Methods

Figs. S1 to S12

Tables S1 and S2

References (5055)

Data S1 to S3

References and Notes

Acknowledgments: We thank K. Heindl, T. Eisen, and T. Bepler for helpful discussions; Y. Zhou for providing processed CLIP data from miR-20a overexpression; and members of the Bartel lab for comments on this manuscript. Funding: This work was supported by NIH grants GM118135 (D.P.B.) and GM123719 (N.B.). D.P.B. is an investigator of the Howard Hughes Medical Institute. Author contributions: S.E.M. developed AGO-RBNS and associated analyses, which he implemented with help from T.M.P. and N.B. K.S.L. devised and implemented the biochemical model and CNN. C.Y.S., G.M.K., and T.M.P. performed transfection and sequencing experiments. C.Y.S. and S.E.M. designed and performed the massively parallel reporter assay. S.E.M., K.S.L., and D.P.B. designed the study and wrote the manuscript with input from other authors. Competing interests: The authors declare no competing interests. Data and materials availability: Sequencing data are available in the Gene Expression Omnibus (accession number GSE140220), and computational tools are deposited in GitHub (https://github.com/smcgeary/agorbns and https://github.com/kslin/miRNA_models).

Stay Connected to Science

Navigate This Article