Causes and Effects of N-Terminal Codon Bias in Bacterial Genes


Science  25 Oct 2013:
Vol. 342, Issue 6157, pp. 475-479
DOI: 10.1126/science.1241934

Exploiting Redundancy

The genetic code is redundant: multiple codons can code for the same amino acid. So-called synonymous codon changes within genes can nonetheless have substantial effects on protein expression, which have been attributed to changes in the structure of the 5′ region of messenger RNAs, among other factors. Goodman et al. (p. 475, published online 26 September) built and measured the expression of a synthetic library of 14,000 variant N-terminal sequences of 137 Escherichia coli genes to show that, unexpectedly, rare codons had a bigger effect on increasing protein expression than more common codons. RNA structure downstream of translation initiation, rather than codon rarity itself, appeared to represent the major determinant of expression differences owing to codon usage.


Most amino acids are encoded by multiple codons, and codon choice has strong effects on protein expression. Rare codons are enriched at the N terminus of genes in most organisms, although the causes and effects of this bias are unclear. Here, we measure expression from >14,000 synthetic reporters in Escherichia coli and show that using N-terminal rare codons instead of common ones increases expression by ~14-fold (median 4-fold). We quantify how individual N-terminal codons affect expression and show that these effects shape the sequence of natural genes. Finally, we demonstrate that reduced RNA structure and not codon rarity itself is responsible for expression increases. Our observations resolve controversies over the roles of N-terminal codon bias and suggest a straightforward method for optimizing heterologous gene expression in bacteria.

Codon usage is biased in natural genes and can strongly affect heterologous expression (1). Many organisms are enriched for poorly adapted codons at the N terminus of genes (2–5). Several studies suggest that these codons slow ribosomal elongation during initiation and lead to increased translational efficiency (2, 4, 6). Most organisms also display reduced mRNA secondary structure at the N terminus (7), and studies using synthetic codon gene variants have resulted in conflicting theories on which mechanisms are causal for expression changes (7, 8). Information about the causes and effects of codon bias has been restricted to relations inferred from natural sequences using genome-wide correlation (2, 3, 5, 9, 10), conservation among species (4), or relatively small libraries of synthetic genes with synonymous codon changes (3, 8, 11–15). Here, we separate and quantify the factors controlling expression at the N terminus of genes in Escherichia coli by building and measuring expression from a large synthetic library of defined sequences.

We used array-based oligonucleotide libraries (16) to generate 14,234 combinations of promoters, ribosome binding sites (RBSs), and 11 N-terminal codons in front of super-folder green fluorescent protein (sfGFP) on a plasmid that constitutively coexpresses mCherry (fig. S1) (17–19). The sequences for the N-terminal peptides correspond to the first 11 amino acids (including the initiating methionine) of 137 endogenous E. coli essential genes (20) that utilize the entire codon repertoire (fig. S2). We expressed these sfGFP fusions from two promoters and three RBSs of varying strengths (19). We also included the natural RBS for each endogenous gene. For each combination of promoter, RBS, and peptide sequence, we designed a set of 13 codon variants to represent a wide range of codon usages and secondary structure free energies across the translation initiation region. We studied the interactions between the 5′ untranslated region (UTR) and N-terminal codon usage because initiation is thought to be the rate-limiting step for translation (1), this region has been previously implicated in determining most expression variation (8), N-terminal codons are more highly conserved (21), and rare codons are enriched at the N terminus of natural genes and especially those that are highly expressed (2).

We measured DNA, RNA, and protein levels from the entire library using a multiplex assay (Fig. 1C and figs. S3 and S4) (19). DNA and RNA levels were determined using DNA sequencing (DNASeq) and RNASeq. Protein levels were determined by FlowSeq; 7327 (51.5%) constructs were within the quantitative range of our assay [coefficient of determination (R²) = 0.955, P < 2 × 10⁻¹⁶] (fig. S5). We normalized the expression measurements across each 13-member codon variant set as fold change from log-average to control for changes in promoters, RBSs, and peptide sequence (fig. S6).
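The normalization described above can be illustrated with a minimal sketch, assuming that "fold change from log-average" means each construct's level divided by the geometric mean of its codon variant set, reported on a log2 scale. The function name and the example values are hypothetical and are not taken from the study.

```python
import math

def log2_fold_change_from_log_average(levels):
    """Return log2(level / geometric mean) for each construct in one variant set."""
    if not levels or any(v <= 0 for v in levels):
        raise ValueError("expression levels must be positive and non-empty")
    # The mean of the logs is the log of the geometric mean.
    mean_log = sum(math.log2(v) for v in levels) / len(levels)
    return [math.log2(v) - mean_log for v in levels]

# Made-up sfGFP:mCherry ratios for a three-member set (real sets have 13 members).
example_set = [1.0, 4.0, 16.0]
normalized = log2_fold_change_from_log_average(example_set)
```

Because every construct in a set shares the same promoter, RBS, and peptide, subtracting the set's log-average leaves only the variation attributable to codon choice.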

Fig. 1 Gene expression measurements of the reporter library.

(A) N-terminal peptide sequences encoding the most rare (R) codon variants show increased expression when compared to the most common ones (C). (B) Fold change in expression between C and R codon variants is largely independent of RBS strength. WT, wild type. (C) Protein expression of the library (as measured by the sfGFP:mCherry ratio) covers a ~200-fold range. The 13-member codon variant sets are grouped into columns by promoter/RBS combination (top). Codon variants include C, R, wild-type sequence (wt), and 10 sequences with varying secondary structure (∆G). Not shown are two additional low-promoter panels, which were mostly outside the quantitative FlowSeq range. Dark gray squares had insufficient data, and light gray squares correspond to duplicate constructs.

Changing synonymous codon usage in the 11–amino acid N-terminal peptide resulted in a mean 60-fold increase in protein abundance from the weakest to strongest codon variant even though >96% of the gene remained unchanged. For over 160 codon variant sets (25% of sets within range), the difference was >100-fold. For each codon variant set, we included sequences encoding the most common or rare synonymous codon in E. coli for every amino acid. The rare codon constructs displayed a mean 14-fold (median 4-fold) increase in protein abundance compared with common codon constructs (Fig. 1A) (P < 2 × 10⁻¹⁶, two-tailed t test) even though common codons are generally thought to increase protein expression and fitness (1, 9, 22, 23).
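As a rough illustration of how the most common (C) and most rare (R) variants of a peptide could be constructed, the sketch below picks the highest- or lowest-frequency synonymous codon for each residue. The usage table holds placeholder values for three amino acids only, not measured E. coli frequencies, and the helper name `recode` is hypothetical.

```python
# Toy synonymous-codon frequencies; placeholder values, NOT measured E. coli usage.
TOY_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "L": {"CTG": 0.50, "TTA": 0.13, "CTA": 0.04},
}

def recode(peptide, usage, pick):
    """Encode a peptide, choosing one codon per residue with pick (max or min)."""
    codons = []
    for residue in peptide:
        freqs = usage[residue]
        # pick iterates over codon names and ranks them by frequency.
        codons.append(pick(freqs, key=freqs.get))
    return "".join(codons)

common_variant = recode("MKL", TOY_USAGE, max)  # most common codon per residue
rare_variant = recode("MKL", TOY_USAGE, min)    # rarest codon per residue
```

Both variants encode the same peptide, so any expression difference between them reflects the nucleotide sequence rather than the protein.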

To understand why rare codons cause increased expression, we first examined several codon usage metrics, but they could only explain <5% of expression differences (fig. S7A). New metrics that take into account both transfer RNA (tRNA) availability and usage [normalized translational efficiency (nTE)] show stronger N-terminal enrichment (4). We calculated nTE scores for E. coli and found that nTE scores were similar to the tRNA adaptation index (tAI) (R² = 0.847, P < 2 × 10⁻¹⁶), did not correlate well with N-terminal codon enrichment in the E. coli genome (R² = 0.107, P = 0.00654), and did not significantly correlate with codons that increased protein expression in our data set (R² = 0.024, P = 0.124). Others have proposed that slow ribosome progression at the N terminus due to rare codons increases translational efficiency (2, 13, 14). This “codon ramp” hypothesis should apply primarily in the context of strong translation, but we found that using rare codons at the N terminus increases expression regardless of translation strength (Fig. 1B). Finally, ribosome occupancy profiling in E. coli has shown that tRNA abundance does not correlate to translation rate but that specific rare codons can create internal Shine-Dalgarno–like motifs that can alter translational efficiency (6). We looked for an association between the presence of internal Shine-Dalgarno–like motifs and changes in expression, and found it to be weak but statistically significant (R² = 0.002, P < 1.3 × 10⁻⁵).

We built a simple linear regression model correlating the use of each individual synonymous codon with expression changes (Fig. 2A and fig. S8). For most amino acids, we found a link between the rarity of the codon and increased expression (Fig. 2B). There is a strong correlation between codons that affected expression and their relative N-terminal enrichment in E. coli (R² = 0.73, P < 2.3 × 10⁻⁹) (Fig. 2C). Using relative translation efficiency instead of relative expression produced similar results (fig. S9).
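One way to picture a per-codon regression of this kind is sketched below, assuming that the predictor is the number of in-frame occurrences of a codon in the N-terminal sequence and the response is the normalized log fold change in expression. The exact model specification is not given in the text, so this is an illustrative assumption; the function names and toy data are hypothetical.

```python
def count_codon(seq, codon):
    """Count in-frame occurrences of codon in a coding sequence."""
    return sum(1 for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] == codon)

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def codon_slope(seqs, log_fold_changes, codon):
    """Slope linking how often a codon is used to the change in expression."""
    counts = [count_codon(s, codon) for s in seqs]
    return ols_slope(counts, log_fold_changes)

# Made-up variants of one peptide (M-K-K) with made-up log fold changes.
toy_seqs = ["ATGAAAAAA", "ATGAAAAAG", "ATGAAGAAG"]
toy_lfc = [1.0, 0.0, -1.0]
slope_aaa = codon_slope(toy_seqs, toy_lfc, "AAA")
slope_aag = codon_slope(toy_seqs, toy_lfc, "AAG")
```

A positive slope means that using the codon more often goes with higher expression within a variant set; a negative slope means the opposite.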

Fig. 2 Rare codons generally increase expression levels.

(A) The average fold change in expression is correlated with the choice of codon. The y axis is the slope of a linear model linking codon use to expression change. Codons are sorted left to right by increasing genomic frequency and colored according to their relative synonymous codon usage (RSCU) in E. coli. (P values after Bonferroni correction: *P < 0.05, **P < 0.005, ***P < 0.001). (B) The individual codon slopes (y axis) as in (A) show an inverse relationship with RSCU (x axis). (C) The individual codon slopes correlate with enrichment of codons at the N terminus of genes in E. coli.
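For readers unfamiliar with RSCU, the sketch below uses the standard definition: a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The counts are made up for illustration, and the function handles one synonymous family at a time.

```python
def rscu(family_counts):
    """RSCU for one synonymous family: observed count / mean count in the family."""
    total = sum(family_counts.values())
    if total == 0:
        raise ValueError("family has no observed codons")
    expected = total / len(family_counts)
    return {codon: n / expected for codon, n in family_counts.items()}

# Made-up counts for the two lysine codons.
toy_lysine = rscu({"AAA": 30, "AAG": 10})
```

Values above 1 mark codons used more often than expected under equal usage, and values below 1 mark relatively rare codons.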

Decreased GC-content correlated with increased protein expression (R² = 0.12, P < 2 × 10⁻¹⁶) (Fig. 3A). Rare codons in E. coli are frequently A/T-rich at the third position, and codons ending in A/T more frequently correlate with increased expression than synonymous codons ending in G/C (fig. S10). This association suggested a link to mRNA transcript secondary structure (8), and so we computationally predicted RNA structure over the first 120 bases of each transcript using NUPACK (24). We found that increased secondary structure was correlated with decreased expression, which explained more variation than any other variable we measured (R² = 0.34, P < 2 × 10⁻¹⁶) (Fig. 3B). We made a similar linear regression model relating individual codon substitution to change in secondary structure free energy, rather than expression levels, and found a strong correlation between codons that decreased secondary structure and those that increased protein expression (R² = 0.87, P < 2 × 10⁻¹⁶) (Fig. 3C). In addition, codon adaptation metrics at the N terminus correlate as well to change in secondary structure free energy as they do to change in protein expression (fig. S7B).
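The two sequence features discussed here, overall GC content and A/T usage at the third codon position, are simple to compute. The sketch below is illustrative only: the function names and the example sequence are hypothetical, and the structure prediction step (NUPACK) is not reproduced.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return sum(base in "GC" for base in seq) / len(seq)

def third_position_at_fraction(cds):
    """Fraction of codons in an in-frame coding sequence that end in A or T."""
    thirds = cds.upper()[2::3]  # every third base, starting at codon position 3
    if not thirds:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "AT" for base in thirds) / len(thirds)

# Made-up three-codon sequence: ATG AAA GCG.
toy_cds = "ATGAAAGCG"
toy_gc = gc_fraction(toy_cds)
toy_at3 = third_position_at_fraction(toy_cds)
```

Comparing these values across synonymous variants of the same peptide shows how codon choice alone shifts base composition.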

Fig. 3 Rare codons alter expression by reducing mRNA secondary structure.

(A) Expression changes are correlated with relative changes in %GC content. Each boxplot includes ±2% of centered value. (B) Expression increases correlate to relative increases in free energy of folding at the front of the transcript (ΔΔG). Each boxplot includes ±2 kcal/mol of centered value. (C) Individual codon slopes (same as Fig. 2A y axis) correlate with the ΔΔG per individual codon substitution. (D) After controlling for ΔΔG with a multiple linear regression, there is no longer any relation between individual codon slopes and RSCU (compare with Fig. 2B). (E) The ΔΔG versus change in tAI is plotted for all constructs within the quantitative range. Constructs are colored by their relative fold change in expression from the average codon variant within the set. (F) Subsets of constructs corresponding to the shaded boxes in (E). (Left) Points with constant codon adaptation and varied secondary structure, (right) points with constant secondary structure and varied codon adaptation.

We used multiple regression to control for the secondary structure changes between codon variants and found that no relation remained between N-terminal codon adaptation and increased expression (R² = 0.05, P = 0.197) (Fig. 3D). Additionally, constructs with constant tAI still show a correlation between expression and secondary structure, but constructs with constant secondary structure have no correlation between tAI and expression (Fig. 3, E and F). Finally, if secondary structure is the dominant factor, we would expect a disproportionate enrichment of A over T due to G-U wobble pairing. Indeed, nucleotide triplets with A at the wobble position were more consistently correlated with expression in our data set and with enrichment at the N terminus of E. coli genes (fig. S11).
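The logic of controlling for one variable with multiple regression can be sketched with a two-predictor least-squares fit. This is a generic illustration and not the authors' model: the function name is hypothetical, and the toy data are constructed so that expression depends only on ΔΔG while RSCU merely correlates with ΔΔG.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * c for a, c in zip(c1, cy))
    s2y = sum(b * c for b, c in zip(c2, cy))
    # Solve the 2x2 centered normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear or constant")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Made-up data: the expression slope depends on ddG only (0.3 per unit).
toy_ddg = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_rscu = [0.5, 1.5, 1.0, 2.5, 2.0]
toy_expr = [0.3 * d for d in toy_ddg]
b0, b_ddg, b_rscu = fit_two_predictors(toy_expr, toy_ddg, toy_rscu)
```

In this toy data RSCU is positively correlated with ΔΔG, so a one-variable fit would assign it a nonzero slope; once ΔΔG is included as a second predictor, the RSCU coefficient falls to approximately zero.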

Kudla et al. show that local RNA structure in the region between –4 and +38 of translation start is most correlated with expression change (8). Our data indicate that the region centered on +10 is most correlated with expression changes (Fig. 4 and figs. S12 to S14), which closely matches in vitro translation studies (25). This region remained the most correlated for the subset of constructs with no change in total free energy of folding across the N-terminal region (figs. S15 and S16). Although secondary structure is known to affect the RBS (26), when only codon usage is altered, RNA structure after the start codon, and not at the RBS, is the major contributor to expression differences. A multiple linear regression model that combines promoter and RBS choice, as well as N-terminal secondary structure and GC content, still explains only 54% of variation in expression levels. Amino acid composition effects on sfGFP folding and inadequacies in computational RNA structure prediction could be partially responsible. However, there are likely additional effects left to uncover, and the extent to which codon usage beyond the N-terminal region alters gene expression remains unresolved (8, 14).

Fig. 4 mRNA structure downstream of start codon is most correlated with reduced expression.

Relative hybridization probabilities averaged in 10-nucleotide windows are plotted against their correlation with expression change as a function of position (–20 to +60 from ATG). (Top) The best and worst 5% of constructs (as ranked by relative expression within a codon variant set) are grouped and plotted as blue and red ribbons, respectively. The ribbon tops and bottoms are one standard deviation from the mean, which is shown as a solid line. (Bottom) The P value for linear regressions, correlating hybridization probabilities within each window to expression fold change in all constructs.
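The windowed averaging mentioned in this caption can be sketched as follows, assuming windows of 10 nucleotides that slide one base at a time; whether the original analysis used sliding or non-overlapping windows is not stated here. The function name and the input values are hypothetical.

```python
def window_means(probabilities, window=10):
    """Mean of each run of `window` consecutive positions, sliding by one base."""
    if window <= 0 or window > len(probabilities):
        raise ValueError("window must be between 1 and the number of positions")
    return [
        sum(probabilities[i:i + window]) / window
        for i in range(len(probabilities) - window + 1)
    ]

# Made-up per-base pairing probabilities: 10 unpaired bases, then 10 paired bases.
toy_probs = [0.0] * 10 + [1.0] * 10
toy_windows = window_means(toy_probs)
```

Each windowed value can then be regressed against expression change across constructs to find the positions where structure matters most.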

The N terminus of genes in almost all bacteria displays reduced secondary structure, but enrichment of poorly adapted N-terminal codons is only found in bacteria with GC content of at least 50% (3). Recent work further shows that AT-rich codons as opposed to rare codons themselves are preferentially selected, which implicates secondary structure as the driving force for N-terminal codon selection in most bacteria (5). Despite mechanistic differences in translation between prokaryotes and eukaryotes, both single- and multicell eukaryotes also have reduced N-terminal secondary structure (7). For synthetic GFP templates in yeast, secondary structure is more correlated with expression changes than codon adaptation metrics (10). Here, we do not examine other factors that might shape natural sequence, such as codon pair bias (1, 27), cotranslational folding (4, 12, 28), or growth conditions (11, 15). Natural genomic sequence is often not suited to distinguish between conflicting hypotheses of how sequence affects function; multiplexed assays of large synthetic DNA libraries provide a powerful method to examine such hypotheses in a controlled manner.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S16

Table S1

References (29–35)

References and Notes

  1. Acknowledgments: We thank J. C. Way, E. R. Daugharthy, and R. T. Sauer for comments. The research was supported by the U.S. Department of Energy (DE-FG02-02ER63445 to G.M.C.), NSF SynBERC (SA5283-11210 to G.M.C.), Office of Naval Research (N000141010144 to G.M.C. and S.K.), Agilent Technologies, Wyss Institute, and an NSF Graduate Research Fellowship to D.B.G. Data can be accessed on the National Center for Biotechnology Information, NIH, Sequence Read Archive (SRA) (SRP029609). pGERC reporter can be obtained from AddGene (#47441). Accession numbers: The Project accession at the SRA is SRP029609. The sample accession is SRS477429. There are three experiments, one for DNA, one for RNA, one for FlowSeq: RNA, SRX346948; DNA, SRX346944; and FlowSeq, SRX346268.