Coding-Sequence Determinants of Gene Expression in Escherichia coli


Science  10 Apr 2009:
Vol. 324, Issue 5924, pp. 255-258
DOI: 10.1126/science.1170160


Synonymous mutations do not alter the encoded protein, but they can influence gene expression. To investigate how, we engineered a synthetic library of 154 genes that varied randomly at synonymous sites, but all encoded the same green fluorescent protein (GFP). When expressed in Escherichia coli, GFP protein levels varied 250-fold across the library. GFP messenger RNA (mRNA) levels, mRNA degradation patterns, and bacterial growth rates also varied, but codon bias did not correlate with gene expression. Rather, the stability of mRNA folding near the ribosomal binding site explained more than half the variation in protein levels. In our analysis, mRNA folding and associated rates of translation initiation play a predominant role in shaping expression levels of individual genes, whereas codon bias influences global translation efficiency and cellular fitness.

The theory of codon bias posits that preferred codons correlate with the abundances of iso-accepting tRNAs (1, 2) and thereby increase translational efficiency (3) and accuracy (4). Recent experiments have revealed other effects of silent mutations (5–7). We synthesized a library of green fluorescent protein (GFP) genes that varied randomly in their codon usage, but encoded the same amino acid sequence (8). By placing these constructs in identical regulatory contexts and measuring their expression, we isolated the effects of synonymous variation on gene expression.

The GFP gene consists of 240 codons. For 226 of these codons, we introduced random silent mutations in the third base position, while keeping the first and second positions constant (Fig. 1A). The resulting synthetic GFP constructs differed by up to 180 silent substitutions, with an average of 114 substitutions between pairs of constructs (Fig. 1B and figs. S1 and S2). The range of third-position GC content (GC3) across the library of constructs encompassed virtually all (99%) of the GC3 values among endogenous Escherichia coli genes, and the variation in the codon adaptation index (CAI) (9) contained most (96%) of the CAI values of E. coli genes (Fig. 1).
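The randomization described above can be illustrated with a minimal sketch. It assumes uniform random choice among the third-position bases that preserve the amino acid; the actual library was built from degenerate oligonucleotides (Fig. 1A), so the real sampling need not have been uniform. The function names and the `fixed_codons` parameter (standing in for the codons that were left unmutated) are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (third base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate a coding sequence whose length is a multiple of 3."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def third_base_choices(codon):
    """Third-position bases that keep the first two bases and the amino acid unchanged."""
    aa = CODON_TABLE[codon]
    return [b for b in BASES if CODON_TABLE[codon[:2] + b] == aa]


def randomize_third_positions(seq, rng, fixed_codons=frozenset()):
    """Return a synonymous variant that differs from seq only at third codon positions.

    fixed_codons holds 0-based codon indices that are copied unchanged (hypothetical
    parameter; the text says only that 226 of 240 codons were randomized).
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    out = []
    for idx in range(0, len(seq), 3):
        codon = seq[idx:idx + 3]
        if idx // 3 in fixed_codons:
            out.append(codon)
        else:
            out.append(codon[:2] + rng.choice(third_base_choices(codon)))
    return "".join(out)


# Toy example on a short made-up coding sequence (not the actual GFP gene).
toy = "ATGAGTAAAGGAGAAGAACTT"
variant = randomize_third_positions(toy, random.Random(0), fixed_codons={0})
```

Because only third positions change and each change is restricted to the same amino acid, every variant translates to the same protein. Synonymous codons that differ at the first or second position (for example, some leucine, serine, and arginine codons) are never interchanged under this scheme.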

Fig. 1.

Synthetic library of GFP genes with randomized codon usage. (A) Degenerate oligonucleotides were mixed and assembled by polymerase chain reaction. Fragments were then cloned, sequenced, and assembled into complete GFP genes. Red indicates third-codon positions. Degenerate symbols are as follows: D (A or G or T); H (A or C or T); N (A or C or G or T); R (A or G); and Y (C or T). (B) Example alignment illustrating sequence diversity among 15 synthetic genes. Shaded boxes indicate first and second codon positions, which are conserved across the library. (C and D) The distribution of GC3 and CAI among the 154 synthetic GFP genes (C) is representative of the diversity among the 4288 endogenous E. coli genes (D).

We expressed the GFP genes in E. coli using a T7-promoter vector, and we quantified expression by spectrofluorometry. Fluorescence levels varied 250-fold across the library, and they were highly reproducible for each GFP construct (Spearman r = 0.98 between biological replicates) (fig. S3). Fluorescence variation was consistent across a broad range of experimental conditions (fig. S4). An alternative plasmid with a bacterial promoter reduced overall expression levels, but the correlation between the two expression systems remained high (r = 0.9) (fig. S4). A similar pattern of fluorescence variation was observed in fluorescence-activated cell sorting measurements (fig. S5). Because the encoded protein sequence was identical for all genes, we attributed fluorescence variation to differences in protein levels. This was confirmed by strong correlations between fluorescence and total GFP levels in Western blots (fig. S5) and Coomassie staining (r = 0.9, P < 10^–15).

To test the theory that E. coli translation rates and eventual protein levels depend on the concordance between codon usage and cellular tRNA abundances (10–12), we compared codon usage to fluorescence among the 154 synonymous GFP variants. Notably, neither of the two most common measures of codon bias, the CAI or the frequency of optimal codons (3), was significantly correlated with fluorescence levels (r = 0.14, P = 0.09, and r = 0.11, P = 0.16, respectively) (Fig. 2A). Moreover, some of the most highly expressed genes featured low CAI and vice versa.
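For readers unfamiliar with the CAI, the following sketch shows the commonly used definition: each codon receives a relative adaptiveness weight (its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of the weights of its codons. The reference counts below are invented for illustration, and the pseudocount for unobserved codons and the exclusion of single-codon amino acids are assumptions of this sketch rather than details taken from the text.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def relative_adaptiveness(reference_counts, pseudocount=0.5):
    """Weight each codon by its count relative to the most used synonymous codon."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons and single-codon amino acids (Met, Trp) carry no information
        # Unobserved codons get a small pseudocount so that log(weight) stays finite (assumption).
        counts = {c: reference_counts.get(c, 0) or pseudocount for c in family}
        top = max(counts.values())
        for c, n in counts.items():
            weights[c] = n / top
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights over all scored codons in seq."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Invented reference counts for two amino acids (glycine and lysine), for illustration only.
toy_counts = {"GGT": 100, "GGC": 50, "GGA": 25, "GGG": 25, "AAA": 80, "AAG": 20}
toy_weights = relative_adaptiveness(toy_counts)
best = cai("GGTAAA", toy_weights)   # uses only the most frequent codons
worse = cai("GGCAAG", toy_weights)  # geometric mean of 0.5 and 0.25
```

A gene built entirely from the most frequent codons scores 1, and any less frequent synonymous codon lowers the score without changing the encoded protein.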

Fig. 2.

The determinants of gene expression. (A) Codon adaptation was not significantly correlated with fluorescence among the 154 GFP constructs (r = 0.14, P = 0.09). (B) Predicted 5′ mRNA folding energy was strongly correlated with fluorescence (r = 0.66, P < 10^–15). For each construct, folding energy was calculated in a window spanning positions –4 to +37 relative to translation start; two sample structures are shown. (C) Sliding window analysis of mRNA folding and fluorescence. Local mRNA folding energies were calculated in a sliding window of length 42 nt. The significance of the correlation between local folding energy and fluorescence (negative log10 P value) is plotted as a function of window position along the sequence. Note the overlapping locations of the 30-nt ribosome–binding site (blue bar) and the window of strongest correlation between folding energy and fluorescence (partially overlapping red bar, nt –4 through nt +37).

Although codon adaptation near the 5′ terminus is considered particularly important for expression (12, 13), the CAI value of the first 42 bases in a GFP gene was not significantly correlated with the gene's fluorescence intensity (r = 0.1, P = 0.2). Similarly, the number of rare codons (sites with CAI < 0.1) in a sequence was not significantly correlated with fluorescence (r = –0.02, P = 0.7), and neither was the number of pairs of consecutive rare codons (r = –0.14, P = 0.09). Although specific consecutive codon pairs have been proposed to influence translation (14, 15), the frequency of such rare pairs in a gene was not significantly correlated with its fluorescence (r = 0.07, P = 0.35) (8).
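The rare-codon counts used in these tests can be sketched as follows. The sketch interprets "sites with CAI < 0.1" as codons whose relative adaptiveness weight falls below 0.1, which is an assumption, and it also reports the longest run of consecutive rare codons. The function name and the toy weights are hypothetical.

```python
def rare_codon_stats(seq, weights, threshold=0.1):
    """Count rare codons, adjacent rare-codon pairs, and the longest rare-codon run.

    weights maps codons to relative adaptiveness values; codons missing from
    the mapping are treated as not rare (assumption of this sketch).
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    rare = [weights.get(c, 1.0) < threshold for c in codons]
    n_rare = sum(rare)
    # A pair is counted whenever codons i and i+1 are both rare.
    n_pairs = sum(1 for a, b in zip(rare, rare[1:]) if a and b)
    longest = current = 0
    for flag in rare:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return n_rare, n_pairs, longest


# Invented weights for illustration only.
toy_weights = {"AGG": 0.05, "CTA": 0.08, "CTG": 1.0, "AAA": 1.0}
stats = rare_codon_stats("CTGAGGCTAAAAAGG", toy_weights)
```

In the toy sequence the codons are CTG, AGG, CTA, AAA, and AGG, so three are rare, one adjacent pair is rare, and the longest run has length two.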

Statistical analyses of which nucleotide positions influenced gene expression (fig. S6) indicated the importance of local sequence patterns, as opposed to global codon bias. This pattern is consistent with studies of base content (16, 17), which suggest that mRNA structure may shape expression levels (18–21). Therefore, for each GFP construct, we computed the predicted minimum free energy associated with the secondary structure of its entire mRNA or specific regions of its mRNA. The folding energy of the entire mRNA was not significantly correlated with fluorescence (r = 0.16, P = 0.051), but the folding energy of the first third of the mRNA was strongly correlated: mRNAs with stronger structure produced lower fluorescence (r = 0.60, P < 10^–15). A moving window analysis identified a region, from nucleotide (nt) –4 to +37 relative to start, for which predicted folding energy explained 44% of the variation in fluorescence levels across the GFP library (r = 0.66, P < 10^–15) (Fig. 2B). The same folding energies explained 59% of fluorescence variation when constructs were expressed using a bacterial promoter (r = 0.77, P < 4 × 10^–16) (fig. S7). mRNA folding also correlated with fluorescence in a separate analysis of GFP constructs differing by single mutations (8).
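The moving window analysis can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: it uses Spearman rank correlation (named in the text only for replicate reproducibility), and it takes the folding-energy calculation as a caller-supplied function `energy_fn`, because this section does not name the folding software. With the ViennaRNA Python bindings installed, `lambda s: RNA.fold(s)[1]` would be one possible choice. The toy energy function in the example is a crude stand-in that simply counts G and C bases.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sliding_window_correlation(sequences, expression, energy_fn, window=42, step=1):
    """Correlate window folding energy with expression at each window start (0-based)."""
    length = min(len(s) for s in sequences)
    results = []
    for start in range(0, length - window + 1, step):
        energies = [energy_fn(s[start:start + window]) for s in sequences]
        results.append((start, spearman(energies, expression)))
    return results


def toy_energy(segment):
    """Stand-in for a folding energy: more G/C gives a more negative value (assumption)."""
    return -float(segment.count("G") + segment.count("C"))


toy_seqs = ["GGGGATAT", "GGGAATAT", "GGAAATAT", "GAAAATAT"]
toy_expr = [1.0, 2.0, 3.0, 4.0]
profile = sliding_window_correlation(toy_seqs, toy_expr, toy_energy, window=4)
```

The percentages of variation explained quoted above follow from squaring the correlation coefficients: 0.66 squared is about 0.44 and 0.77 squared is about 0.59.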

The strong correlation between mRNA folding and fluorescence suggests the simple mechanistic explanation that tightly folded messages obstruct translation initiation and thereby reduce protein synthesis (22). Predicted mRNA structures for highly expressed GFPs characteristically contained many unpaired nucleotides near the start codon, whereas constructs expressed at low levels featured long hairpin loops (Fig. 2B and fig. S8), consistent with known obstructions to initiation (22). The region of strongest correlation between folding energy and expression did not overlap with the Shine-Dalgarno (SD) sequence, which suggested that SD occlusion by secondary structure (22, 23) did not play a major role in inhibiting expression, probably because our constructs contained no noncoding mutations. By contrast, the region of strongest effect overlapped significantly with the 30-nt ribosome–binding site centered around the start codon (Fig. 2C).

In a multiple regression, mRNA folding energy near the start codon (nt –4 through +37) explained nearly 10 times as much variation in expression levels as any other predictor variable, including the global GC content, CAI, the number of rare-codon sites or consecutive pairs, the length of the longest rare-codon stretch, the number of predicted transcription termination signals, the propensity for conformation changes into Z-DNA, and the number of predicted ribonuclease (RNase) E cleavage sites (8). RNase E cleavage sites tended to reduce expression, as expected (24), and explained 4.7% of fluorescence variation.

Although global GC content was not significantly correlated with fluorescence (r = –0.031, P = 0.7), GC content near the start codon was strongly correlated. But this was likely mediated by mRNA secondary structure; GC content was itself correlated with folding energy, and folding energy explained 10 times as much variation in fluorescence as was explained by GC content (8).

GFP mRNA levels, as quantified by Northern blotting, varied across the library, but the extent of mRNA variation was three times smaller than that of corresponding fluorescence variation. We also observed 3′-truncated mRNA species that differed among GFP variants, which likely reflected different stabilities of mRNA degradation intermediates (fig. S9). mRNA levels were highly correlated with fluorescence (r = 0.53) and also with folding energy near the start codon (r = 0.33). These relations are consistent with the hypothesis that secondary structure influences both mRNA and protein levels through occlusion of ribosome subunit binding. Reduced ribosome binding increases mRNA exposure to nuclease digestion, which in turn decreases stability (25).

Bacterial growth rates were strongly influenced by the codon usage of the expressed GFP construct (8). Elevated CAI was correlated with faster growth (r = 0.54, P < 9 × 10^–13), whereas 5′ mRNA folding energy showed no significant correlation with growth (r = 0.12, P = 0.15). These results support the hypothesis that low codon adaptation in an overexpressed gene decreases cellular fitness (16), probably because retarded elongation sequesters ribosomes on the GFP mRNA and thereby hinders translation of essential mRNAs. The growth rate data could alternatively be explained by the hypothesis that high codon adaptation reduces the rate of deleterious protein misfolding (6, 26, 27). Although we do not rule out this possibility, in our experiments CAI was not correlated with the degree of misfolding, whether it was quantified by the ratio of Coomassie to fluorescence or by the ratio of mRNA to fluorescence (8).

Our findings lead to the following prediction: Adding a stretch of codons with weak mRNA structure to the 5′ end of a gene with originally strong structure should increase expression, even if the additional codons have low CAI. To test this prediction, we fused a 28-codon tag to the 5′ terminus of 72 GFP constructs. The tagged constructs, which featured weak mRNA secondary structure and low CAI (8), produced consistently high expression, including those GFPs poorly expressed in nontagged form (Fig. 3). These results suggest that endogenous E. coli genes may have undergone selection for weak 5′ secondary structure. Consistent with this hypothesis, we found that the predicted secondary structures for the 4294 E. coli genes are significantly weaker near their start codons (nt –4 to +37) than immediately downstream (nt +38 to +79; Wilcoxon P < 10^–15).

Fig. 3.

Expression levels of alternative GFP constructs. The distribution of log2 normalized fluorescence levels for (top) pGK8 (T7 promoter, no leader sequence), (middle) pGK14 (PBAD bacterial promoter, no leader sequence) and (bottom) pGK16 (trp-lac bacterial promoter, 28-codon leader sequence) expression vectors. Fluorescence varied substantially when the constructs were expressed using a T7 or bacterial promoter. The addition of a 28-codon leader sequence with low secondary structure produced uniformly high expression levels.

Here, we have systematically quantified the effects of synonymous nucleotide variation on gene expression in E. coli, on the basis of unbiased sequences that control for regulatory context. The data reveal a predominant role for mRNA structure around the ribosomal binding site in shaping mRNA and protein levels. By contrast, neither local nor global codon bias had significant effects on mRNA or protein levels. This finding is consistent with the view that translation initiation, not elongation, is rate-limiting for gene expression (28), but it seems to contradict the well-known correspondence between codon bias and expression level for endogenous genes (11, 29). There is a simple explanation for this apparent contradiction, which reverses the arrow of causality between codon adaptation and gene expression. In one view, high CAI induces strong protein expression (10–12), whereas we argue that strong expression induces selection for high CAI. Unlike genome-wide correlations between CAI and expression levels [e.g., (11)], our analyses control for noncoding regulation and, thus, can distinguish between these two alternatives.

We propose that the correspondence between codon adaptation and expression level among endogenous E. coli genes arises from selection to make translation efficient at a global level, rather than at the level of individual genes. High CAI increases the elongation rate, but because initiation is rate-limiting in translation, elongation rate does not significantly affect expression. On the other hand, rapid elongation sequesters fewer ribosomes on the message, thereby increasing the total rate of protein synthesis and accelerating cell growth. A similar model for codon preference has been proposed by Andersson and Kurland (16). Well-adapted codons could also confer a metabolic advantage by reducing the load of misfolded proteins (26, 27). In either case, increasing a gene's codon adaptation should not increase its expression. High codon adaptation in a gene should, however, improve cellular fitness to an extent that depends on its expression level.

Supporting Online Material

Materials and Methods

Figs. S1 to S9


List of Oligonucleotides

References and Notes
