Perspective: Gene Expression

Statistics requantitates the central dogma


Science  06 Mar 2015:
Vol. 347, Issue 6226, pp. 1066-1067
DOI: 10.1126/science.aaa8332

Mammalian proteins are expressed at ∼10³ to 10⁸ molecules per cell (1). Differences between cell types, between normal and disease states, and between individuals are largely defined by changes in the abundance of proteins, which are in turn determined by rates of transcription, messenger RNA (mRNA) degradation, translation, and protein degradation. If the rates for one of these steps differ much more than the rates of the other three, that step would be dominant in defining the variation in protein expression. Over the past decade, system-wide studies have claimed that in animals, differences in translation rates predominate (2–5). In this issue (DOI: 10.1126/science.1259038), Jovanovic et al. (6), as well as recent studies by Battle et al. (7) and Li et al. (1), challenge this conclusion, suggesting that transcriptional control makes the larger contribution.

Earlier studies used mass spectrometry, DNA microarrays, and mRNA sequencing (mRNA-Seq) to measure protein and mRNA levels for thousands of genes (2–5), and also to measure the rates of mRNA degradation, translation, and/or protein degradation (through labeling with stable isotopes) (4, 5). Some studies examined a single cell type at steady state (5), whereas others analyzed the differences between tissue types (4), between tumors (2), or between inbred mouse strains (3). Each study found a moderate to low correlation between protein and mRNA abundance data (coefficient of determination R² ≤ 0.4). This was taken to suggest that no more than 40% of the variance in protein levels is explained by variance in the rates of transcription and mRNA degradation and, by implication, that the remaining variance in protein expression (≥60%) is explained by translation and protein degradation (2–5). By employing degradation rate data for mRNAs and proteins in addition to abundance data, it was further estimated that transcription explains 34% of the variance in protein abundance, mRNA degradation 6%, translation 55%, and protein degradation 5% (5) (see the figure).
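One way to read such percentages is through a simplified steady-state picture, which is an assumption made here for illustration and not a description of how (5) derived its numbers. If protein abundance is proportional to (transcription rate × translation rate) / (mRNA degradation rate × protein degradation rate), then on a log scale the four rates simply add. If the rates also vary independently across genes, their variances add too, and each step's share is its variance divided by the total. The variances below are hypothetical values chosen only so that the shares match the percentages quoted above.

```python
def variance_shares(log_variances):
    """Percent of total variance contributed by each step.

    Assumes the steps contribute additively and independently on a log
    scale, so that the total variance is the sum of the step variances.
    """
    total = sum(log_variances.values())
    return {step: 100.0 * v / total for step, v in log_variances.items()}


# Hypothetical variances of log10 rates across genes (illustrative only).
example_variances = {
    "transcription": 0.34,
    "mRNA degradation": 0.06,
    "translation": 0.55,
    "protein degradation": 0.05,
}

shares = variance_shares(example_variances)
for step, percent in shares.items():
    print(f"{step}: {percent:.0f}%")
```

In this picture the first two steps together set the mRNA level, so their combined share (40% here) corresponds to the variance in protein abundance explained by mRNA abundance.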

The high-throughput methods used in these studies, however, show substantial stochastic variation between replica data and also suffer systematic, reproducible biases (1, 4–6, 8, 9). For example, label-free mass spectrometry can underestimate amounts of all lower-abundance proteins by as much as a factor of 10 (1, 8), and mRNA-Seq data are biased by guanine-cytosine base pair content by a factor of up to 3 (9). Because each type of error has different causes and because RNA and protein techniques differ greatly, the errors should be uncorrelated. Thus, the correlation of protein versus mRNA as measured will be lower than that between error-free data. The papers by Jovanovic et al., Battle et al., and Li et al. used careful statistical efforts to estimate and/or reduce the impact of errors and thereby find the higher correlation expected between true protein and true mRNA levels.
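The attenuating effect of uncorrelated errors can be seen in a small simulation. This is a minimal sketch under assumed values: the gene count, the log-scale standard deviations, and the Gaussian error model are all hypothetical and are not taken from any of the cited studies.

```python
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation (coefficient of determination)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def simulate(n_genes=5000, seed=0):
    """Return (R^2 of true values, R^2 of measured values)."""
    rng = random.Random(seed)
    # Hypothetical true log10 abundances: protein follows mRNA plus some
    # genuine post-transcriptional variation (standard deviation 0.5).
    true_mrna = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    true_protein = [m + rng.gauss(0.0, 0.5) for m in true_mrna]
    # Independent measurement errors, one per technique (assumed sizes).
    measured_mrna = [m + rng.gauss(0.0, 0.7) for m in true_mrna]
    measured_protein = [p + rng.gauss(0.0, 1.0) for p in true_protein]
    return (
        pearson_r2(true_mrna, true_protein),
        pearson_r2(measured_mrna, measured_protein),
    )


true_r2, measured_r2 = simulate()
print(f"R^2 between true values:     {true_r2:.2f}")
print(f"R^2 between measured values: {measured_r2:.2f}")
```

With these assumed variances the expected R² is 0.80 for the true values and about 0.30 for the measured ones, so data that look translation-dominated can arise from a system in which mRNA levels explain most of the protein variance.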

Jovanovic et al. examined mouse bone marrow dendritic cells at steady state and during response to bacterial lipopolysaccharide (LPS) (6). They used a Bayesian model to estimate the true rates of translation and protein destruction from noisy mass spectrometry data. In addition, three independent estimates of protein abundance were made from three samples, each digested with a different protease. These three differently biased estimates were then used in separate parts of the analysis to avoid a confounding dependency on common errors that would result if a single estimate were used throughout. Filtering of unreliable data and estimates of stochastic mRNA-Seq errors, in addition, allowed Jovanovic et al. to calculate that at steady state, mRNA levels explain 68% of the variance in protein expression, translation rates 26%, and protein degradation rates 8%. Upon stimulation of cells with LPS, mRNA levels appear to explain 90% of the changes in protein expression, with translation and protein degradation explaining only 4% and 6%, respectively. Jovanovic et al. did find, though, that upon LPS treatment, translation and protein degradation rates changed more for ribosomal, mitochondrial, and other highly expressed housekeeping proteins than for other genes, indicating an important role for these two steps in the control of some processes.

Control of protein expression.

The charts show the percent contributions of the variance in the rates of each step in gene expression to the variance in protein abundance for 4212 genes (from a mouse cell line). The left chart shows estimates from (5); the right chart shows estimates from (1) that take into account stochastic and systematic errors in the abundance data of (5).

Battle et al. took a different tack, examining human protein variation among 62 individuals from the Yoruba population of Ibadan, Nigeria (7). Genomic DNA sequences for each individual were compared to mRNA-Seq, ribosome footprinting (ribosome density per mRNA), and mass spectrometry data for lymphoblastoid cells derived from each person. Consistent with previous results (3), the variation in measured protein levels between individuals correlates poorly with the variation in measured mRNA abundances (mean R² < 0.2) (7). However, when only those differences in expression that are associated with variation in the DNA sequence of a nearby gene were considered, most gene loci showing changes in protein levels between individuals also showed correlated differences in mRNA expression, consistent with a dominant role for transcription. In addition, there was “a scarcity” of DNA sequence changes that affected only ribosome footprint density and protein abundance, not mRNA levels. In effect, by constraining their analysis to only those differences in expression associated with DNA sequence variation, Battle et al. excluded much of the variation due to measurement errors to obtain a more accurate answer.

Li et al. (1) (our own study) reanalyzed data in (5) with two approaches to account for measurement errors. In the first, a nonlinear scaling error in protein abundance estimates (from mass spectrometry data) was corrected using classic data from the literature, and a subset of the other errors in the mRNA-Seq and protein abundance data was estimated from replica and other control data. In the second approach, variance in translation rates measured directly by ribosome footprinting was substituted for a larger variance that had been inferred indirectly with a model in (5). The first approach suggests that the variance in true mRNA levels explains a minimum of 56% of the variance in true protein levels. The second implies that true mRNA levels explain 84% of the variance in true protein expression, transcription 73%, RNA degradation 11%, and translation and protein degradation each only 8%.
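To illustrate how replicate data can be used to adjust an observed correlation for stochastic error, the sketch below applies the classic correction for attenuation. It is offered as a generic textbook technique and is not claimed to be the procedure used in (1). The function name and the example numbers are hypothetical. "Reliability" here means the fraction of a measurement's variance that reflects true signal, which can be estimated from the correlation between replicate measurements whose errors are independent.

```python
def corrected_r2(observed_r2, reliability_x, reliability_y):
    """Estimate R^2 between true values from R^2 between noisy measurements.

    Classic correction for attenuation: the observed R^2 is divided by the
    product of the two reliabilities. The result is capped at 1.0 because
    noisy reliability estimates can push the ratio above 1.
    """
    for reliability in (reliability_x, reliability_y):
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return min(1.0, observed_r2 / (reliability_x * reliability_y))


# Hypothetical inputs: observed protein-vs-mRNA R^2 of 0.40, with
# replicate-based reliabilities of 0.85 (mRNA) and 0.70 (protein).
estimate = corrected_r2(0.40, 0.85, 0.70)
print(f"Estimated R^2 between true values: {estimate:.2f}")
```

A correction of this kind addresses only random error. Systematic biases, such as a nonlinear scaling error in protein abundance estimates, do not show up as disagreement between replicates and have to be corrected separately against independent reference data.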

Most controllers of gene expression identified by classic genetic or biochemical methods are either transcription factors or proteins (such as kinases and signaling receptors) that directly regulate the activities of proteins, not their abundances. In addition, translation and mRNA degradation rates change only modestly upon cellular differentiation or when microRNA expression is perturbed (10–12). Moreover, improved statistical analyses show that in contrast to earlier studies, mRNA levels explain most of the variance in protein abundances in yeast (13, 14). Finally, ∼40% of genes in a single mammalian cell express no mRNA (1, 15); thus, for these ∼8800 genes, transcriptional repression by chromatin is likely the sole determinant of the absence of protein expression.

Understanding the contributions of transcriptional versus posttranscriptional control is not simply a matter of academic interest. For example, variation in protein expression among 95 colorectal tumor samples is only poorly explained by measured mRNA abundances (2), which might imply that different responses of patients to anticancer treatments are posttranscriptional effects. If, however, most of the variation in protein levels is controlled by transcription but this fact is obscured by measurement errors, then differences in drug action could be mainly explained by variation at the transcriptional level.

Accurate quantitation of the control of gene expression is in its infancy. Experimental protocols with fewer inherent biases are needed, along with further improvements in statistical methods that can estimate and take error into account. Before gene expression can be correctly modeled, an accurate accounting of molecular abundances and expression rate constants is vital.

References and Notes

Acknowledgments: J.J.L. was supported in part by the Department of Statistics at UCLA. Work at Lawrence Berkeley National Laboratory was conducted under U.S. Department of Energy contract DE-AC02-05CH11231.