Report

Genome-Scale Proteomics Reveals Arabidopsis thaliana Gene Models and Proteome Dynamics


Science  16 May 2008:
Vol. 320, Issue 5878, pp. 938-941
DOI: 10.1126/science.1157956

Abstract

We have assembled a proteome map for Arabidopsis thaliana from high-density, organ-specific proteome catalogs that we generated for different organs, developmental stages, and undifferentiated cultured cells. We matched 86,456 unique peptides to 13,029 proteins and provide expression evidence for 57 gene models that are not represented in the TAIR7 protein database. Analysis of the proteome identified organ-specific biomarkers and allowed us to compile an organ-specific set of proteotypic peptides for 4105 proteins to facilitate targeted quantitative proteomics surveys. Quantitative information for the identified proteins was used to establish correlations between transcript and protein accumulation in different plant organs. The Arabidopsis proteome map provides information about genome activity and proteome assembly and is available as a resource for plant systems biology.

Sequencing of complete genomes has advanced our understanding of biological systems, mostly by enabling a broad range of technologies for the analysis of gene functions and by providing information about the theoretical protein-coding capacity of organisms (1). Because proteins are usually the effectors of biological function, knowledge about their expression levels provides relevant information for the characterization of a biological system. Mass spectrometry instruments with increased detection sensitivity, together with protein and peptide fractionation technologies and data analysis tools, have facilitated cataloguing of proteomes to acquire information about functional properties and activities of the genome (2–4).

To assemble a high-density Arabidopsis proteome map, we performed 1354 LTQ (linear trap quadrupole) ion-trap mass spectrometry runs with protein extracts from six different organs [fig. S1, table S1, and (5)]. The resulting data files were analyzed with two search algorithms, PeptideProphet (6) and PepSplice (7) (fig. S2). We identified 13,029 proteins with 86,456 unique peptides originating from 790,181 tandem mass spectrometry (MS/MS) spectrum assignments at a false-discovery rate below 1%. The data set of 13,029 proteins is formed by merging the set of 10,902 distinct proteins identified from plant organs, including roots, cotyledons, juvenile leaves, flower buds, open flowers, carpels, siliques, and seeds, with the set of 8698 proteins identified from undifferentiated cultured cells (Table 1). Together, these proteins represent assignments for nearly 50% of all predicted Arabidopsis gene models. Our data set is publicly available in the PRIDE database (8, 9), together with information about protein and peptide identification, as well as the corresponding original MS/MS spectra to ensure compliance with the current standards for proteome data deposition (MIAPE) (10). The data can be queried in the PRIDE BioMart at www.ebi.ac.uk/pride/prideMart.do, and an enhanced view of the data set is available from our server at www.AtProteome.ethz.ch.
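The overlap between the organ and cell-culture protein sets is not stated explicitly, but it can be derived from the three counts quoted above. The following is a minimal sketch of that arithmetic, assuming the merged set is a plain union of protein identifiers; the variable names are illustrative and not taken from the original analysis.

```python
# Counts quoted in the text above.
organ_proteins = 10_902    # distinct proteins identified from plant organs
culture_proteins = 8_698   # proteins identified from undifferentiated cultured cells
merged_proteins = 13_029   # size of the merged data set

# Inclusion-exclusion: |A union B| = |A| + |B| - |A intersect B|
# (assumes the merged set is a simple union of the two sets)
shared = organ_proteins + culture_proteins - merged_proteins
organ_only = organ_proteins - shared
culture_only = culture_proteins - shared

print("shared:", shared)
print("organ only:", organ_only)
print("cell culture only:", culture_only)
```

Under that assumption, the three derived groups add up to the merged total again, which is a quick consistency check on the reported numbers.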

Table 1.

Number of assigned spectra, distinct peptides, and proteins in different samples and organs. Mol. mass, average molecular mass in kD.


We evaluated the distribution of the identified proteins into different biological processes on the basis of the TAIR7 Gene Ontology (GO) annotation (11), using the elim method provided with topGO (12), and performed Fisher's exact test to assess the significance of over- or underrepresentation of GO categories compared with all proteins in the Arabidopsis database. Our analysis revealed an underrepresentation of known low-abundance proteins, such as those involved in transcriptional regulation and signaling, and an overrepresentation of proteins involved in basic metabolic processes, including glycolysis, photosynthesis, cellulose synthesis, and translation (fig. S3). We furthermore observed a preferential detection of large proteins (Table 1). This known bias is particularly pronounced for very complex protein mixtures in high-throughput proteomics (3, 13). In order to mitigate this detection bias, we enriched for low-molecular-mass proteins from cultured cells by alternative gel electrophoresis on 10% Tricine gels. This approach added 714 (∼9%) protein identifications with an average molecular mass of 41.2 kD to the cell culture protein set (Table 1).
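The over- and underrepresentation analysis described above rests on Fisher's exact test. As a rough illustration only, the sketch below computes a one-sided P value for overrepresentation of a single GO category from the hypergeometric distribution. It does not reproduce the topGO elim method, and the function name and the toy counts are assumptions made for this example.

```python
from math import comb


def fisher_over_p(k, n, K, N):
    """One-sided Fisher's exact test P value for overrepresentation.

    N: number of proteins in the background database
    K: background proteins annotated with the GO category
    n: number of identified proteins
    k: identified proteins annotated with the GO category
    Returns P(X >= k) for X following the hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 for impossible draws, so no extra bounds check is needed.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers (hypothetical): 20 background proteins, 5 in the category,
# 8 identified proteins, 4 of them in the category.
p_value = fisher_over_p(k=4, n=8, K=5, N=20)
print(p_value)
```

A test for underrepresentation would sum the lower tail of the same distribution instead.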

PepSplice was specifically designed for operating in large search spaces (7), which allowed us to identify peptides containing post-translational modifications (5) and peptides with nontryptic ends in extended database searches, including protein N termini with N-terminal acetylation or with their initiator methionine removed (tables S2 and S3). Most of the detected modified peptides were either oxidized at methionine (10,089) or tryptophan (347) or carbamidomethylated at cysteine (11,373). Amino acid carbamidomethylation and oxidation usually occur during sample preparation in vitro, but a function for methionine and tryptophan oxidation in signal transduction in vivo is currently being discussed (14). We also identified 195 N-terminal acetylated peptides. Because acetylation can be catalyzed by acetyltransferases in vivo, the identified peptides provide information about the substrate spectrum and activity of acetyltransferases (table S3).

We used the PepSplice extended search functionality to match all MS/MS spectra against the TAIR7 genome database and identified peptides from genome regions that have no annotated protein-coding capacity. We required at least two distinct peptides to support a gene model different from those in the Arabidopsis protein database. We found 57 new or alternative gene models based on 261 unambiguously identified unique peptides from 2671 spectrum assignments. The revised gene models (Fig. 1 and table S4) fall into several different categories. For 22 annotated gene models, we found different 5′ or 3′ ends. In seven gene models, peptides were identified in predicted intron sequences. We also identified peptides from seven intergenic regions and 15 pseudogenes, which suggests that these genome regions are expressed. Six of the detected pseudogenes are related to open reading frames in transposable elements, which are often listed as pseudogenes, although some are known to be transcribed and translated. Expression of the pseudogenes was further validated by analysis of recent TILING arrays, in which 12 of the 15 pseudogenes were found to be transcribed (15). For two annotated gene models, we were able to establish a different open reading frame, whereas four new gene models represent a mixture of the different categories detailed above. Altogether, EST evidence was found for 185 out of the 261 peptides in GenBank (table S4). Genscan, a de novo gene prediction algorithm (16), calculated 226 of the 261 peptides, which encompassed 51 of the 57 new or alternative gene models (table S4).
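The category counts in the preceding paragraph can be checked against the reported total of 57 gene models, and the supporting-evidence counts can be expressed as fractions. This is only a bookkeeping sketch; the dictionary keys are shorthand labels chosen for this example.

```python
# Gene model categories and counts as quoted in the text above.
revised_models = {
    "different 5' or 3' end": 22,
    "peptides in predicted introns": 7,
    "intergenic regions": 7,
    "pseudogenes": 15,
    "different open reading frame": 2,
    "mixture of categories": 4,
}

total_models = sum(revised_models.values())

# Fractions of peptides or gene models with independent support.
est_fraction = 185 / 261             # peptides with EST evidence in GenBank
genscan_peptide_fraction = 226 / 261 # peptides covered by Genscan predictions
genscan_model_fraction = 51 / 57     # gene models covered by Genscan predictions

print(total_models)
print(round(est_fraction, 3), round(genscan_peptide_fraction, 3), round(genscan_model_fraction, 3))
```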

Fig. 1.

New or alternative gene models identified by expression evidence from identified peptides. (A to E) Five examples of newly identified gene models. The upper blue line depicts the gene model in the TAIR7 protein database, the red boxes indicate the localization of the peptides identified in the whole-genome search, the blue boxes are the peptides of the corresponding gene model identified in the standard protein database search, and the gray line represents gene prediction by alternative gene prediction tools [Genscan (16), Twinscan (27), EuGENE (28)]. The different categories of gene model revisions include (A) evidence for a different START/5′ end for gene model AT4G17330, (B) evidence for a different STOP/3′ end for gene model AT2G42600, (C) evidence for gene expression in the intergenic region between the two gene models AT3G49600 and AT3G49610, (D) evidence for different splicing of AT3G06530, and (E) evidence for a different reading frame within gene model AT5G39570.

We used topGO to compare the GO category distribution of the cataloged proteins in each Arabidopsis organ with the distribution of proteins in the entire map of identified Arabidopsis proteins (11, 12). Proteins in GO categories translation and glycolysis were overrepresented in all organs, whereas proteins were overrepresented for photosynthesis and chloroplast organization in leaves; for intracellular protein transport, response to oxidative stress, and toxin catabolic process in roots; and for response to heat stress and embryo development in seeds (Fig. 2A and table S5). Thus, each plant organ can be assigned a specific and functionally significant proteome map. Proteins in the GO category RNA metabolism were overrepresented in the proteome of cultured cells, which may reflect their high cell-division rate and their unique metabolism in the presence of sucrose.

Fig. 2.

Characterization of protein identifications. (A) Functional classification of proteins into TAIR GO categories from the aspect “biological process” with topGO using the elim method (11, 12). Fisher's exact test was used for assessing GO term significance. Shown are the GO categories that are overrepresented, with a P value < 10⁻⁶, among the proteins in each of the organs as compared with all identified proteins. (B) The Spearman rank correlation coefficients between the sets of proteins identified in each organ. (C) The distribution of protein identification in relation to transcript levels (i.e., the log-transformed arithmetic mean of transcript levels). Shown in blue are those proteins that were not detected in our proteome analysis even though their genes are represented on the Affymetrix Genechip array; shown in red are those for which the protein was detected.

For a more detailed comparison of the different organ proteomes on a genome-wide scale, we modified the APEX-indexing method to calculate approximate abundance values for all identified proteins (17). From the values obtained for each protein in the different organs, we calculated a correlation matrix to assess the degree of similarity between the different organs (Fig. 2B). The pairwise comparison of undifferentiated cultured cells as a reference with cells from differentiated organs resulted in Spearman rank correlation values ranging from 0.33 for the seed proteome up to 0.46 for the root proteome. Among the proteomes of differentiated organs, the correlation values range from a minimum of 0.39 between the root and leaf and the seed proteome to a maximum of 0.60 between the flower and silique proteome. These correlation coefficients support the results in Fig. 2A and indicate that specialization between different plant organs is reflected in the differential accumulation of proteins.
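The organ-to-organ comparison above uses Spearman rank correlation, which is the Pearson correlation of the ranks of the two abundance vectors. The sketch below shows that calculation in plain Python, assuming both vectors list the same proteins in the same order; the function names and the toy abundance values are hypothetical, and how proteins missing from one organ were handled in the original analysis is not specified here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundance values for five proteins in two organs.
organ_a = [1.0, 2.0, 3.0, 4.0, 5.0]
organ_b = [5.0, 6.0, 7.0, 8.0, 7.0]
rho = spearman(organ_a, organ_b)
print(rho)
```

Applying `spearman` to every pair of organs would yield a correlation matrix of the kind summarized in Fig. 2B.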

We next identified proteins in our data set that were detected in exactly one organ, supported by at least three different spectra. We called these proteins “organ-specific biomarkers.” The biomarkers are enriched for specific functional categories (fig. S4 and table S6) and support the GO term assignments of organ-enriched functional proteome maps (Fig. 2A). Our list of 571 organ-specific proteins (table S7) may help identify cis-regulatory elements that control the organ-specific expression of the corresponding gene models. We compared the distribution of the 571 organ-specific biomarkers with the distribution of biomarkers identified by transcriptional profiling, using the Genevestigator anatomy profiles (18). We found that the two biomarker data sets cluster similarly in the different organs, which validates the specificity of biomarker detection using proteomics and transcriptomics (fig. S5).
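The biomarker selection rule stated above (detection in a single organ with at least three spectra) can be written as a simple filter. The sketch below is one possible reading of that rule, not the original pipeline; the data layout, the function name, and the protein identifiers are hypothetical.

```python
def organ_specific_biomarkers(spectra, min_spectra=3):
    """Select proteins detected in exactly one organ with enough spectra.

    spectra: dict mapping protein -> dict mapping organ -> spectrum count
    Returns a dict mapping each biomarker protein to its single organ.
    """
    biomarkers = {}
    for protein, per_organ in spectra.items():
        detected = [organ for organ, count in per_organ.items() if count > 0]
        if len(detected) == 1 and per_organ[detected[0]] >= min_spectra:
            biomarkers[protein] = detected[0]
    return biomarkers


# Toy input with made-up protein identifiers and counts.
toy_spectra = {
    "protein_A": {"root": 5, "leaf": 0, "seed": 0},   # root only, enough spectra
    "protein_B": {"root": 2, "leaf": 0, "seed": 0},   # root only, too few spectra
    "protein_C": {"root": 4, "leaf": 7, "seed": 0},   # detected in two organs
    "protein_D": {"root": 0, "leaf": 0, "seed": 3},   # seed only, threshold met
}
toy_biomarkers = organ_specific_biomarkers(toy_spectra)
print(toy_biomarkers)
```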

After quantifying proteins using the modified APEX-indexing method described above, we integrated transcriptional profiling data from Genevestigator with our proteomics data to assess the correlation between transcript levels and protein accumulation in different organs (18). Our proteome analysis preferentially detected proteins that are expressed at higher transcript frequencies (Fig. 2C). To quantify this effect, we calculated the correlation coefficients from the transcript and protein levels in the different organs. The highest correlation coefficient of 0.68 was found for leaves, and the lowest, 0.52, was found for seeds (fig. S6). Seeds contain a high percentage of stable storage proteins that are deposited in protein bodies, which could explain the low correlation between transcript and protein accumulation compared with other organs. Although transcript and proteomics data were obtained from similar, but distinct, samples and from different experiments, the positive correlation in the different organs suggests that this approach is robust. Overall, the correlation analysis between transcript and protein accumulation at a genome-wide scale suggests that the accumulation of proteins in Arabidopsis is primarily regulated at the transcript level. More detailed information will be required to establish the level of posttranscriptional control for individual genes.

Targeted quantitative proteomics requires comprehensive information about detectable peptides that unambiguously identify a protein. Prediction efforts depend on peptide properties and are useful but limited in reliability, because ion suppression effects from coeluting analyte molecules influence which peptides are detectable (19). With the constraint that a peptide must be detected with at least three different spectra in a fraction (table S1) in order to be considered proteotypic, we found that the majority of proteotypic peptides were only detected in one fraction or organ, and only a few peptides were detected multiple times (Fig. 3, A and B). One possibility to establish a selection of reliably detectable peptides is to consider only those peptides as proteotypic that are observed in more than 50% of all identifications of the corresponding protein (20). Such a strict definition, however, does not allow for a systematic assessment of peptide traceability, because it does not distinguish between peptide samples that were generated with different extraction methods or from different plant samples. An illustrative example for this issue is acetyl-CoA C-acetyltransferase (AT5G48230), for which different peptides were detected in different organs and different fractions (Fig. 3C).
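The two selection criteria contrasted above can be made concrete with a small sketch: a per-fraction rule (at least three spectra in a fraction) and a frequency rule (observed in more than 50% of the identifications of the protein). Both functions are illustrative assumptions about how such rules could be coded; the names and the toy data are hypothetical, and the check that a peptide maps to only one protein is left out.

```python
def proteotypic_by_fraction(spectrum_counts, min_spectra=3):
    """Per-fraction rule: keep (peptide, fraction) pairs with enough spectra.

    spectrum_counts: dict mapping (peptide, fraction) -> spectrum count
    Returns a dict mapping peptide -> set of fractions where it qualifies.
    """
    result = {}
    for (peptide, fraction), count in spectrum_counts.items():
        if count >= min_spectra:
            result.setdefault(peptide, set()).add(fraction)
    return result


def proteotypic_by_frequency(identifications, threshold=0.5):
    """Frequency rule: keep peptides seen in more than `threshold` of all
    identifications of one protein.

    identifications: list of sets, one set of observed peptides per identification
    """
    all_peptides = set().union(*identifications) if identifications else set()
    total = len(identifications)
    return {
        peptide
        for peptide in all_peptides
        if sum(peptide in obs for obs in identifications) / total > threshold
    }


# Toy data: peptide "pep1" dominates in roots, "pep2" in leaves.
toy_counts = {
    ("pep1", "root_fraction"): 4,
    ("pep1", "leaf_fraction"): 1,
    ("pep2", "leaf_fraction"): 3,
}
toy_identifications = [{"pep1"}, {"pep1"}, {"pep1", "pep2"}, {"pep2"}]

by_fraction = proteotypic_by_fraction(toy_counts)
by_frequency = proteotypic_by_frequency(toy_identifications)
print(by_fraction)
print(by_frequency)
```

In this toy case the frequency rule keeps only `pep1`, whereas the per-fraction rule retains both peptides together with the fraction in which each was reliably detected, which mirrors the argument made above.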

Fig. 3.

Organ- and fraction-specific detection of proteotypic peptides. (A) Distribution of proteotypic peptides in different organs. The majority (65%) of all proteotypic peptides reported here were detected in only one organ (see number of organs, 1), with a pronounced drop in the number of proteotypic peptides identified in more than one organ, and only 1.3% identified in all organs. (B) Distribution of proteotypic peptides in different fractions. The same trend as in (A) applies to the detection of proteotypic peptides in different fractions. (C) Example for the fraction- and organ-specific detection of proteotypic peptides (gray boxes) from acetyl-CoA C-acetyltransferase (AT5G48230).

The Arabidopsis proteome map provides a detailed map of 14,867 organ-specific proteotypic peptides, which accounts for the diverse composition of protein samples and confers higher sensitivity to proteotypic peptide selection for targeted and quantitative proteomics. Similar proteome maps are available for Drosophila, human, and yeast, and the Drosophila and human proteome maps have pointed to gene structures not identified by other means (3, 21–23). Collectively, these proteomics data complement other strategies for genome annotation and gene prediction. The quantitative proteome map we have assembled for Arabidopsis will also facilitate genome-scale transcript and protein abundance correlation analyses to increase our understanding of gene expression control in specific tissues or organs (24, 25). The library of Arabidopsis organ-specific proteotypic peptides now allows expanding quantitative correlation analyses to high-resolution surveys of metabolic or regulatory pathways, or even individual enzymes, by sensitive detection and quantification of minute amounts of protein (26). Organ-specific proteotypic peptide maps are key to the successful design of such targeted proteomics surveys (supporting online material) and allow proteomics to be used as a routine scoring method in plant systems biology.

Supporting Online Material

www.sciencemag.org/cgi/content/full/1157956/DC1

Materials and Methods

SOM Text

Figs. S1 to S7

Tables S1 to S7

References

