Research Article

A genome-wide transcriptomic analysis of protein-coding genes in human blood cells


Science  20 Dec 2019:
Vol. 366, Issue 6472, eaax9198
DOI: 10.1126/science.aax9198

A blood cell protein-expression atlas

Genome-wide analyses are increasingly providing resources for advances in basic and applied biomedical science. Uhlen et al. performed a global expression analysis of human blood cell types and integrated these data with data across all major human tissues and organs in the Human Protein Atlas. This comprehensive compendium allows for classification of all human protein-coding genes with regard to their tissue- and cell-type distribution.

Science, this issue p. eaax9198

Structured Abstract

INTRODUCTION

Blood is the predominant source for molecular analyses in humans, both in clinical and research settings, and is the target for many therapeutic strategies, emphasizing the need for comprehensive molecular maps of the cells constituting human blood. The Human Protein Atlas program (www.proteinatlas.org) is an open-access database that aims to map all human proteins by integrating various omics technologies, including antibody-based imaging. Previously, the Human Protein Atlas included gene expression information from peripheral blood mononuclear cells but not the many subpopulations of blood cells within this cell type. To increase the resolution, we performed an in-depth characterization of the constituent cells in blood to provide a detailed view of the gene expression in individual human blood cells and relate these to the other tissues in the body.

RATIONALE

A quantitative transcriptomics-based expression analysis was performed in 18 canonical immune cell populations (Fig. 1) isolated by flow cytometric sorting. The blood cell expression profiles are presented in combination with expression profiles of tissues, including transcriptomics data from external sources to expand the number of tissue types as well as brain regions included in the database. A genome-wide classification of the protein-coding genes has been performed in terms of expression specificity and distribution, both in blood cells and tissues.

RESULTS

We present an atlas of the expression of all protein-coding genes in human blood cells, integrated with a classification of the specificity and distribution of all protein-coding genes in all major tissues and organs in the human body. A genome-wide analysis of blood cell RNA expression profiles allowed the identification of genes with elevated expression in various immune cells, confirming well-known protein markers, but also identified novel targets for in-depth analysis. There are 1448 protein-coding genes that have enriched expression in a single immune cell type. It will be interesting to study the corresponding proteins further to explore the biological functions linked to the respective cell phenotypes. A network plot of all cell type–enriched and group-enriched genes (Fig. 1B) reveals that many of the cell type–enriched genes are in neutrophils, eosinophils, and plasmacytoid dendritic cells, while many of the elevated genes in T and B cells are group-enriched across subpopulations of these lymphocytes. To illustrate the usefulness of this resource, we show the cellular distribution of genes known to cause primary immunodeficiencies in humans and find that many of these genes are expressed in cells not currently implicated in these diseases, illustrating how this global atlas can help us better understand the function of specific genes across cells and tissues in humans.

CONCLUSION

In this study, we have performed a genome-wide transcriptomic analysis of protein-coding genes in sorted blood immune cell populations to characterize the expression levels of each individual gene across all cell types. All data are presented in an interactive, open-access Blood Atlas as part of the Human Protein Atlas and are integrated with expression profiles across all major tissues to provide spatial classification of all protein-coding genes. This allows for a genome-wide exploration of the expression profiles across human immune cell populations and all major human tissues and organs.

Fig. 1 Outline of the analysis of human single blood cell types.

(A) A schematic view of the hematopoietic differentiation. This study analyzes the cell types shown in the bottom row. NK, natural killer. (B) Network plot showing the number of cell type– (red) and group-enriched (yellow) genes in the 18 cell types. The network is limited to nodes with a minimum of seven genes. DC, dendritic cell; T-reg, regulatory T cell; gdT cell, gamma delta T cell; MAIT, mucosal-associated invariant T cell.

Abstract

Blood is the predominant source for molecular analyses in humans, both in clinical and research settings. It is the target for many therapeutic strategies, emphasizing the need for comprehensive molecular maps of the cells constituting human blood. In this study, we performed a genome-wide transcriptomic analysis of protein-coding genes in sorted blood immune cell populations to characterize the expression levels of each individual gene across the blood cell types. All data are presented in an interactive, open-access Blood Atlas as part of the Human Protein Atlas and are integrated with expression profiles across all major tissues to provide spatial classification of all protein-coding genes. This allows for a genome-wide exploration of the expression profiles across human immune cell populations and all major human tissues and organs.

Resolving the molecular details of proteome variation in the different cells, tissues, and organs of the human body may considerably increase our knowledge of human biology and disease. Several efforts to map the molecular components of the human body in a comprehensive manner have been initiated, including efforts to generate experimental data such as the Human Cell Atlas (1), the Human Biomolecular Atlas Program (HuBMAP) (2), the Biohub (3), the Genotype-Tissue Expression (GTEx) project (4), the Functional Annotation of the Mammalian Genome (FANTOM) project (5), and the Allen Brain Atlas (6), involving many alternative technologies, including single-cell genomics (7), in situ analysis (8), transcriptomics (9), proteomics (10), and antibody-based profiling (11). In addition, several knowledge resources have been created to annotate, assemble, and integrate data from such sources, such as UniProt (12), ELIXIR (13), ArrayExpress (14), Peptide Atlas (15), and ImmPort (16). The combined efforts of these resources have the potential to allow a systematic knowledge base of the molecular components of human life that will aid a systems biology understanding of human biology and diseases.

A complement to these efforts is the Human Protein Atlas program (17), which is exploring the human proteome using gene-centric and genome-wide antibody-based profiling on tissue microarrays. This allows for spatial pathology-based annotation of protein expression that is performed in combination with deep sequencing transcriptomics profiling of the same tissue types. The aim is to map all human proteins in cells, tissues, and organs using integration of various omics technologies, including antibody-based imaging, mass spectrometry–based proteomics, and transcriptomics. The earlier version of the Human Protein Atlas consists of three separate parts, each focusing on a particular aspect of the genome-wide analysis of human proteins: the Tissue Atlas (17), showing the distribution of proteins across all major tissues and organs in the human body; the Cell Atlas (18), showing the subcellular localization of proteins in single cells; and the Pathology Atlas (19), showing the impact of different protein levels in tumor tissue on the survival of cancer patients. However, there is a lack of data regarding protein expression levels in human blood cells. Given that blood is the most commonly used material for molecular analyses in clinical labs and in research, characterizing the constituents of blood and updating the Human Protein Atlas with a more fine-grained view of the immune cells in blood will be of importance.

In this study, we performed a quantitative expression analysis of 18 canonical immune cell populations, as well as total peripheral blood mononuclear cells (PBMCs) from human blood separated by flow cytometric sorting. The data are integrated with recent transcriptomics efforts involving flow sorting of blood cells, including the analysis in 15 blood cell types by Schmiedel et al. (20) and 29 blood cell types as well as total PBMCs by Monaco et al. (21). We present the expression profiles in specific cell populations and combine the new single-cell blood data with the data from the Tissue Atlas (17) by incorporating transcriptomics data from the GTEx (4) and the FANTOM5 (5) projects. Moreover, we expanded the set of normal tissue samples by adding tissues such as retina and tongue, as well as extensive data covering the different regions of the brain. A genome-wide classification of the protein-coding genes with regard to tissue and cell distribution as well as specificity has been performed using between-sample normalized data (22, 23). The results are presented in an interactive database (www.proteinatlas.org) that can serve as a reference for researchers interested in spatial expression profiles of human blood cells in relation to the body-wide profiles in all major tissues and organs.

Transcriptome analysis of isolated human immune cell populations

We used flow cytometric sorting to allow whole-genome transcriptome analysis of the major blood cell types from human blood (Fig. 1A). Whole blood was collected from six healthy individuals, and 18 immune cell types were separated by flow cytometric sorting, as outlined in Fig. 1B. The cell types recovered included naïve and memory B cells, CD4 and CD8 T cell populations, natural killer (NK) cells, three monocyte subsets, neutrophils, eosinophils, and basophils, as well as plasmacytoid and myeloid dendritic cells. These can be classified into six different blood cell lineages consisting of granulocytes, monocytes, T cells, B cells, dendritic cells, and NK cells. The sorted cells were immediately processed using RNA extraction and cDNA generation followed by deep mRNA sequencing. The RNA expression levels were determined for all protein-coding genes (n = 19,670) across the 18 immune cell populations and visualized in a newly created Blood Atlas, launched here as an extended edition of the open-access Human Protein Atlas (www.proteinatlas.org/blood).

Fig. 1 Outline of the analysis of human single blood cell types.

(A) A schematic view of the hematopoietic differentiation with the cell types analyzed in this study highlighted. HSC, hematopoietic stem cell; CMP, common myeloid progenitor; CLP, common lymphoid progenitor; RBC, red blood cell; mDC, myeloid dendritic cell; pDC, plasmacytoid dendritic cell. (B) A schematic view of the experimental procedure to analyze the transcript expression levels in human single cell types. The 18 cell types listed include seven subsets of T cells, two variants of B cells, three different monocytic cell types, and the three known forms of granulocytes.

In the Blood Atlas, the expression levels for each of the 19,670 genes are displayed for the 18 cell types and PBMC as exemplified in Fig. 2A. The first example is the G protein–coupled C-C motif chemokine receptor 3 (CCR3), involved in allergic reactions, showing distinct expression in basophils and eosinophils, with much lower levels in neutrophils. Next, the secretin propeptide (SCT), previously described (24) as being produced in the gastrointestinal tract (duodenum and colon), is here found to also be expressed in the human plasmacytoid dendritic cells. The killer cell lectin like receptor F1 (KLRF1), known to stimulate cytotoxicity and cytokine release in NK cells (25), is an example of an NK cell–enriched gene, but the data also show expression in gamma delta T (gdT) cells and mucosal-associated invariant T (MAIT) cells. The purity of our sorting is verified by known marker expression patterns, such as the canonical cell surface receptor CD19, exclusively expressed in B cells, and the cytotoxic T lymphocyte–associated protein 4 (CTLA4), expressed on regulatory T cells (Tregs). The complement C1q A chain (C1QA) of the complement system is instead enriched in monocytes, and the profiling shows high expression in intermediate and nonclassical monocytes but no expression in classical monocytes. In addition to the sorted single cell type populations, the mixed PBMCs were collected from the individuals, as described before (26), and the transcriptome was determined.

Fig. 2 The expression profiles of the protein-coding genes in human single blood cell types.

(A) Examples of expression profiles for six genes enriched in one of the cell lineages (see www.proteinatlas.org for details). (B) A UMAP analysis of the relationship between the global expression patterns in all the 109 blood cell samples analyzed here. (C) A heatmap showing the pairwise Spearman correlation between the global expression profiles for the 18 analyzed cell types. (D) Transcriptomics-derived hematopoietic tree showing the similarities in global expression patterns between different human blood cell types. (E) UMAP analysis showing the relationship between all the blood cell samples from three different sources. Cell types overlapping between two or all three datasets are connected by dotted lines. (F) Comparison of expression profiles for the three datasets, as exemplified for the genes CD22 and CSF1R (see www.proteinatlas.org for details).

Global expression profiles for the blood cell types

The relationships between all blood cell samples on the basis of their global expression profiles were analyzed using different algorithms, including principal components analysis (PCA) (27) and uniform manifold approximation and projection (UMAP) (28), and the UMAP results for all samples for all cell types are shown in Fig. 2B. Samples of the same cell type showed similar global expression profiles, and the various B cell and T cell subtypes clustered together. A heatmap based on pairwise Spearman correlation of the expression profiles of the 18 cell types (Fig. 2C) showed that cells of similar origin have similar overall expression profiles, with the three granulocyte cell types having the most distinct expression profiles. All lymphocytes form a separate cluster, including all seven T cells clustering together with the NK cells, and naïve and memory B cells clustering together. The monocytes are most closely related to the myeloid dendritic cells and the plasmacytoid dendritic cells. To analyze the similarities between the cell types of different origins in more detail, we constructed a transcriptomics-derived hematopoietic tree (Fig. 2D) to further illustrate the relation in global expression profiles between the different single blood cell types.
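
The pairwise Spearman correlation underlying a heatmap such as Fig. 2C can be illustrated with a minimal sketch. It assumes each cell type is summarized as a list of expression values over the same ordered set of genes; the three profiles and their values below are hypothetical toy data, and the helper names are illustrative rather than taken from the study's pipeline.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(profiles):
    """Pairwise Spearman correlation between all named profiles."""
    names = list(profiles)
    return {a: {b: spearman(profiles[a], profiles[b]) for b in names} for a in names}

# Hypothetical expression values for five genes (same gene order in each list).
profiles = {
    "neutrophil": [120.0, 0.5, 3.0, 40.0, 0.0],
    "eosinophil": [90.0, 1.0, 2.0, 55.0, 0.2],
    "naive_B": [0.3, 80.0, 60.0, 1.0, 25.0],
}
matrix = correlation_matrix(profiles)
```

In this toy input the two granulocyte profiles rank the genes identically, whereas the B cell profile is negatively correlated with both. A matrix of this kind could then be passed to any hierarchical clustering routine.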

The transcript expression profiles from the recent studies by Schmiedel et al. (20) and Monaco et al. (21), having partially overlapping data for 15 and 29 blood cell types, respectively, are also included in the Blood Atlas. UMAP results for all cell types from the three different data sources are shown in Fig. 2E, confirming the distinct expression profiles between various types of blood cells. A summary of the genome-wide expression levels from all three datasets is visualized for all protein-coding genes in the Blood Atlas resource online (Fig. 2F). More in-depth analyses are needed to establish whether the differences seen are due to differential activation states arising from sample handling, to differences in cell sorting, or to biological differences among cohorts, which represent individuals from Europe (this study), the United States (20), and Asia (21).

Genome-wide transcriptomics profiles across all major organs and tissues

With the new data covering the blood cell expression profiles as well as an expanded set of normal tissue types, the body-wide tissue profiling performed earlier (29) was revised. Because the brain regions were only superficially covered in the earlier analysis, we also decided to include more brain regions using publicly available data from the GTEx (4) and FANTOM (5) consortia to allow for more in-depth coverage of the different regions of the human brain. Altogether, 1710 samples from selected human brain regions were added to the classification covering 23 human subregions and summarized into 12 main structures of the brain (Fig. 3A). The detailed analysis of the protein expression in these brain structures will be described elsewhere, but here the expression profiles were used in the body-wide tissue classification of all genes. In addition, the five tissues dominated by immune cells (thymus, appendix, spleen, lymph node, and tonsil) were summarized into “lymphoid tissues,” and the four highly related tissues from the gut (duodenum, small intestine, colon, and rectum) were summarized into “intestine,” as outlined in Fig. 3A. Some additional tissues, including lactating breast, vagina, retina, ductus deferens, and tongue, were also added to the comparative analysis. The expression data for the 18 blood cell types as well as PBMC described above were summarized into “blood.” A body-wide classification based on the genome-wide expression profiles of the protein-coding genes was performed with 171 different cells, tissues, and organs, which are summarized into 37 tissue types.

Fig. 3 Classification of the human global gene expression profiles across all major tissues and organs and the immune cell types.

(A) Schematic view of all human tissues and organs analyzed. (B) The number of detected genes in selected tissues based on pTPM and NX values, respectively. (C) Three examples of tissues introduced in this study. (D) Pie chart showing the number of genes classified according to the specificity categories. (E) (Left) A dendrogram based on the correlation of global expression profiles across all tissues and organs, including blood. (Right) Barplot displaying the number of elevated genes for each tissue type. (F) Chord diagram showing the relationship between the distribution classification and the specificity classification. Each link represents the number of genes with the linked distribution category and specificity category.

The transcriptomics data were normalized using two different strategies, with the main objective of allowing (i) within-sample comparisons and (ii) between-sample comparisons, respectively, as outlined in fig. S1. For the within-sample comparisons, the fraction of transcripts corresponding to a particular gene is used. We focus on the protein-coding transcripts: the number of transcripts per million of total transcripts from protein-coding genes (pTPM) is calculated for each individual gene in every sample. The pTPM value is visualized on the Blood Atlas page of the Human Protein Atlas across the samples for each of the genes. The pTPM values can be considered as the within-sample normalized data from the deep sequencing, in which noncoding RNA has been excluded from the analysis. The pTPM values can be used to investigate the abundance of a particular gene, gene family, or gene class relative to all other transcripts in a particular cell, tissue, or organ.
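
As a rough illustration of the within-sample normalization, the sketch below rescales the values of one sample so that the protein-coding genes alone sum to one million. It assumes that per-gene TPM values and a set of protein-coding gene identifiers are already available; the function name, the toy values, and the use of MALAT1 as a stand-in for a noncoding transcript are illustrative assumptions, not the study's implementation.

```python
def ptpm(tpm_by_gene, protein_coding):
    """Rescale TPM so that protein-coding genes alone sum to one million."""
    # Drop every gene that is not in the protein-coding set.
    coding = {g: v for g, v in tpm_by_gene.items() if g in protein_coding}
    total = sum(coding.values())
    if total == 0:
        raise ValueError("no protein-coding expression in this sample")
    return {g: v * 1_000_000 / total for g, v in coding.items()}

# Hypothetical sample: the numbers are toy values, not atlas data.
sample_tpm = {
    "ALB": 300_000.0,
    "HP": 100_000.0,
    "APOA2": 100_000.0,
    "MALAT1": 500_000.0,  # noncoding, excluded from pTPM
}
coding_genes = {"ALB", "HP", "APOA2"}
sample_ptpm = ptpm(sample_tpm, coding_genes)
```

Here ALB accounts for 300,000 of the 500,000 protein-coding TPM, so its pTPM is 600,000, and the pTPM values of the sample sum to one million by construction.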

The second normalization strategy is carried out to allow for comparisons across samples and to avoid batch effects caused by sampling, technology platforms, or the difference in transcriptome size between different types of tissues, as exemplified by pancreas and salivary gland, where a small number of genes are very highly expressed (22, 23). This is particularly important when tissue samples based on different transcriptomic technology platforms have been used, as described for the tissue analysis where RNA sequencing data from multiple sources as well as cap analysis of gene expression data from the FANTOM5 program have been combined. Here, we used a normalization based on trimmed mean of M values (TMM) (30), Pareto scaling (31), and the Limma R package (32) to calculate a normalized expression value (NX) for each gene in every sample. In the Human Protein Atlas, the NX value for each gene is visualized in parallel with the pTPM value for all tissues and cell types. The objective of using the NX value is to facilitate the analysis of differences in expression of genes between cells, tissues, and organs and to allow for a specificity classification based on the genome-wide expression of all genes across the human blood cells, tissues, and organs.

The number of detected genes in the different tissues and organs was investigated using both the within-sample normalization (pTPM) and the between-sample normalization (NX), in both cases using a cutoff value of 1, as described previously (17). In Fig. 3B and fig. S2, the results for selected tissues are shown, and the analysis demonstrated a similar number of detected genes for most samples, with some notable exceptions, including tissues with a small fraction of highly abundant transcripts, such as bone marrow (hemoglobin), pancreas (digestive enzymes), liver (albumin), and salivary gland (digestive proteins).

The revised tissue classification of all human genes

The extended data allowed us to refine the classification for the putative protein-coding genes on the basis of their expression across all 37 cells, tissues, and organs. Some examples of genes detected in the recently added tissues are shown in Fig. 3C. The first example, CRABP2 in vagina, plays a role in the vitamin A signaling pathway, with tissue-enhanced expression in squamous mucosa and with nuclear and cytoplasmic positivity in suprabasal squamous epithelia. Another example is breast with ZNF80, a protein with unknown function that here shows nuclear positivity with tissue enhanced expression in blood and breast tissue. Also shown is retinal epithelium with cone-rod homeobox protein (CRX), showing nuclear positivity in the cone-and-rod photoreceptor layer.

All 19,670 genes were classified according to a strategy based on scoring both tissue specificity and tissue distribution (tables S1 and S2; full list of results in data S1). Of all protein-coding genes, 56% (n = 11,069) showed elevated expression in at least one of the analyzed tissues, and these were further subdivided into (i) tissue-enriched genes with at least fourfold higher expression levels (based on NX values) in one tissue type as compared with any other analyzed tissue; (ii) group-enriched genes with enriched expression in a small number of tissues (2 to 5); and (iii) tissue-enhanced genes with only moderately elevated expression (table S1). Of the protein-coding genes, 2845 (14%) were found to be enriched in one of the analyzed tissues (Fig. 3D), and only 216 genes were not detected in any of the analyzed tissues. Our classification shows the number of tissue-enriched genes for each tissue type, as well as the number of genes enriched in different groups of tissues (Fig. 3E). The largest number of tissue-enriched genes is found in the testis, as shown in our previous results (17); however, the largest number of elevated genes is now found in the brain, most likely owing to the inclusion of many more brain regions as compared with earlier versions of the atlas. Whereas the specificity classification shows the enrichment of genes, the distribution classification shows the fraction of tissues in which a gene is expressed. Only 737 genes (4%) are restricted to a single tissue, while almost half of the protein-coding genes are expressed in all tissues (n = 9638) (fig. S3).
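
The specificity categories can be sketched as a small decision procedure. The fourfold threshold and the group size of 2 to 5 tissues follow the text; the exact criteria are given in table S1, which is not reproduced here, so three points below are assumptions made for illustration only: "detected" is read as NX of at least 1, a group is accepted when its weakest member is detected and at least fourfold above every tissue outside the group, and "tissue-enhanced" is read as at least fourfold above the mean of the other tissues. Gene values are hypothetical.

```python
FOLD = 4.0    # fold threshold stated in the text for tissue-enriched genes
CUTOFF = 1.0  # detection cutoff on NX; reading "detected" as >= 1 is an assumption

def classify_specificity(nx_by_tissue):
    """Assign one specificity category to a gene from its per-tissue NX values."""
    levels = sorted(nx_by_tissue.values(), reverse=True)
    if levels[0] < CUTOFF:
        return "not detected"
    # Tissue-enriched: the top tissue is at least FOLD times any other tissue.
    if levels[0] >= FOLD * levels[1]:
        return "tissue enriched"
    # Group-enriched (assumed rule): 2 to 5 detected tissues, the weakest of
    # which is at least FOLD times the strongest tissue outside the group.
    for size in range(2, 6):
        if (size < len(levels) and levels[size - 1] >= CUTOFF
                and levels[size - 1] >= FOLD * levels[size]):
            return "group enriched"
    # Tissue-enhanced (assumed rule): some tissue is at least FOLD times the
    # mean of all other tissues.
    for tissue, value in nx_by_tissue.items():
        others = [v for t, v in nx_by_tissue.items() if t != tissue]
        if value >= FOLD * (sum(others) / len(others)):
            return "tissue enhanced"
    return "low tissue specificity"

# Hypothetical NX profiles, one dictionary per gene.
enriched = {"testis": 40.0, "brain": 2.0, "liver": 1.0, "blood": 0.5}
grouped = {"cardiac muscle": 50.0, "skeletal muscle": 40.0, "liver": 5.0,
           "brain": 4.0, "blood": 3.0, "skin": 2.0}
enhanced = {"brain": 30.0, "testis": 10.0, "liver": 4.0, "skin": 1.5,
            "blood": 0.5, "retina": 0.2, "tongue": 0.1}
flat = {"brain": 10.0, "testis": 12.0, "liver": 9.0, "blood": 11.0}
silent = {"brain": 0.2, "testis": 0.1, "liver": 0.0, "blood": 0.3}

labels = [classify_specificity(g) for g in (enriched, grouped, enhanced, flat, silent)]
```

The checks run from the strictest category to the loosest, so a gene that satisfies several rules receives the most specific label. The function expects values for at least two tissues.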

The global expression profiles were investigated using the between-sample normalized values (NX) using PCA (fig. S4), UMAP (fig. S5), and hierarchical clustering based on genome-wide correlation between the cells, organs, and tissue types (fig. S6). The resulting dendrogram (Fig. 3E) shows that testis and brain have the most distinct expression profiles compared with all other tissues, and that blood is most highly correlated with lymphoid tissues and bone marrow. The overall results corresponded well with the origin and function of each tissue, as exemplified by many of the female tissues clustering together and the close connectivity of the two tissues composed of striated muscle (cardiac and skeletal muscle).

In Fig. 3F and table S3, a summary of all 19,670 genes with regard to both tissue specificity and distribution classification is shown with the genome-wide relationship of the two classification schemes introduced, showing that only 586 genes are “tissue specific,” meaning they are tissue-enriched and, at the same time, only detected in a single tissue (this list is available at www.proteinatlas.org). Relatively few genes (n = 1637) were found to be group-enriched, and this lower number compared with earlier results (17) is most likely explained by the fact that some tissues have now been grouped together, such as lymphoid tissues, intestine, and brain. 43% (n = 8385) of the genes were classified as “low tissue specificity,” and most of these are found in the “detected in all” category. All 19,670 protein-coding genes in humans have now been analyzed with respect to their tissue specificity and distribution across all major organs, tissues, and blood cells in the human body, and the results are available in the Human Protein Atlas.

Transcriptome usage in different cells and tissues

An analysis of the transcriptome allowed us to determine the fraction of transcripts corresponding to different genes in each analyzed cell type and tissue. Here, we report the transcriptome usage for some representative blood cell types and tissues on the basis of within-sample normalized pTPM values (Fig. 4A and fig. S7A) and between-sample NX normalized values (Fig. 4B and fig. S7B). These are further stratified according to genes coding for secreted, membrane-bound, and intracellular proteins. It is notable that, for pancreas and salivary gland, as much as 80 and 50%, respectively, of the transcripts (based on pTPM) encode secreted proteins. This demonstrates the extreme specialization of these “secretory cell factories” for production of extracellular proteins, with a few genes dominating the transcriptome load. The most abundant transcripts in pancreas encode digestive enzymes, such as lipases (PNLIP, CLPS), proteases (PRSS1, CELA3A), and peptidases (CPA1, CPB1). The most abundant transcripts in salivary gland encode a protein with essentially unknown function (submaxillary gland androgen regulatory protein 3B, SMR3B) and statherin (STATH), which prevents the precipitation of calcium phosphate in saliva, maintaining a high calcium level in saliva that is necessary for remineralization of tooth enamel. The third- and fourth-most abundant transcripts in salivary gland encode antimicrobial peptides (HTN3 and HTN1). Similarly, the liver has a large fraction of transcripts encoding secreted proteins, with the most abundant being albumin (ALB), haptoglobin (HP), and apolipoprotein A2 (APOA2).

Fig. 4 Analysis of the global expression profiles in the various tissues.

(A) The transcriptional load based on pTPM in some selected cells and tissues stratified according to protein location: secreted, membrane-bound, or intracellular. The genes with most abundant transcripts are labeled. (B) Same as (A), but based on the between-sample normalized NX values scaled to a sum of one million. (C) Immunohistochemistry (IHC) images from the Human Protein Atlas for four examples of the most abundant genes in some selected tissues. (D) Boxplot showing the distribution of the number of detected genes for the combined groups of tissue types (brain, blood, intestine, and lymphoid tissues), all single tissue types, the 18 blood cell types, and cell lines (18). (E) The number of genes expressed in all samples is shown based on the earlier analysis (17), and in all tissues, the immune cell types reported here as well as for 60 cell lines. Also shown is the number of genes when including all these three sample types. We also compare the number of genes identified as “essential” using CRISPR knock-out strategies (33, 34) and highlight the number of genes not “detected in all” for all samples covering the cell lines, tissues, and blood cells.

In contrast, >60% of the transcriptome load (based on pTPM) in cardiac muscle encodes membrane proteins, mainly consisting of mitochondrial proteins, which is not unexpected given the extreme energy requirement of the cardiac muscle. For most tissues and for all the single blood cells, the intracellular proteins instead constitute most of the transcriptome load, as exemplified by bone marrow with hemoglobin (HBB) and the skin with keratin (KRT10) (Fig. 4C) as the most abundant transcripts, respectively. In the blood cells, there are fewer genes with a dominant abundance, although the most abundant transcript in neutrophils is the gene encoding the intracellular protein ferritin light chain (FTL), a subunit of ferritin, the major protein responsible for intracellular iron storage. A notable example of a gene with abundant transcripts, but with almost no known functional information, is the interferon-induced transmembrane protein 2 (IFITM2), which is highly expressed in neutrophils and here is shown in spleen. The transcriptome maps demonstrate the high specialization of each tissue, with a large portion of the transcript burden devoted to functions of relevance for the corresponding cells in the respective tissue type.
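
The transcriptome usage per protein class amounts to summing pTPM values within each location category and dividing by the sample total. The sketch below assumes a per-gene pTPM dictionary and a per-gene location annotation; the values and the placeholder genes INTRA_1 and MEMB_1 are hypothetical, and only the secreted status of the pancreatic enzymes PRSS1 and PNLIP follows the text.

```python
from collections import defaultdict

def usage_by_location(ptpm_by_gene, location_by_gene):
    """Fraction of the pTPM load attributed to each protein location class."""
    totals = defaultdict(float)
    for gene, value in ptpm_by_gene.items():
        # Genes without an annotation are pooled rather than dropped.
        totals[location_by_gene.get(gene, "unannotated")] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("sample has no expression")
    return {location: value / grand_total for location, value in totals.items()}

# Hypothetical pancreas-like sample; the values are toy numbers.
sample = {"PRSS1": 400_000.0, "PNLIP": 300_000.0,
          "INTRA_1": 200_000.0, "MEMB_1": 100_000.0}
locations = {"PRSS1": "secreted", "PNLIP": "secreted",
             "INTRA_1": "intracellular", "MEMB_1": "membrane"}
usage = usage_by_location(sample, locations)
```

With these toy numbers, 70% of the load falls in the secreted class, mimicking the secretory dominance described for pancreas without reproducing the measured value.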

Number of detected genes and the “housekeeping” genes

An analysis of the number of detected genes in the various samples (Fig. 4D) shows that ~16,000 genes are detected in the four combined groups of multiple tissue types (blood, brain, intestine, and lymphoid tissues), while the analysis of single tissues shows a slightly smaller number of genes (~14,000 on average)—with the exception of testis, in which 16,598 genes are detected. This is in contrast to the much smaller number of detected genes when analyzing cell lines (~9500 genes per cell line) and single blood cell types (~10,000 genes). The fact that more genes are detected in tissues as compared with the single cell type analysis is not unexpected, as it reflects the presence of a multitude of different cell types present in composite tissues. The observation that a slightly smaller number of genes are detected in the cell lines as compared with the single blood cells is interesting, and it is tempting to speculate that this is due to the in vitro specialization of the cell lines.

Almost half (49%) of the protein-coding genes (n = 9638) were detected in all analyzed tissues (Fig. 4E), and these genes include known “housekeeping” genes encoding mitochondrial proteins, and proteins involved in overall cell structure, translation, transcription, and replication. An analysis of the human cell lines shows that 4101 genes are detected in all samples. Similarly, the analysis of the 18 single blood cell types shows that 5874 genes are ubiquitously detected across all immune cells. If the tissues, cell lines, and single blood cell types are combined, the number of protein-coding genes detected in all samples is decreased to 3399 (Fig. 4E). This is still a much larger number when compared with the determination of essential genes using genome-wide CRISPR-Cas9 knock outs (33, 34), which identified 1824 and 1527 genes with unconditional importance for cell survival, respectively. This suggests that many genes are present in all cells but that they perform redundant functions in cell lines. Altogether, we identified genes that are both essential in genome-wide knock-out screens and here detected in all blood cells, cell lines, tissues, and organs. This list of genes (available at www.proteinatlas.org) contains many well-known housekeeping genes involved in replication, translation, and cellular processes, and more in-depth studies are needed to explore the function of the genes detected in all tissues and yet not identified as essential by the knock-out screen.
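
The comparison between ubiquitously detected genes and essential genes is, in effect, a set intersection. The following sketch assumes one table per sample type that maps each gene to its values across samples, a detection cutoff of 1 (read here as "at least 1", which is an assumption), and a set of genes flagged as essential by a knock-out screen. All gene names and values are hypothetical placeholders.

```python
def detected_in_all(table, cutoff=1.0):
    """Return genes whose value reaches the cutoff in every sample of the table."""
    return {
        gene for gene, values in table.items()
        if values and all(v >= cutoff for v in values)
    }

# Hypothetical expression tables: gene -> values across samples of that type.
tissues = {"GENE_A": [12.0, 8.5, 30.1], "GENE_B": [5.0, 0.2, 9.0],
           "GENE_C": [3.0, 4.0, 2.5], "GENE_D": [2.0, 3.0, 4.0]}
cell_lines = {"GENE_A": [20.0, 15.0], "GENE_B": [7.0, 6.0],
              "GENE_C": [0.0, 1.4], "GENE_D": [1.5, 2.0]}
blood_cells = {"GENE_A": [4.0, 6.0], "GENE_B": [2.0, 2.0],
               "GENE_C": [9.0, 8.0], "GENE_D": [1.0, 3.0]}
essential = {"GENE_A", "GENE_B"}  # hypothetical knock-out screen hits

# Genes detected in every tissue, every cell line, and every blood cell type.
everywhere = (detected_in_all(tissues)
              & detected_in_all(cell_lines)
              & detected_in_all(blood_cells))
essential_and_everywhere = everywhere & essential
everywhere_not_essential = everywhere - essential
```

In this toy example, GENE_A is both essential and detected everywhere, while GENE_D is detected everywhere but not essential, which corresponds to the class of genes the text singles out for further functional study.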

It is reassuring that the number of “missing genes,” i.e., those not detected in any tissue or cell type, is now reduced to 216, which is only ~1% of the total number of predicted protein-coding genes. We therefore revised (35) the number of genes for which evidence at protein level is present by combining our antibody-based data with the manual annotation of literature by the UniProt consortium (36) and the results from mass spectrometry–based proteogenomics analyses (37). The analysis showed that there are 17,660 protein-coding genes with proteins identified from at least one of the three efforts and 15,155 genes with experimental evidence from at least two of the efforts (fig. S8; see www.proteinatlas.org/humanproteome/proteinevidence for details). Furthermore, there are 1794 additional genes with evidence only at the RNA level, and these genes are obvious targets for more comprehensive functional protein studies. It is notable that chromosome 11 has many more missing genes than the other chromosomes, likely owing to its high number of olfactory genes. A summary of the supporting data in a chromosome-centric manner is shown in the new version of the Human Protein Atlas launched as part of this publication.

Classification of cell type–specific expression profiles in human blood immune cells

We next performed a genome-wide analysis with regard to expression profiles in the blood cells for the identification of proteins with an elevated expression in immune cells. This was performed both on the cell type level (n = 18 cell types) and on cell lineage level in which the various cell types were combined into six groups, including T cells, B cells, and granulocytes (see full list of results in data S1). The number of genes in each of the five specificity categories is shown in Fig. 5A, with 1448 genes classified as cell type–enriched in one of the cell types and 5934 (30%) of all protein-coding genes elevated in at least one of the human blood cell types. Many genes (n = 3797) were not detected in any of the blood cells, while 9939 showed low specificity for expression in blood cells. The cell type distribution (fig. S9) showed that only 1713 genes were detected in a single cell type, while 5934 were detected in all 18 cell types. The relationship of the two classification schemes is compared in fig. S10 and table S4, showing that 889 genes are cell type–enriched and detected in a single cell type. These genes are of interest for further study to explore the biological functions linked to the respective different cell phenotypes. A heatmap showing the transcript expression profiles for all 1448 immune cell type–enriched genes shows that most are found in neutrophils, basophils, and plasmacytoid dendritic cells (Fig. 5B), while the group-enriched genes are more evenly distributed across the 18 cell types.
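As a quick consistency check on the counts quoted above, the short Python sketch below sums the three groups and recomputes the elevated fraction. It assumes that the elevated, not detected, and low-specificity groups partition the protein-coding genes; the total is inferred from that assumption and is not stated in this section.

```python
# Counts quoted in the paragraph above.
elevated = 5934          # elevated in at least one blood cell type
not_detected = 3797      # not detected in any of the blood cells
low_specificity = 9939   # low specificity for expression in blood cells

# Assumption: the three groups are disjoint and cover all protein-coding genes.
total = elevated + not_detected + low_specificity
elevated_fraction = elevated / total

print(total)                           # 19670
print(round(elevated_fraction * 100))  # 30
```

The recomputed fraction agrees with the 30% given in the text.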

Fig. 5 Cell type–specific classification of the human blood cells.

(A) The number of genes classified according to cell type specificity. (B) A heatmap showing the expression of all the cell type–enriched genes across the 18 cell types. Heatmaps for the other specificity categories can be found in figs. S12 to S15. (C) Network plot showing the number of cell type– and group-enriched genes in the 18 cell types. The network is limited to nodes with a minimum number of seven genes. (D) (Left) A dendrogram based on the correlation of global expression profiles across the 18 cell types. (Right) Barplot displaying the number of elevated genes for each cell type. (E) The relationship of all human protein-coding genes with regard to single blood cell type specificity and whole-body tissue and organ specificity.

A network plot of all cell type–enriched and group-enriched genes (Fig. 5C) reveals a cluster of genes enriched in T cells and another cluster enriched in myeloid cells. Many genes (n = 114) are also shared between the two types of B cell populations (memory and naïve). In Fig. 5D, the number of elevated genes in the different blood cell types, clustered on the basis of the expression profiles, is shown, again highlighting the many cell type–enriched genes in neutrophils, eosinophils, and plasmacytoid dendritic cells, while many of the elevated genes in T and B cells are group-enriched across subpopulations of these lymphocytes. In fig. S11, all group-enriched and tissue-enriched genes are visualized and the relationship of sharing enriched expression between the cell types can be observed.

The extensive data generated here also allowed us to investigate the relationship between (body-wide) tissue expression and the expression in the single blood cell types. In Fig. 5E, a summary of all individual genes is shown with classification based on distribution in all tissues and blood cell types, respectively, and a summary of the genes that are enriched both on tissue level and blood cell level can be found in fig. S15. Some, but not a majority, of the genes expressed in a single or several blood cell types are shown to be predominately expressed in blood cells even when all major tissues and organs are considered. It is notable that many of the genes detected in all tissues are only detected in some of the blood cell types, suggesting that they are not necessary for cell survival.

Enriched genes among the blood immune cell types

Using our definition of cell population enrichment and cell group enrichment of genes, we analyzed the enriched genes among the 18 immune cell populations. Figure 6A shows the top five genes most enriched for each cell population, colored by their predicted protein location either in the membrane, secreted, or intracellular. Notable examples (Fig. 6B) include catalase (CAT), a gene encoding a key antioxidant enzyme converting the toxic reactive oxygen species hydrogen peroxide to water and oxygen and believed to be expressed broadly in the peroxisome of most cells (38). Our data indicated a strongly enriched expression level of CAT in eosinophils, which is much higher than the expression in any other immune cell population. This finding warrants more mechanistic analyses of CAT in eosinophils. Another notable finding is the chemokine receptor CXCR6, which is more highly expressed by MAIT cells than any other cell population, suggesting a particular importance of this receptor and its ligand, the chemokine CXCL16, in regulating MAIT cell trafficking. MAIT cells are a population of T cells that has gained a lot of interest in recent years for its role in antibacterial defense, particularly on mucosal sites, through its recognition of molecules derived from the bacterial and fungal riboflavin biosynthesis pathway (39). These cells have been shown to express multiple trafficking receptors, and their circulation between blood and tissues has been debated.
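The enrichment call and per-population ranking described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' classification code: the fourfold threshold is the one the Discussion gives for tissue enrichment and is assumed here to apply to cell types as well, ranking is by fold over the runner-up (the figure may rank differently), and the function names, gene names, and values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Fourfold rule quoted in the Discussion for tissues; assumed here for cell types.
FOLD_THRESHOLD = 4.0


def enriched_cell_type(profile: Dict[str, float],
                       fold: float = FOLD_THRESHOLD) -> Optional[Tuple[str, float]]:
    """Return (cell type, fold over runner-up) if one cell type is enriched, else None."""
    if len(profile) < 2:
        return None
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    (top_cell, top_value), (_, second_value) = ranked[0], ranked[1]
    if top_value <= 0:
        return None
    # Avoid division by zero when the runner-up is not expressed at all.
    ratio = float("inf") if second_value <= 0 else top_value / second_value
    return (top_cell, ratio) if ratio >= fold else None


def top_enriched(genes: Dict[str, Dict[str, float]], k: int = 5) -> Dict[str, List[str]]:
    """Group enriched genes by cell type, keeping the k genes with the largest fold."""
    hits: Dict[str, List[Tuple[float, str]]] = {}
    for gene, profile in genes.items():
        call = enriched_cell_type(profile)
        if call is not None:
            cell, ratio = call
            hits.setdefault(cell, []).append((ratio, gene))
    return {cell: [gene for _, gene in sorted(pairs, reverse=True)[:k]]
            for cell, pairs in hits.items()}


# Toy, made-up profiles for illustration only.
toy_profiles = {
    "GENE_X": {"eosinophil": 80.0, "neutrophil": 10.0, "pDC": 5.0},  # 8x runner-up
    "GENE_Y": {"eosinophil": 2.0, "neutrophil": 3.0, "pDC": 15.0},   # 5x runner-up
    "GENE_Z": {"eosinophil": 12.0, "neutrophil": 10.0, "pDC": 9.0},  # 1.2x runner-up
}

enriched_by_cell = top_enriched(toy_profiles)
```

In this toy input, `GENE_X` is called enriched in eosinophils and `GENE_Y` in pDCs, while `GENE_Z` fails the fourfold test and is left out.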

Fig. 6 The relationship between blood cell type–specific genes and tissue-specific genes and analysis of genes causing inborn errors of immunity.

(A) The expression levels of all cell type–enriched genes, with the five most abundant genes named. (B) The expression profiles of some selected genes. (C) The results of flow sorting (CyTOF) using antibodies toward GZMB and CD45. (D) A heatmap showing the expression of 224 genes known to cause human inborn errors of immunity and their expression across all major tissues in the human body. A similar heatmap containing the gene names can be found in fig. S18, and separate heatmaps of each major disease type in all blood cells and tissues can be found in fig. S19. (E) IHC images from the Human Protein Atlas for four of the genes causing inborn errors.

Another example is the granzyme B (GZMB) gene, a well-known serine protease secreted in granules by cytotoxic T cells and NK cells and necessary for target cell apoptosis (40). We found that GZMB expression is strongly enriched in plasmacytoid dendritic cells (pDCs). GZMB expression in pDCs has been reported previously (40), but according to our data, GZMB expression in pDCs is about fivefold higher than in any other cell type, which suggests an important function of granzyme B in pDCs (41). It is of interest that the population of pDCs also exhibits elevated levels of several other genes (AXL, PPP1R14A, SIGLEC6, ITM2C, and DAB2) suggested to be specific for a low abundant subgroup of DCs called AS DC with negative GZMB expression, recently described by Villani et al. (42). Because GZMB variants have been associated with the autoimmune disease vitiligo (43), pDCs could potentially play an unappreciated role in the pathogenesis of this condition. To confirm this elevated expression in pDC at the protein level, blood immune cells were analyzed by mass cytometry, and the results confirm higher protein levels of granzyme B in the cytoplasm of pDCs as compared with NK cells and CD8+ T cells (Fig. 6C). The GZMB expression levels examined by mass cytometry could not distinguish the proposed AS DC subgroup within the pDC population.

We also complemented our classification strategy by performing a large number of differential expression analyses based on DESeq2 (44) to identify genes with variable expression when comparing two cell lineages or two cell populations (fig. S17). The comparison between the B cell and T cell lineages shows many genes with differential expression, including well-known B cell markers, such as CD19, CD22, and CD79, but also several genes not previously described as elevated in B cells, such as Ras associated domain family member 6 (RASSF6) and the zinc finger protein 860 (ZNF860). Similarly, genes identified as T cell markers include well-known genes, such as CD3, CD6, inducible T-cell costimulatory (ICOS), and thymocyte selection associated (THEMIS), but also other genes not yet identified as T cell elevated, such as Ras guanyl releasing protein 1 (RASGRP1) and fibroblast growth factor binding protein 2 (FGFBP2). All significantly differentially expressed genes for each DESeq2 analysis are available as a separate list (data S2).
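Once a pairwise comparison has produced a fold change and an adjusted p-value for each gene, splitting the significant genes by direction is a simple filter. The Python sketch below illustrates only that post-processing step; it does not reimplement DESeq2, which is an R package. The cutoffs (`padj_cutoff`, `min_abs_log2fc`), the record type, and the gene names and values are assumptions for illustration; the paper's actual significance criteria are not given in this section.

```python
from typing import Iterable, List, NamedTuple, Tuple


class DEResult(NamedTuple):
    gene: str
    log2_fold_change: float  # positive: higher in group A; negative: higher in group B
    padj: float              # adjusted p-value; NaN when no value is available


def split_significant(results: Iterable[DEResult],
                      padj_cutoff: float = 0.05,      # assumed cutoff
                      min_abs_log2fc: float = 1.0     # assumed cutoff
                      ) -> Tuple[List[str], List[str]]:
    """Return (genes higher in A, genes higher in B) among significant results."""
    higher_in_a: List[str] = []
    higher_in_b: List[str] = []
    for r in results:
        # "not (x < cutoff)" also rejects NaN, which would slip through "x >= cutoff".
        if not (r.padj < padj_cutoff):
            continue
        if abs(r.log2_fold_change) < min_abs_log2fc:
            continue
        (higher_in_a if r.log2_fold_change > 0 else higher_in_b).append(r.gene)
    return sorted(higher_in_a), sorted(higher_in_b)


# Toy, made-up results for a hypothetical B cell (A) versus T cell (B) comparison.
toy_results = [
    DEResult("GENE_B1", 3.2, 1e-8),           # significant, higher in A
    DEResult("GENE_T1", -2.5, 1e-5),          # significant, higher in B
    DEResult("GENE_N1", 0.3, 1e-4),           # effect too small
    DEResult("GENE_N2", 4.0, 0.2),            # not significant
    DEResult("GENE_N3", -3.0, float("nan")),  # no adjusted p-value
]

b_cell_elevated, t_cell_elevated = split_significant(toy_results)
```

Only `GENE_B1` and `GENE_T1` pass both filters here; the other three rows are dropped for a small effect size, a large adjusted p-value, and a missing adjusted p-value, respectively.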

Cellular expression of genes causing inborn errors of immunity

In a recent listing of primary immunodeficiency diseases (PID), 354 diseases were listed as consequences of monogenic defects in genes associated with the immune system (45) involving 224 known genes. The mechanism of disease is often incompletely understood, and we reasoned that an analysis of cellular expression of identified genes could help generate better hypotheses for further mechanistic investigation. We analyzed the NX levels of 224 PID genes across the 18 sorted immune cell populations, as well as some selected tissue profiles, and identified seven clusters with shared cellular and tissue distribution (Fig. 6D and figs. S18 and S19). A first group (cluster A) consists of 11 proteins restricted to T cells and NK cells, such as CD3 and the signaling intermediates ZAP70 and LCK (Fig. 6E). A second group (cluster B) consists of a subgroup of 15 genes present in all blood cells, but with much lower expression in the other tissues. Cluster C consists of genes ubiquitously expressed across all analyzed tissues and immune cell types. Cluster D consists of 34 proteins mainly originating from the liver and involves known plasma proteins such as complement factors C5, C8, and C9. Cluster E consists of proteins mainly expressed in particular cell lineages, such as the B cell–restricted proteins CD19 and CD79A. Cluster F consists of genes with elevated expression in monocytes and dendritic cells, and cluster G has relatively high expression in lymphoid tissues and bone marrow but low expression in the mature immune cell types in circulation. Several examples of interesting expression patterns can be observed, including the CEBPE gene (cluster E) causing specific granule deficiency 1 (SG1) (46) that has high expression in eosinophils. This condition has been considered a neutrophil-granule deficiency associated with recurrent pyogenic infections, but our cell type expression pattern indicates that CEBPE is mostly expressed by eosinophils and not at all by neutrophils. It is possible that during neutrophil development, or upon stimulation, CEBPE might also be expressed in neutrophils, but our results suggest that eosinophil deficiency should also be considered in SG1. This use case illustrates the usefulness of the updated Human Protein Atlas as novel genes are identified as possible causes of immunodeficiencies and other diseases in human patients.

Discussion

Here, we present an atlas of the expression of all protein-coding genes in human blood cells, and this data has been integrated with an analysis of the tissue specificity of all genes covering all major tissues and organs in the human body. An interactive Blood Atlas resource is presented as part of the Human Protein Atlas, including expression data from other sources, such as blood cell transcriptomics from Monaco et al. (21) and Schmiedel et al. (20). The resource described here enables comparative analysis with other sources of data, such as single-cell genomics, proteomics, and antibody-based measurements, to allow comprehensive molecular profiles of the individual human blood cell types. In addition, the Tissue Atlas (17) was complemented with transcript expression data for brain and other normal tissue types from GTEx (4) and FANTOM5 (5). A normalization strategy has been introduced which has allowed integration of the various diverse datasets to produce a consensus classification across the cells, tissues, and organs. This has enabled the analysis of the cell type–specific expression across the blood immune cell types as well as the various tissues and organs. A revised classification of all protein-coding genes is presented with regard to both cell and tissue distribution.

The tissue expression profiles described earlier (17) are supported, but the inclusion of the comprehensive single cell type analysis of human blood, together with inclusion of more brain regions and specialized tissue, has changed some of the patterns of tissue specificity. The brain now has the highest number of elevated genes, while testis still has most enriched genes, defined as an expression fourfold higher than that of any other tissue. The inclusion of more cells and tissues has also allowed us to provide evidence for many more genes, and the total number of missing genes with no protein or RNA evidence is now only ~200. For blood cells, a comprehensive list of all proteins showing an enriched expression in the various cell types is presented, confirming well-known protein markers but also identifying interesting targets for in-depth analysis both to study the basic biology of blood cells and to develop new targets for immune-based diagnostics and therapies. The examples presented here illustrate the potential of the Blood Atlas, and its determination of cell type gene enrichment, for the generation of hypotheses from previously unknown differences in cell population expression of important genes in the immune system.

This newly created resource elucidates the gene expression of individual immune cell populations to allow a better understanding of diseases involving the immune system. The emerging technology of single-cell genomics (42, 47) will in the future be a good complement to such studies to identify low-abundance cell subpopulations not previously described. Here, we also highlighted the cell type–specific expression of 224 genes associated with primary immunodeficiencies in humans, and we find cell type–specific expression patterns of relevance for their respective clinical phenotype. A large fraction of these genes is expressed in a large number of cell types, reinforcing the need to take a holistic, body-wide approach to identify genes of importance for human biology and diseases. To facilitate such studies, we have launched an interactive, open-access Blood Atlas with all the data integrated as part of the Human Protein Atlas, allowing for genome-wide exploration of the protein-coding genes expressed across immune cell populations and in relation to spatial expression patterns in all major human tissues and organs.

Supplementary Materials

science.sciencemag.org/content/366/6472/eaax9198/suppl/DC1

Materials and Methods

Figs. S1 to S21

Tables S1 to S4

References (48–61)

Data S1 to S3

References and Notes

Acknowledgments: We thank the nurses at the Coagulation Unit, Karolinska University Hospital, for their assistance in handling donors and sampling. We acknowledge the entire staff of the Human Protein Atlas program and the Science for Life Laboratory for their valuable contributions. Funding: Funding was provided by the Knut and Alice Wallenberg Foundation (WCPR), the Erling Persson Foundation (KCAP), and the Novo Nordisk Foundation (CFB). Support from the National Genomics Infrastructure in Stockholm is acknowledged, with funding from Science for Life Laboratory, the Knut and Alice Wallenberg Foundation, the Swedish Research Council, and SNIC/Uppsala Multidisciplinary Center for Advanced Computational Science. Author contributions: M.U. and L.F. conceived of and designed the study. C.P., J.Mi., T.L., and P.B. performed the single cell sorting and analysis. B.F., F.E., and J.O. collected the clinical samples. M.U., Å.S., A.M., C.L., E.S., J.Mi., B.F., F.E., F.P., A.H., W.Z., M.J.K., J.Mu., A.T., C.Z., and L.F. performed the data analysis. K.v.F., P.O., and M.Z. provided the infrastructure for the data. M.U. drafted the manuscript. M.U., L.F., P.B., M.J.K., and Å.S. revised the manuscript. All authors discussed the results and contributed to the final manuscript. Competing interests: No competing interests. Data and materials availability: All raw flow cytometry data are available at FlowRepository (http://flowrepository.org/) under ID FR-FCM-Z28R. Sequencing data used in the study are available without restriction at the Human Protein Atlas portal (www.proteinatlas.org/about/download). All custom code used for normalization and categorization can be downloaded from Github (https://github.com/human-protein-atlas/BloodAtlas).
