Special Reviews

The Sequence of the Human Genome


Science  16 Feb 2001:
Vol. 291, Issue 5507, pp. 1304-1351
DOI: 10.1126/science.1058040
  • Figure 2

    Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

  • Figure 3

    Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.

  • Figure 4

    Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used.

  • Figure 5

    Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated.

  • Figure 6

    Comparison of the CSA and the PFP assembly. (A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb region of chromosome 8 representing a single Celera scaffold. To generate the figure, Celera fragment sequences were mapped onto each assembly. The PFP assembly is indicated in the upper third of each panel; the Celera assembly is indicated in the lower third. In the center of the panel, green lines show Celera sequences that are in the same order and orientation in both assemblies and form the longest consistently ordered run of sequences. Yellow lines indicate sequence blocks that are in the same orientation, but out of order. Red lines indicate sequence blocks that are not in the same orientation. For clarity, in the latter two cases, lines are only drawn between segments of matching sequence that are at least 50 kbp long. The top and bottom thirds of each panel show the extent of Celera mate-pair violations (red, misoriented; yellow, incorrect distance between the mates) for each assembly grouped by library size. (Mate pairs that are within the correct distance, as expected from the mean library insert size, are omitted from the figure for clarity.) Predicted breakpoints, corresponding to stacks of violated mate pairs of the same type, are shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns are shown as cyan bars. Plots of all 24 chromosomes can be seen in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

  • Figure 7

    Schematic view of the distribution of breakpoints and large gaps on all chromosomes. For each chromosome, the upper pair of lines represents the PFP assembly, and the lower pair of lines represents Celera's assembly. Blue tick marks represent breakpoints, whereas red tick marks represent gaps larger than 10,000 bp. The number of breakpoints per chromosome is indicated in black, and the chromosome numbers in red.

  • Figure 8

    Analysis of split genes resulting from different annotation methods. A set of 4512 Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by supplying all available evidence to Genscan) were tallied. These data show the degree to which multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq transcript. The zero class for the Otto-homology predictions shown here indicates that the Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call was made because of insufficient evidence.

  • Figure 9

    Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20.

  • Figure 10

    Relation between G+C content and gene density. The blue bars show the percent of the genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of genes associated with each G+C bin is represented by the yellow bars. The graph shows that about 5% of the genome has a G+C content of between 50 and 55%, but that this portion contains nearly 15% of the genes.
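
    As a rough illustration of the windowed G+C calculation behind this figure, the following Python sketch computes the G+C fraction of consecutive non-overlapping windows. The function name `gc_content_windows`, the choice to drop a trailing partial window, and the exclusion of undetermined bases (N) from the denominator are assumptions made for the example; the caption does not say how those cases were handled.

```python
def gc_content_windows(seq, window=50_000):
    """Return the G+C fraction of each full, non-overlapping window.

    Assumptions (not stated in the source): a trailing partial window is
    dropped, and bases other than A/C/G/T are excluded from the denominator.
    """
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue  # window consists only of undetermined bases
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / called)
    return fractions


if __name__ == "__main__":
    # Toy sequence with a 4-bp window in place of the 50-kbp windows.
    print(gc_content_windows("GGCCAATTGCNN", window=4))
```

    Binning these fractions in 5% steps and counting the genes that fall in each bin would give the two bar series described in the caption.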

  • Figure 11

    Genome structural features. Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp windows. The number of ESTs and Alu elements is shown per 100-kbp window.

  • Figure 12

    Gene duplication in complete protein clusters. The predicted protein sets of human, worm, and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole number) of human versus worm and human versus fly proteins per cluster were plotted.

  • Figure 13

    Segmental duplications between chromosomes in the human genome. The 24 panels show the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line represents a pair of homologous genes belonging to a block; all blocks contain at least three genes on each of the chromosomes where they appear. Each panel shows all the duplications between a single chromosome and other chromosomes with shared blocks. The chromosome at the center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed from top to bottom within each panel ordered by chromosome number. The inset (bottom, center right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display the gene names of 12 of the 64 gene pairs shown.

  • Figure 14

    SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely accounted for by a coalescent model of regional history.

  • Figure 15

    Distribution of the molecular functions of 26,383 human genes. Each slice lists the numbers and percentages (in parentheses) of human gene functions assigned to a given category of molecular function. The outer circle shows the assignment to molecular function categories in the Gene Ontology (GO) (179), and the inner circle shows the assignment to Celera's Panther molecular function categories (116).

  • Figure 16

    Functions of putative orthologs across vertebrate and invertebrate genomes. Each slice lists the number and percentages (in parentheses) of “strict orthologs” between the human, fly, and worm genomes involved in a given category of molecular function. “Strict orthologs” are defined here as bi-directional BLAST best hits (180) such that each orthologous pair (i) has a BLASTP P-value of ≤10^-10 (120), and (ii) has a more significant BLASTP score than any paralogs in either organism, i.e., there has likely been no duplication subsequent to speciation that might make the orthology ambiguous. This measure is quite strict and is a lower bound on the number of orthologs. By these criteria, there are 2758 strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common between these sets).
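
    The “strict ortholog” criteria in this caption can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the paper: the hit tuples `(subject, p_value, score)`, the function names, and the `best_paralog_score` lookup (the best within-species score for each gene) are hypothetical inputs assumed to have been derived from BLASTP output beforehand, and only the forward hit's P-value and score are checked.

```python
def best_hit(hits):
    """Return the highest-scoring (subject, p_value, score) hit, or None."""
    return max(hits, key=lambda h: h[2]) if hits else None


def strict_orthologs(a_vs_b, b_vs_a, best_paralog_score, p_cutoff=1e-10):
    """Return sorted (gene_a, gene_b) pairs passing the caption's criteria.

    a_vs_b, b_vs_a: dicts mapping a query gene to its list of hits in the
    other organism. best_paralog_score: dict mapping a gene to the best score
    it obtains against another gene of its own organism (hypothetical input).
    """
    pairs = []
    for gene_a, hits in a_vs_b.items():
        forward = best_hit(hits)
        if forward is None:
            continue
        gene_b, p_value, score = forward
        reverse = best_hit(b_vs_a.get(gene_b, []))
        if reverse is None or reverse[0] != gene_a:
            continue  # not a bi-directional best hit
        if p_value > p_cutoff:
            continue  # criterion (i): P-value must be <= 1e-10
        if (score <= best_paralog_score.get(gene_a, 0.0)
                or score <= best_paralog_score.get(gene_b, 0.0)):
            continue  # criterion (ii): a paralog scores at least as well
        pairs.append((gene_a, gene_b))
    return sorted(pairs)


if __name__ == "__main__":
    human_vs_fly = {
        "h1": [("f1", 1e-50, 200.0), ("f2", 1e-20, 90.0)],
        "h2": [("f2", 1e-5, 40.0)],    # fails the P-value cutoff
        "h3": [("f3", 1e-30, 120.0)],  # outscored by a human paralog
    }
    fly_vs_human = {
        "f1": [("h1", 1e-50, 200.0)],
        "f2": [("h2", 1e-5, 40.0)],
        "f3": [("h3", 1e-30, 120.0)],
    }
    print(strict_orthologs(human_vs_fly, fly_vs_human, {"h3": 150.0}))
```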

  • Table 1

    Celera-generated data input into assembly.

    Individual | Number of reads for different insert libraries: 2 kbp | 10 kbp | 50 kbp | Total | Total number of base pairs
    No. of sequencing reads
    A | 0 | 0 | 2,767,357 | 2,767,357 | 1,502,674,851
    B | 11,736,757 | 7,467,755 | 66,930 | 19,271,442 | 10,464,393,006
    C | 853,819 | 881,290 | 0 | 1,735,109 | 942,164,187
    D | 952,523 | 1,046,815 | 0 | 1,999,338 | 1,085,640,534
    F | 0 | 1,498,607 | 0 | 1,498,607 | 813,743,601
    Total | 13,543,099 | 10,894,467 | 2,834,287 | 27,271,853 | 14,808,616,179
    Fold sequence coverage (2.9-Gb genome)
    A | 0 | 0 | 0.52 | 0.52
    B | 2.20 | 1.40 | 0.01 | 3.61
    C | 0.16 | 1.17 | 0 | 0.32
    D | 0.18 | 0.20 | 0 | 0.37
    F | 0 | 0.28 | 0 | 0.28
    Total | 2.54 | 2.04 | 0.53 | 5.11
    Fold clone coverage
    A | 0 | 0 | 18.39 | 18.39
    B | 2.96 | 11.26 | 0.44 | 14.67
    C | 0.22 | 1.33 | 0 | 1.54
    D | 0.24 | 1.58 | 0 | 1.82
    F | 0 | 2.26 | 0 | 2.26
    Total | 3.42 | 16.43 | 18.84 | 38.68
    Insert size* (mean), average | 1,951 bp | 10,800 bp | 50,715 bp
    Insert size* (SD), average | 6.10% | 8.10% | 14.90%
    % Mates, average | 74.50 | 80.80 | 75.60
    • * Insert size and SD are calculated from assembly of mates on contigs.

    • % Mates is based on laboratory tracking of sequencing runs.
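
    The fold sequence coverage values in Table 1 are consistent with dividing the total number of sequenced base pairs by the 2.9-Gb genome size given in the table. A short Python check, using only the per-individual base-pair totals from the table (the function and variable names are illustrative):

```python
GENOME_SIZE_BP = 2_900_000_000  # 2.9-Gb genome size stated in Table 1


def fold_coverage(total_bp, genome_size_bp=GENOME_SIZE_BP):
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bp / genome_size_bp


# Total number of base pairs per individual, taken from Table 1.
BASES_BY_INDIVIDUAL = {
    "A": 1_502_674_851,
    "B": 10_464_393_006,
    "C": 942_164_187,
    "D": 1_085_640_534,
    "F": 813_743_601,
}

if __name__ == "__main__":
    for individual, bases in BASES_BY_INDIVIDUAL.items():
        print(individual, round(fold_coverage(bases), 2))
    total = sum(BASES_BY_INDIVIDUAL.values())
    print("Total", round(fold_coverage(total), 2))  # 5.11
```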

  • Table 2

    GenBank data input into assembly.

    Statistics | Completion phase sequence: 0 | 1 and 2 | 3
    Whitehead Institute/MIT Center for Genome Research, USA
    Number of accession records | 2,825 | 6,533 | 363
    Number of contigs | 243,786 | 138,023 | 363
    Total base pairs | 194,490,158 | 1,083,848,245 | 48,829,358
    Total vector masked (bp) | 1,553,597 | 875,618 | 2,202
    Total contaminant masked (bp) | 13,654,482 | 4,417,055 | 98,028
    Average contig length (bp) | 798 | 7,853 | 134,516
    Washington University, USA
    Number of accession records | 19 | 3,232 | 1,300
    Number of contigs | 2,127 | 61,812 | 1,300
    Total base pairs | 1,195,732 | 561,171,788 | 164,214,395
    Total vector masked (bp) | 21,604 | 270,942 | 8,287
    Total contaminant masked (bp) | 22,469 | 1,476,141 | 469,487
    Average contig length (bp) | 562 | 9,079 | 126,319
    Baylor College of Medicine, USA
    Number of accession records | 0 | 1,626 | 363
    Number of contigs | 0 | 44,861 | 363
    Total base pairs | 0 | 265,547,066 | 49,017,104
    Total vector masked (bp) | 0 | 218,769 | 4,960
    Total contaminant masked (bp) | 0 | 1,784,700 | 485,137
    Average contig length (bp) | 0 | 5,919 | 135,033
    Production Sequencing Facility, DOE Joint Genome Institute, USA
    Number of accession records | 135 | 2,043 | 754
    Number of contigs | 7,052 | 34,938 | 754
    Total base pairs | 8,680,214 | 294,249,631 | 60,975,328
    Total vector masked (bp) | 22,644 | 162,651 | 7,274
    Total contaminant masked (bp) | 665,818 | 4,642,372 | 118,387
    Average contig length (bp) | 1,231 | 8,422 | 80,867
    The Institute of Physical and Chemical Research (RIKEN), Japan
    Number of accession records | 0 | 1,149 | 300
    Number of contigs | 0 | 25,772 | 300
    Total base pairs | 0 | 182,812,275 | 20,093,926
    Total vector masked (bp) | 0 | 203,792 | 2,371
    Total contaminant masked (bp) | 0 | 308,426 | 27,781
    Average contig length (bp) | 0 | 7,093 | 66,978
    Sanger Centre, UK
    Number of accession records | 0 | 4,538 | 2,599
    Number of contigs | 0 | 74,324 | 2,599
    Total base pairs | 0 | 689,059,692 | 246,118,000
    Total vector masked (bp) | 0 | 427,326 | 25,054
    Total contaminant masked (bp) | 0 | 2,066,305 | 374,561
    Average contig length (bp) | 0 | 9,271 | 94,697
    Others*
    Number of accession records | 42 | 1,894 | 3,458
    Number of contigs | 5,978 | 29,898 | 3,458
    Total base pairs | 5,564,879 | 283,358,877 | 246,474,157
    Total vector masked (bp) | 57,448 | 279,477 | 32,136
    Total contaminant masked (bp) | 575,366 | 1,616,665 | 1,791,849
    Average contig length (bp) | 931 | 9,478 | 71,277
    All centers combined
    Number of accession records | 3,021 | 21,015 | 9,137
    Number of contigs | 258,943 | 409,628 | 9,137
    Total base pairs | 209,930,983 | 3,360,047,574 | 835,722,268
    Total vector masked (bp) | 1,655,293 | 2,438,575 | 82,284
    Total contaminant masked (bp) | 14,918,135 | 16,311,664 | 3,365,230
    Average contig length (bp) | 811 | 8,203 | 91,466
    • * Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center; University of Washington.

    • The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.96× coverage of the genome.

  • Table 3

    Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

    Statistic | Scaffold size: All | >30 kbp | >100 kbp | >500 kbp | >1000 kbp
    Compartmentalized shotgun assembly
    No. of bp in scaffolds (including intrascaffold gaps) | 2,905,568,203 | 2,748,892,430 | 2,700,489,906 | 2,489,357,260 | 2,248,689,128
    No. of bp in contigs | 2,653,979,733 | 2,524,251,302 | 2,491,538,372 | 2,320,648,201 | 2,106,521,902
    No. of scaffolds | 53,591 | 2,845 | 1,935 | 1,060 | 721
    No. of contigs | 170,033 | 112,207 | 107,199 | 93,138 | 82,009
    No. of gaps | 116,442 | 109,362 | 105,264 | 92,078 | 81,288
    No. of gaps ≤1 kbp | 72,091 | 69,175 | 67,289 | 59,915 | 53,354
    Average scaffold size (bp) | 54,217 | 966,219 | 1,395,602 | 2,348,450 | 3,118,848
    Average contig size (bp) | 15,609 | 22,496 | 23,242 | 24,916 | 25,686
    Average intrascaffold gap size (bp) | 2,161 | 2,054 | 1,985 | 1,832 | 1,749
    Largest contig (bp) | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321
    % of total contigs | 100 | 95 | 94 | 87 | 79
    Whole-genome assembly
    No. of bp in scaffolds (including intrascaffold gaps) | 2,847,890,390 | 2,574,792,618 | 2,525,334,447 | 2,328,535,466 | 2,140,943,032
    No. of bp in contigs | 2,586,634,108 | 2,334,343,339 | 2,297,678,935 | 2,143,002,184 | 1,983,305,432
    No. of scaffolds | 118,968 | 2,507 | 1,637 | 818 | 554
    No. of contigs | 221,036 | 99,189 | 95,494 | 84,641 | 76,285
    No. of gaps | 102,068 | 96,682 | 93,857 | 83,823 | 75,731
    No. of gaps ≤1 kbp | 62,356 | 60,343 | 59,156 | 54,079 | 49,592
    Average scaffold size (bp) | 23,938 | 1,027,041 | 1,542,660 | 2,846,620 | 3,864,518
    Average contig size (bp) | 11,702 | 23,534 | 24,061 | 25,319 | 25,999
    Average intrascaffold gap size (bp) | 2,560 | 2,487 | 2,426 | 2,213 | 2,082
    Largest contig (bp) | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073
    % of total contigs | 100 | 90 | 89 | 83 | 77
  • Table 4

    Summary of scaffold mapping. Scaffolds were mapped to the genome with different levels of confidence (anchored scaffolds have the highest confidence; unmapped scaffolds have the lowest). Anchored scaffolds were consistently ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently ordered by at least one of the following: the WashU BAC map, GM99, or component tiling path. Bounded scaffolds had order conflicts between at least two of the external maps, but their placements were adjacent to a neighboring anchored or ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The scaffold subcategories are given below each category.

    Mapped scaffold category | Number | Length (bp) | % Total length
    Anchored | 1,526 | 1,860,676,676 | 70
      Oriented | 1,246 | 1,852,088,645 | 70
      Unoriented | 280 | 8,588,031 | 0.3
    Ordered | 2,001 | 369,235,857 | 14
      Oriented | 839 | 329,633,166 | 12
      Unoriented | 1,162 | 39,602,691 | 2
    Bounded | 38,241 | 368,753,463 | 14
      Oriented | 7,453 | 274,536,424 | 10
      Unoriented | 30,788 | 94,217,039 | 4
    Unmapped | 11,823 | 55,313,737 | 2
      Known chromosome | 281 | 2,505,844 | 0.1
      Unknown chromosome | 11,542 | 52,807,893 | 2
  • Table 5

    Mate-pair validation. Celera fragment sequences were mapped to the published sequence of chromosome 21. Each mate pair uniquely mapped was evaluated for correct orientation and placement (number of mate pairs tested). If the two mates had incorrect relative orientation or placement, they were considered invalid (number of invalid mate pairs).

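    Library type | Library no. | Chromosome 21: Mean insert size (bp) | SD (bp) | SD/mean (%) | No. of mate pairs tested | No. of invalid mate pairs | % invalid | Genome: Mean insert size (bp) | SD (bp) | SD/mean (%)
    2 kbp | 1 | 2,081 | 106 | 5.1 | 3,642 | 38 | 1.0 | 2,082 | 90 | 4.3
    2 kbp | 2 | 1,913 | 152 | 7.9 | 28,029 | 413 | 1.5 | 1,923 | 118 | 6.1
    2 kbp | 3 | 2,166 | 175 | 8.1 | 4,405 | 57 | 1.3 | 2,162 | 158 | 7.3
    10 kbp | 4 | 11,385 | 851 | 7.5 | 4,319 | 80 | 1.9 | 11,370 | 696 | 6.1
    10 kbp | 5 | 14,523 | 1,875 | 12.9 | 7,355 | 156 | 2.1 | 14,142 | 1,402 | 9.9
    10 kbp | 6 | 9,635 | 1,035 | 10.7 | 5,573 | 109 | 2.0 | 9,606 | 934 | 9.7
    10 kbp | 7 | 10,223 | 928 | 9.1 | 34,079 | 399 | 1.2 | 10,190 | 777 | 7.6
    50 kbp | 8 | 64,888 | 2,747 | 4.2 | 16 | 1 | 6.3 | 65,500 | 5,504 | 8.4
    50 kbp | 9 | 53,410 | 5,834 | 10.9 | 914 | 170 | 18.6 | 53,311 | 5,546 | 10.4
    50 kbp | 10 | 52,034 | 7,312 | 14.1 | 5,871 | 569 | 9.7 | 51,498 | 6,588 | 12.8
    50 kbp | 11 | 52,282 | 7,454 | 14.3 | 2,629 | 213 | 8.1 | 52,282 | 7,454 | 14.3
    50 kbp | 12 | 46,616 | 7,378 | 15.8 | 2,153 | 215 | 10.0 | 45,418 | 9,068 | 20.0
    50 kbp | 13 | 55,788 | 10,099 | 18.1 | 2,244 | 249 | 11.1 | 53,062 | 10,893 | 20.5
    50 kbp | 14 | 39,894 | 5,019 | 12.6 | 199 | 7 | 3.5 | 36,838 | 9,988 | 27.1
    BES | 15 | 48,931 | 9,813 | 20.1 | 144 | 10 | 6.9 | 47,845 | 4,774 | 10.0
    BES | 16 | 48,130 | 4,232 | 8.8 | 195 | 14 | 7.2 | 47,924 | 4,581 | 9.6
    BES | 17 | 106,027 | 27,778 | 26.2 | 330 | 16 | 4.8 | 152,000 | 26,600 | 17.5
    BES | 18 | 160,575 | 54,973 | 34.2 | 155 | 8 | 5.2 | 161,750 | 27,000 | 16.7
    BES | 19 | 164,155 | 19,453 | 11.9 | 642 | 44 | 6.9 | 176,500 | 19,500 | 11.05
    Sum | | | | | 102,894 | 2,768 | 2.7 (mean = 2.7) | | |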
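
    The mate-pair evaluation described in the Table 5 caption can be sketched as follows. The caption states only that mates with incorrect relative orientation or placement were counted as invalid; the three-standard-deviation tolerance, the "+"/"-" strand encoding, the use of mate start coordinates as a proxy for insert size, and all function names are assumptions made for this Python example.

```python
def classify_mate_pair(pos1, strand1, pos2, strand2, mean, sd, n_sd=3.0):
    """Classify a mapped mate pair as valid, mis-oriented, or mis-separated.

    Assumptions (not from the source): mates should face each other, i.e. the
    leftmost mate maps to "+" and the rightmost to "-", and the observed
    separation must lie within n_sd standard deviations of the library mean.
    """
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if not (left_strand == "+" and right_strand == "-"):
        return "mis-oriented"
    if abs((right_pos - left_pos) - mean) > n_sd * sd:
        return "mis-separated"
    return "valid"


def percent_invalid(n_invalid, n_tested):
    """Percentage of tested mate pairs that were judged invalid."""
    return 100.0 * n_invalid / n_tested


if __name__ == "__main__":
    # Library 1 on chromosome 21: mean 2,081 bp, SD 106 bp (Table 5).
    print(classify_mate_pair(1000, "+", 3081, "-", mean=2081, sd=106))
    print(classify_mate_pair(1000, "+", 3081, "+", mean=2081, sd=106))
    print(classify_mate_pair(1000, "+", 9000, "-", mean=2081, sd=106))
    # Totals from the Sum row of Table 5.
    print(round(percent_invalid(2768, 102894), 1))  # 2.7
```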
  • Table 6

    Genome-wide mate pair analysis of compartmentalized shotgun (CSA) and PFP assemblies.*

    Genome library | CSA: % valid | % mis-oriented | % mis-separated | PFP: % valid | % mis-oriented | % mis-separated
    2 kbp | 98.5 | 0.6 | 1.0 | 95.7 | 2.0 | 2.3
    10 kbp | 96.7 | 1.0 | 2.3 | 81.9 | 9.6 | 8.6
    50 kbp | 93.9 | 4.5 | 1.5 | 64.2 | 22.3 | 13.5
    BES | 94.1 | 2.1 | 3.8 | 62.0 | 19.3 | 18.8
    Mean | 97.4 | 1.0 | 1.6 | 87.3 | 6.8 | 5.9
  • Table 7

    Sensitivity and specificity of Otto and Genscan. Sensitivity and specificity were calculated by first aligning the prediction to the published RefSeq transcript, tallying the number (N) of uniquely aligned RefSeq bases. Sensitivity is the ratio of N to the length of the published RefSeq transcript. Specificity is the ratio of N to the length of the prediction. All differences are significant (Tukey HSD; P < 0.001).

    Method | Sensitivity | Specificity
    Otto (RefSeq only)* | 0.939 | 0.973
    Otto (homology)† | 0.604 | 0.884
    Genscan | 0.501 | 0.633
    • * Refers to those annotations produced by Otto using only the Sim4-polished RefSeq alignment rather than an evidence-based Genscan prediction.

    • † Refers to those annotations produced by supplying all available evidence to Genscan.
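
    The two ratios defined in the Table 7 caption can be written out directly. In this Python sketch, `n_aligned` stands for N, the number of uniquely aligned RefSeq bases; the function name and the example lengths are illustrative and are not taken from the paper.

```python
def sensitivity_specificity(n_aligned, refseq_len, prediction_len):
    """Return (sensitivity, specificity) as defined in the Table 7 caption.

    sensitivity = N / length of the published RefSeq transcript
    specificity = N / length of the prediction
    """
    if refseq_len <= 0 or prediction_len <= 0:
        raise ValueError("transcript and prediction lengths must be positive")
    return n_aligned / refseq_len, n_aligned / prediction_len


if __name__ == "__main__":
    # Made-up example: 1,500 aligned bases, a 2,000-base RefSeq transcript,
    # and a 1,600-base prediction.
    sens, spec = sensitivity_specificity(1500, 2000, 1600)
    print(sens, spec)  # 0.75 0.9375
```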

  • Table 8

    Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

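    Method | Statistic | Total | Types of evidence: Mouse | Rodent | Protein | Human | No. of lines of evidence*: ≥1 | ≥2 | ≥3 | ≥4
    Otto | Number of transcripts | 17,969 | 17,065 | 14,881 | 15,477 | 16,374 | 17,968 | 17,501 | 15,877 | 12,451
    Otto | Number of exons | 141,218 | 111,174 | 89,569 | 108,431 | 118,869 | 140,710 | 127,955 | 99,574 | 59,804
    De novo | Number of transcripts | 58,032 | 14,463 | 5,094 | 8,043 | 9,220 | 21,350 | 8,619 | 4,947 | 1,904
    De novo | Number of exons | 319,935 | 48,594 | 19,344 | 26,264 | 40,104 | 79,148 | 31,130 | 17,508 | 6,520
    Otto | No. of exons per transcript | 7.84 | 5.77 | 6.01 | 6.99 | 7.24 | 7.81 | 7.19 | 6.00 | 4.28
    De novo | No. of exons per transcript | 5.53 | 3.17 | 3.80 | 3.27 | 4.36 | 3.7 | 3.56 | 3.42 | 3.16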
    • * Four kinds of evidence (conservation in 3× mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript.

    • This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.

  • Table 9

    Characteristics of G+C in isochores.

    Isochore | G+C (%) | Fraction of genome: Predicted* | Observed | Fraction of genes: Predicted* | Observed
    H3 | >48 | 5 | 9.5 | 37 | 24.8
    H1/H2 | 43–48 | 25 | 21.2 | 32 | 26.6
    L | <43 | 67 | 69.2 | 31 | 48.5
    • * The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.

  • Table 10

    Features of the chromosomes. De novo/any refers to the union of de novo predictions that do not overlap Otto predictions and have at least one other type of supporting evidence; de novo/2x refers to the union of de novo predictions that do not overlap Otto predictions and have at least two types of evidence. Deserts are regions of sequence with no annotated genes.

    Column groups: Sequence coverage (CS assembly) | Base composition | Gene prediction* | Gene density (genes/Mbp)
    Chr. | Size (Mbp) | No. of scaffolds | Largest scaffold (Mbp) | No. of scaffolds >500 kbp | Sequence covered by scaffolds >500 kbp | % of total sequence in scaffolds >500 kbp | % repeat | % GC | No. of CpG islands | Otto | De novo/any | De novo/2× | Total (Otto + de novo/any) | Total (Otto + de novo/2×) | Sequence in deserts >500 kbp | Sequence in deserts >1 Mbp | Otto | De novo/any | De novo/2× | Otto + de novo/any | Otto + de novo/2×
     12202,54911821928837422,3351,7431,7107103,4532,4532968831611
     22403,26313782179136401,7031,1831,7716332,9541,8165519572127
     32003,5327781738737401,2711,0131,4145982,4271,6115012573128
     41862,18010701699137381,0816961,1654491,8611,1455518462106
     51823,23111631638937401,3028921,2444742,1361,3664615572117
     61721,71313581609337401,3849431,3145242,2571,467389673138
     71461,32614531308938401,4067591,0724601,8311,2192612573128
     81461,77211541359236409485839773571,560940336472116
     91131,6168401018938411,3156898483291,5371,018229673138
    101302,0059551168936421,0876859683421,6531,027218572127
    111322,8149441168839421,4611,0511,1345352,1851,5862798841612
    121342,6148511178738411,1319259364171,8611,3422497731410
    13991,0381334919136386443416912411,0325823116472105
    14875761116839540419135837002901,28387334207831410
    15801,747831708737427225586402461,198804817831510
    16751,520827628240441,5337486732471,42199513310931912
    17781,683640617839451,4898976483131,5451,21015612841915
    18791,3331318729236405102835431898264722110472106
    19582,282331386757492,8041,1415342681,6751,4093020942923
    2061580141758944144997517469180986697718731611
    213335810632963841519184265102449286159683138
    22363331112328844481,1734943411478356413014942317
    X1281,346491937346397266058603871,465992298563117
    Y196382101265503965551554921010442382115
    U* 7511,5421479196278132474328
    Total290753,5911,0592,49028,51917,76421,3508,61939,11426,383606208
    Avg.1162,1449441048740411,1607148123331,5261,047259773149
    • * Chromosomal assignment unknown.

  • Table 11

    Genome overview.

    Size of the genome (including gaps) | 2.91 Gbp
    Size of the genome (excluding gaps) | 2.66 Gbp
    Longest contig | 1.99 Mbp
    Longest scaffold | 14.4 Mbp
    Percent of A+T in the genome | 54
    Percent of G+C in the genome | 38
    Percent of undetermined bases in the genome | 9
    Most GC-rich 50 kb | Chr. 2 (66%)
    Least GC-rich 50 kb | Chr. X (25%)
    Percent of genome classified as repeats | 35
    Number of annotated genes | 26,383
    Percent of annotated genes with unknown function | 42
    Number of genes (hypothetical and annotated) | 39,114
    Percent of hypothetical and annotated genes with unknown function | 59
    Gene with the most exons | Titin (234 exons)
    Average gene size | 27 kbp
    Most gene-rich chromosome | Chr. 19 (23 genes/Mb)
    Least gene-rich chromosomes | Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
    Total size of gene deserts (>500 kb with no annotated genes) | 605 Mbp
    Percent of base pairs spanned by genes | 25.5 to 37.8*
    Percent of base pairs spanned by exons | 1.1 to 1.4*
    Percent of base pairs spanned by introns | 24.4 to 36.4*
    Percent of base pairs in intergenic DNA | 74.5 to 63.6*
    Chromosome with highest proportion of DNA in annotated exons | Chr. 19 (9.33)
    Chromosome with lowest proportion of DNA in annotated exons | Chr. Y (0.36)
    Longest intergenic region (between annotated + hypothetical genes) | Chr. 13 (3,038,416 bp)
    Rate of SNP variation | 1/1250 bp
    • * In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical + annotated gene set (39,114 genes), respectively.

  • Table 12

    Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated in 3-Mb windows for each chromosome. NA, not applicable.

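    Chrom. | Male: Max. | Avg. | Min. | Sex-average: Max. | Avg. | Min. | Female: Max. | Avg. | Min.
    1 | 2.60 | 1.12 | 0.23 | 2.81 | 1.42 | 0.52 | 3.39 | 1.76 | 0.68
    2 | 2.23 | 0.78 | 0.33 | 2.65 | 1.12 | 0.54 | 3.17 | 1.40 | 0.61
    3 | 2.55 | 0.86 | 0.23 | 2.40 | 1.07 | 0.42 | 2.71 | 1.30 | 0.33
    4 | 1.66 | 0.67 | 0.15 | 2.06 | 1.04 | 0.60 | 2.50 | 1.40 | 0.77
    5 | 2.00 | 0.67 | 0.18 | 1.87 | 1.08 | 0.42 | 2.26 | 1.43 | 0.62
    6 | 1.97 | 0.71 | 0.28 | 2.57 | 1.12 | 0.37 | 3.47 | 1.67 | 0.64
    7 | 2.34 | 1.16 | 0.48 | 1.67 | 1.17 | 0.47 | 2.27 | 1.21 | 0.34
    8 | 1.83 | 0.73 | 0.14 | 2.40 | 1.05 | 0.46 | 3.44 | 1.36 | 0.43
    9 | 2.01 | 0.99 | 0.53 | 1.95 | 1.32 | 0.77 | 2.63 | 1.66 | 0.82
    10 | 3.73 | 1.03 | 0.22 | 3.05 | 1.29 | 0.66 | 2.84 | 1.51 | 0.76
    11 | 1.43 | 0.72 | 0.31 | 2.13 | 0.99 | 0.47 | 3.10 | 1.32 | 0.49
    12 | 4.12 | 0.76 | 0.26 | 3.35 | 1.16 | 0.49 | 2.93 | 1.55 | 0.59
    13 | 1.60 | 0.75 | 0.01 | 1.87 | 0.95 | 0.17 | 2.49 | 1.19 | 0.32
    14 | 3.15 | 0.98 | 0.18 | 2.65 | 1.30 | 0.62 | 3.14 | 1.63 | 0.75
    15 | 2.28 | 0.94 | 0.34 | 2.31 | 1.22 | 0.42 | 2.53 | 1.56 | 0.54
    16 | 1.83 | 1.00 | 0.47 | 2.70 | 1.55 | 0.63 | 4.99 | 2.32 | 1.12
    17 | 3.87 | 0.87 | 0.00 | 3.54 | 1.35 | 0.54 | 4.19 | 1.83 | 0.94
    18 | 3.12 | 1.37 | 0.86 | 3.75 | 1.66 | 0.43 | 4.35 | 2.24 | 0.72
    19 | 3.02 | 0.97 | 0.10 | 2.57 | 1.41 | 0.49 | 2.89 | 1.75 | 0.87
    20 | 3.64 | 0.89 | 0.00 | 2.79 | 1.50 | 0.83 | 3.31 | 2.15 | 1.34
    21 | 3.23 | 1.26 | 0.69 | 2.37 | 1.62 | 1.08 | 2.58 | 1.90 | 1.18
    22 | 1.25 | 1.10 | 0.84 | 1.88 | 1.41 | 1.08 | 3.73 | 2.08 | 0.93
    X | NA | NA | NA | NA | NA | NA | 3.12 | 1.64 | 0.72
    Y | NA | NA | NA | NA | NA | NA | NA | NA | NA
    Genome | 4.12 | 0.88 | 0.00 | 3.75 | 1.22 | 0.17 | 4.99 | 1.55 | 0.32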
  • Table 13

    Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

    Characteristic | Chromosome 22: Method 1 | Method 2 | Whole genome (CS assembly): Method 1 | Method 2
    Number of CpG islands detected | 5,211 | 522 | 195,706 | 26,876
    Average length of island (bp) | 390 | 535 | 395 | 497
    Percent of sequence predicted as CpG | 5.9 | 0.8 | 2.6 | 0.4
    Percent of first exons that overlap a CpG island | 44 | 25 | 42 | 22
    Percent of first exons with first position of exon contained inside a CpG island | 37 | 22 | 40 | 21
    Average distance between first exon and closest CpG island (bp) | 1,013 | 10,486 | 2,182 | 17,021
    Expected distance between first exon and closest CpG island (bp) | 3,262 | 32,567 | 7,164 | 55,811
  • Table 14

    Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

    Repetitive elements | Megabases in assembled sequences | Percent of assembly | Previously predicted (%) (83)
    Alu | 288 | 9.9 | 10.0
    Mammalian interspersed repeat (MIR) | 66 | 2.3 | 1.7
    Medium reiteration (MER) | 50 | 1.7 | 1.6
    Long terminal repeat (LTR) | 155 | 5.3 | 5.6
    Long interspersed nucleotide element (LINE) | 466 | 16.1 | 16.7
    Total | 1025 | 35.3 | 35.6
  • Table 15

    Overlap of SNPs from genome-wide SNP databases. Table entries are SNP counts for each pair of data sets. Numbers in parentheses are the fraction of overlap, calculated as the count of overlapping SNPs divided by the number of SNPs in the smaller of the two databases compared. Total SNP counts for the databases are: Celera-PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032. Only unique SNPs in the TSC and Kwok data sets were included.

    Data set | TSC | Kwok
    Celera-PFP | 188,694 (0.322) | 158,532 (0.362)
    TSC | | 72,024 (0.164)
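
    The overlap fractions in Table 15 follow from the definition in the caption: the count of shared SNPs divided by the size of the smaller of the two data sets. A short Python check using only the counts given in the caption and table (the function and dictionary names are illustrative):

```python
# Total SNP counts per database, from the Table 15 caption.
DATABASE_SIZES = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Overlapping SNP counts for each pair of data sets, from Table 15.
PAIR_OVERLAPS = {
    ("Celera-PFP", "TSC"): 188_694,
    ("Celera-PFP", "Kwok"): 158_532,
    ("TSC", "Kwok"): 72_024,
}


def overlap_fraction(n_overlap, size_a, size_b):
    """Shared SNPs divided by the size of the smaller database."""
    return n_overlap / min(size_a, size_b)


if __name__ == "__main__":
    for (a, b), shared in PAIR_OVERLAPS.items():
        fraction = overlap_fraction(shared, DATABASE_SIZES[a], DATABASE_SIZES[b])
        print(a, b, round(fraction, 3))
```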
  • Table 16

    Summary of nucleotide changes in different SNP data sets.

    SNP data set | A/G (%) | C/T (%) | A/C (%) | A/T (%) | C/G (%) | T/G (%) | Transition:transversion
    Celera-PFP | 30.7 | 30.7 | 10.3 | 8.6 | 9.2 | 10.3 | 1.59:1
    Kwok* | 33.7 | 33.8 | 8.5 | 7.0 | 8.6 | 8.4 | 2.07:1
    TSC | 33.3 | 33.4 | 8.8 | 7.3 | 8.6 | 8.6 | 1.99:1
    • * November 2000 release of the NCBI database dbSNP (www.nci.nlm.nih.gov/SNP/) with the method defined as Overlap SnpDetectionWithPolyBayes. The submitter of the data is Pui-Yan Kwok from Washington University.

    • November 2000 release of NCBI dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the methods defined as TSC-Sanger, TSC-WICGR, and TSC-WUGSC. The submitter of the data is Lincoln Stein from Cold Spring Harbor Laboratory.
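
    The transition:transversion column in Table 16 can be approximately recomputed from the percentage columns, since A/G and C/T changes are transitions and the other four classes are transversions. The Python sketch below does this; because the published percentages are rounded, the recomputed ratios agree with the table only to within about 0.02, and the names used are illustrative.

```python
# Percentages per substitution class, from Table 16.
SUBSTITUTION_PERCENT = {
    "Celera-PFP": {"A/G": 30.7, "C/T": 30.7, "A/C": 10.3, "A/T": 8.6, "C/G": 9.2, "T/G": 10.3},
    "Kwok": {"A/G": 33.7, "C/T": 33.8, "A/C": 8.5, "A/T": 7.0, "C/G": 8.6, "T/G": 8.4},
    "TSC": {"A/G": 33.3, "C/T": 33.4, "A/C": 8.8, "A/T": 7.3, "C/G": 8.6, "T/G": 8.6},
}

TRANSITIONS = ("A/G", "C/T")  # purine<->purine and pyrimidine<->pyrimidine


def transition_transversion_ratio(percent_by_class):
    """Ratio of transition percentage to transversion percentage."""
    transitions = sum(percent_by_class[c] for c in TRANSITIONS)
    transversions = sum(
        value for c, value in percent_by_class.items() if c not in TRANSITIONS
    )
    return transitions / transversions


if __name__ == "__main__":
    for name, percents in SUBSTITUTION_PERCENT.items():
        print(name, round(transition_transversion_ratio(percents), 2))
```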

  • Table 17

    Distribution of SNPs in classes of genomic regions.

    Genomic region class | Size of region examined (Mb) | Celera-PFP SNP density (SNP/Mb)
    Intergenic | 2185 | 707
    Gene (intron + exon) | 646 | 917
    Intron | 615 | 921
    First intron | 164 | 808
    Exon | 31 | 529
    First exon | 10 | 592
  • Table 18

    Domain-based comparative analysis of proteins in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The predicted protein set of each of the above eukaryotic organisms was analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Domains were categorized into cellular processes for presentation. Some domains (e.g., SH2) are listed in more than one cellular process. Results of the Pfam analysis may differ from results obtained based on human curation of protein families, owing to the limitations of large-scale automatic classifications. Representative examples of domains with reduced counts owing to the stringent E value cutoff used for this analysis are marked with a double asterisk (**). Examples include short divergent and predominantly alpha-helical domains, and certain classes of cysteine-rich zinc finger proteins.

    Accession number | Domain name | Domain description | H | F | W | Y | A
    Developmental and homeostatic regulators
    PF02039 | Adrenomedullin | Adrenomedullin | 1 | 0 | 0 | 0 | 0
    PF00212 | ANP | Atrial natriuretic peptide | 2 | 0 | 0 | 0 | 0
    PF00028 | Cadherin | Cadherin domain | 100 (550) | 14 (157) | 16 (66) | 0 | 0
    PF00214 | Calc_CGRP_IAPP | Calcitonin/CGRP/IAPP family | 3 | 0 | 0 | 0 | 0
    PF01110 | CNTF | Ciliary neurotrophic factor | 1 | 0 | 0 | 0 | 0
    PF01093 | Clusterin | Clusterin | 3 | 0 | 0 | 0 | 0
    PF00029 | Connexin | Connexin | 14 (16) | 0 | 0 | 0 | 0
    PF00976 | ACTH_domain | Corticotropin ACTH domain | 1 | 0 | 0 | 0 | 0
    PF00473 | CRF | Corticotropin-releasing factor family | 2 | 1 | 0 | 0 | 0
    PF00007 | Cys_knot | Cystine-knot domain | 10 (11) | 2 | 0 | 0 | 0
    PF00778 | DIX | Dix domain | 5 | 2 | 4 | 0 | 0
    PF00322 | Endothelin | Endothelin family | 3 | 0 | 0 | 0 | 0
    PF00812 | Ephrin | Ephrin | 7 (8) | 2 | 4 | 0 | 0
    PF01404 | EPh_Ibd | Ephrin receptor ligand binding domain | 12 | 2 | 1 | 0 | 0
    PF00167 | FGF | Fibroblast growth factor | 23 | 1 | 1 | 0 | 0
    PF01534 | Frizzled | Frizzled/Smoothened family membrane region | 9 | 7 | 3 | 0 | 0
    PF00236 | Hormone6 | Glycoprotein hormones | 1 | 0 | 0 | 0 | 0
    PF01153 | Glypican | Glypican | 14 | 2 | 1 | 0 | 0
    PF01271 | Granin | Granin (chromogranin or secretogranin) | 3 | 0 | 0 | 0 | 0
    PF02058 | Guanylin | Guanylin precursor | 1 | 0 | 0 | 0 | 0
    PF00049 | Insulin | Insulin/IGF/Relaxin family | 7 | 4 | 0 | 0 | 0
    PF00219 | IGFBP | Insulin-like growth factor binding proteins | 10 | 0 | 0 | 0 | 0
    PF02024 | Leptin | Leptin | 1 | 0 | 0 | 0 | 0
    PF00193 | Xlink | LINK (hyaluron binding) | 13 (23) | 0 | 1 | 0 | 0
    PF00243 | NGF | Nerve growth factor family | 3 | 0 | 0 | 0 | 0
    PF02158 | Neuregulin | Neuregulin family | 4 | 0 | 0 | 0 | 0
    PF00184 | Hormone5 | Neurohypophysial hormones | 1 | 0 | 0 | 0 | 0
    PF02070 | NMU | Neuromedin U | 1 | 0 | 0 | 0 | 0
    PF00066 | Notch | Notch (DSL) domain | 3 (5) | 2 (4) | 2 (6) | 0 | 0
    PF00865 | Osteopontin | Osteopontin | 1 | 0 | 0 | 0 | 0
    PF00159 | Hormone3 | Pancreatic hormone peptides | 3 | 0 | 0 | 0 | 0
    PF01279 | Parathyroid | Parathyroid hormone family | 2 | 0 | 0 | 0 | 0
    PF00123 | Hormone2 | Peptide hormone | 5 (9) | 0 | 0 | 0 | 0
    PF00341 | PDGF | Platelet-derived growth factor (PDGF) | 5 | 1 | 0 | 0 | 0
    PF01403 | Sema | Sema domain | 27 (29) | 8 (10) | 3 (4) | 0 | 0
    PF01033 | Somatomedin_B | Somatomedin B domain | 5 (8) | 3 | 0 | 0 | 0
    PF00103 | Hormone | Somatotropin | 1 | 0 | 0 | 0 | 0
    PF02208 | Sorb | Sorbin homologous domain | 2 | 0 | 0 | 0 | 0
    PF02404 | SCF | Stem cell factor | 2 | 0 | 0 | 0 | 0
    PF01034 | Syndecan | Syndecan domain | 3 | 1 | 1 | 0 | 0
    PF00020 | TNFR_c6 | TNFR/NGFR cysteine-rich region | 17 (31) | 1 | 0 | 0 | 0
    PF00019 | TGF-β | Transforming growth factor β-like domain | 27 (28) | 6 | 4 | 0 | 0
    PF01099 | Uteroglobin | Uteroglobin family | 3 | 0 | 0 | 0 | 0
    PF01160 | Opiods_neuropep | Vertebrate endogenous opioids neuropeptide | 3 | 0 | 0 | 0 | 0
    PF00110 | Wnt | Wnt family of developmental signaling proteins | 18 | 7 (10) | 5 | 0 | 0
    Hemostasis
    PF01821ANATOAnaphylotoxin-like domain6 (14)0000
    PF00386C1qC1q domain240000
    PF00200DisintegrinDisintegrin182300
    PF00754F5_F8_type_CF5/8 type C domain15 (20)5 (6)200
    PF01410COLFIFibrillar collagen C-terminal domain100000
    PF00039Fn1Fibronectin type I domain5 (18)0000
    PF00040Fn2Fibronectin type II domain11 (16)0000
    PF00051KringleKringle domain15 (24)2200
    PF01823MACPFMAC/Perforin domain60000
    PF00354PentaxinPentaxin family90000
    PF00277SAA_proteinsSerum amyloid A protein40000
    PF00084SushiSushi domain (SCR repeat)53 (191)11 (42)8 (45)00
    PF02210TSPNThrombospondin N-terminal–like domains141000
    PF01108Tissue_facTissue factor10000
    PF00868Transglutamin_NTransglutaminase family61000
    PF00927Transglutamin_CTransglutaminase family81000
    PF00594GlaVitamin K-dependent carboxylation/gamma- carboxyglutamic (GLA) domain110000
    Immune response
    PF00711Defensin_betaBeta defensin10000
    PF00748Calpain_inhibCalpain inhibitor repeat3 (9)0000
    PF00666CathelicidinsCathelicidins20000
    PF00129MHC_IClass I histocompatibility antigen, domains alpha 1 and 218 (20)0000
    PF00993MHC_II_alpha** Class II histocompatibility antigen, alpha domain5 (6)0000
    PF00969MHC_II_beta** Class II histocompatibility antigen, beta domain70000
    PF00879Defensin_propepDefensin propeptide30000
    PF01109GM_CSFGranulocyte-macrophage colony-stimulating factor10000
    PF00047IgImmunoglobulin domain381 (930)125 (291)67 (323)00
    PF00143InterferonInterferon alpha/beta domain7 (9)0000
    PF00714IFN-gammaInterferon gamma10000
    PF00726IL10Interleukin-1010000
    PF02372IL15Interleukin-1510000
    PF00715IL2Interleukin-210000
    PF00727IL4Interleukin-410000
    PF02025IL5Interleukin-510000
    PF01415IL7Interleukin-7/9 family10000
    PF00340IL1Interleukin-170000
    PF02394IL1_propepInterleukin-1 propeptide10000
    PF02059IL3Interleukin-310000
    PF00489IL6Interleukin-6/G-CSF/MGF family20000
    PF01291LIF_OSMLeukemia inhibitory factor (LIF)/oncostatin (OSM) family20000
    PF00323DefensinsMammalian defensin20000
    PF01091PTN_MKPTN/MK heparin-binding protein20000
    PF00277SAA_proteinsSerum amyloid A protein40000
    PF00048IL8Small cytokines (intercrine/chemokine), interleukin-8 like320000
    PF01582TIRTIR domain18820131 (143)
    PF00229TNFTNF (tumor necrosis factor) family120000
    PF00088TrefoilTrefoil (P-type) domain5 (6)0200
    PI-PY-rho GTPase signaling
    PF00779BTKBTK motif51000
    PF00168C2C2 domain73 (101)32 (44)24 (35)6 (9)66 (90)
    PF00609DAGKaDiacylglycerol kinase accessory domain (presumed)94706
    PF00781DAGKcDiacylglycerol kinase catalytic domain (presumed)1088211 (12)
    PF00610DEPDomain found in Dishevelled, Egl-10, and Pleckstrin (DEP)12 (13)41052
    PF01363FYVEFYVE zinc finger28 (30)1415515
    PF00996GDIGDP dissociation inhibitor62113
    PF00503G-alphaG-protein alpha subunit27 (30)1020 (23)25
    PF00631G-gammaG-protein gamma like domains165510
    PF00616RasGAPGTPase-activator protein for Ras-like GTPase115830
    PF00618RasGEFNGuanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif92350
    PF00625Guanylate_kinGuanylate kinase128714
    PF02189ITAMImmunoreceptor tyrosine-based activation motif30000
    PF00169PHPH domain193 (212)72 (78)65 (68)2423
    PF00130DAG_PE-bindPhorbol esters/diacylglycerol binding domain (C1 domain)45 (56)25 (31)26 (40)1 (2)4
    PF00388PI-PLC-XPhosphatidylinositol-specific phospholipase C, X domain123718
    PF00387PI-PLC-YPhosphatidylinositol-specific phospholipase C, Y domain112718
    PF00640PIDPhosphotyrosine interaction domain (PTB/PID)24 (27)1311 (12)00
    PF02192PI3K_p85BPI3-kinase family, p85-binding domain21100
    PF00794PI3K_rbdPI3-kinase family, ras-binding domain63100
    PF01412ArfGAPPutative GTP-ase activating protein for Arf1698615
    PF02196RBDRaf-like Ras-binding domain6 (7)4100
    PF02145Rap_GAPRap/ran-GAP54200
    PF00788RARas association (RalGDS/AF-6) domain18 (19)7 (9)610
    PF00071RasRas family12656 (57)512378
    PF00617RasGEFRasGEF domain218750
    PF00615RGSRegulator of G protein signaling domain276 (7)12 (13)10
    PF02197RIIaRegulatory subunit of type II PKA R-subunit41210
    PF00620RhoGAPRhoGAP domain59192098
    PF00621RhoGEFRhoGEF domain4623 (24)18 (19)30
    PF00536SAMSAM domain (Sterile alpha motif)29 (31)15836
    PF01369Sec7Sec7 domain135559
    PF00017SH2Src homology 2 (SH2) domain87 (95)33 (39)44 (48)13
    PF00018SH3Src homology 3 (SH3) domain143 (182)55 (75)46 (61)23 (27)4
    PF01017STATSTAT protein711 (2)00
    PF00790VHSVHS domain42448
    PF00568WH1WH1 domain722 (3)10
    Domains involved in apoptosis
    PF00452Bcl-2Bcl-292100
    PF02180BH4Bcl-2 homology region 430100
    PF00619CARDCaspase recruitment domain160200
    PF00531DeathDeath domain165700
    PF01335DEDDeath effector domain4 (5)0000
    PF02179BAGDomain present in Hsp70 regulators5 (8)3215
    PF00656ICE_p20ICE-like protease (caspase) p20 domain117300
    PF00653BIRInhibitor of Apoptosis domain8 (14)5 (9)2 (3)1 (2)0
    Cytoskeletal
    PF00022ActinActin61 (64)15 (16)129 (11)24
    PF00191AnnexinAnnexin16 (55)4 (16)4 (11)06 (16)
    PF00402CalponinCalponin family13 (22)37 (19)00
    PF00373Band_41FERM domain (Band 4.1 family)29 (30)17 (19)11 (14)00
    PF00880Nebulin_repeatNebulin repeat4 (148)1 (2)100
    PF00681Plectin_repeatPlectin repeat2 (11)0000
    PF00435SpectrinSpectrin repeat31 (195)13 (171)10 (93)00
    PF00418Tubulin-bindingTau and MAP proteins, tubulin-binding4 (12)1 (4)2 (8)00
    PF00992TroponinTroponin46800
    PF02209VHPVillin headpiece domain52205
    PF01044VinculinVinculin family42100
    ECM adhesion
    PF01391CollagenCollagen triple helix repeat (20 copies)65 (279)10 (46)174 (384)00
    PF01413C4C-terminal tandem repeated domain in type 4 procollagen6 (11)2 (4)3 (6)00
    PF00431CUBCUB domain47 (69)9 (47)43 (67)00
    PF00008EGFEGF-like domain108 (420)45 (186)54 (157)01
    PF00147Fibrinogen_CFibrinogen beta and gamma chains, C-terminal globular domain2610 (11)600
    PF00041Fn3Fibronectin type III domain106 (545)42 (168)34 (156)01
    PF00757Furin-likeFurin-like cysteine rich region52100
    PF00357Integrin_AIntegrin alpha cytoplasmic region31200
    PF00362Integrin_BIntegrins, beta chain82200
    PF00052Laminin_BLaminin B (Domain IV)8 (12)4 (7)6 (10)00
    PF00053Laminin_EGFLaminin EGF-like (Domains III and V)24 (126)9 (62)11 (65)00
    PF00054Laminin_GLaminin G domain30 (57)18 (42)14 (26)00
    PF00055Laminin_NtermLaminin N-terminal (Domain VI)106400
    PF00059Lectin_cLectin C-type domain47 (76)23 (24)91 (132)00
    PF01463LRRCTLeucine rich repeat C-terminal domain69 (81)23 (30)7 (9)00
    PF01462LRRNTLeucine rich repeat N-terminal domain40 (44)7 (13)3 (6)00
    PF00057Ldl_recept_aLow-density lipoprotein receptor domain class A35 (127)33 (152)27 (113)00
    PF00058Ldl_recept_bLow-density lipoprotein receptor repeat class B15 (96)9 (56)7 (22)00
    PF00530SRCRScavenger receptor cysteine-rich domain11 (46)4 (8)1 (2)00
    PF00084SushiSushi domain (SCR repeat)53 (191)11 (42)8 (45)00
    PF00090Tsp_1Thrombospondin type 1 domain41 (66)11 (23)18 (47)00
    PF00092Vwavon Willebrand factor type A domain34 (58)017 (19)01
    PF00093Vwcvon Willebrand factor type C domain19 (28)6 (11)2 (5)00
    PF00094Vwdvon Willebrand factor type D domain15 (35)3 (7)900
    Protein interaction domains
    PF0024414-3-314-3-3 proteins2033215
    PF00023AnkAnk repeat145 (404)72 (269)75 (223)12 (20)66 (111)
    PF00514Armadillo_segArmadillo/beta-catenin-like repeats22 (56)11 (38)3 (11)2 (10)25 (67)
    PF00168C2C2 domain73 (101)32 (44)24 (35)6 (9)66 (90)
    PF00027cNMP_bindingCyclic nucleotide-binding domain26 (31)21 (33)15 (20)2 (3)22
    PF01556DnaJ_CDnaJ C terminal region1295319
    PF00226DnaJDnaJ domain4434332093
    PF00036Efhand** EF hand83 (151)64 (117)41 (86)4 (11)120 (328)
    PF00611FCHFes/CIP4 homology domain93240
    PF01846FFFF domain4 (11)4 (10)3 (16)2 (5)4 (8)
    PF00498FHAFHA domain1315713 (14)17
    PF00254FKBPFKBP-type peptidyl-prolyl cis-trans isomerases15 (20)7 (8)7 (13)424 (29)
    PF01590GAFGAF domain7 (8)2 (4)1010
    PF01344KelchKelch motif54 (157)12 (48)13 (41)3102 (178)
    PF00560LRR** Leucine Rich Repeat25 (30)24 (30)7 (11)115 (16)
    PF00917MATHMATH domain11588 (161)161 (74)
    PF00989PASPAS domain18 (19)9 (10)6113 (18)
    PF00595PDZPDZ domain (Also known as DHR or GLGF)96 (154)60 (87)46 (66)25
    PF00169PHPH domain193 (212)72 (78)65 (68)2423
    PF01535PPR** PPR repeat53 (4)01474 (2485)
    PF00536SAMSAM domain (Sterile alpha motif)29 (31)15836
    PF01369Sec7Sec7 domain135559
    PF00017SH2Src homology 2 (SH2) domain87 (95)33 (39)44 (48)13
    PF00018SH3Src homology 3 (SH3) domain143 (182)55 (75)46 (61)23 (27)4
    PF01740STASSTAS domain516213
    PF00515TPR** TPR domain72 (131)39 (101)28 (54)16 (31)65 (124)
    PF00400WD40** WD40 domain136 (305)98 (226)72 (153)56 (121)167 (344)
    PF00397WWWW domain32 (53)24 (39)16 (24)5 (8)11 (15)
    PF00569ZZZZ-Zinc finger present in dystrophin, CBP/p30010 (11)1310210
    Nuclear interaction domains
    PF01754Zf-A20A20-like zinc finger2 (8)2208
    PF01388ARIDARID DNA binding domain116427
    PF01426BAHBAH domain8 (10)7 (8)4 (5)521 (25)
    PF00643Zf-B_box** B-box zinc finger32 (35)1200
    PF00533BRCTBRCA1 C Terminus (BRCT) domain17 (28)10 (18)23 (35)10 (16)12 (16)
    PF00439BromodomainBromodomain37 (48)16 (22)18 (26)10 (15)28
    PF00651BTBBTB/POZ domain97 (98)62 (64)86 (91)1 (2)30 (31)
    PF00145DNA_methylaseC-5 cytosine-specific DNA methylase3 (4)10013 (15)
    PF00385Chromochromo' (CHRromatin Organization MOdifier) domain24 (27)14 (15)17 (18)1 (2)12
    PF00125HistoneCore histone H2A/H2B/H3/H475 (81)571 (73)848
    PF00134CyclinCyclin1910101135
    PF00270DEADDEAD/DEAH box helicase63 (66)48 (50)55 (57)50 (52)84 (87)
    PF01529Zf-DHHCDHHC zinc finger domain152016722
    PF00646F-box** F-box domain1615309 (324)9165 (167)
    PF00250Fork_headFork head domain35 (36)20 (21)1540
    PF00320GATAGATA zinc finger11 (17)5 (6)8 (10)926
    PF01585G-patchG-patch domain181613414 (15)
    PF00010HLH** Helix-loop-helix DNA-binding domain60 (61)4424439
    PF00850Hist_deacetylHistone deacetylase family125 (6)8 (10)510
    PF00046HomeoboxHomeobox domain160 (178)100 (103)82 (84)666
    PF01833TIGIPT/TIG domain29 (53)11 (13)5 (7)21
    PF02373JmjCJmjC domain104647
    PF02375JmjNJmjN domain74237
    PF00013KH-domainKH domain28 (67)14 (32)17 (46)4 (14)27 (61)
    PF01352KRABKRAB box204 (243)0000
    PF00104Hormone_recLigand-binding domain of nuclear hormone receptor4717142 (147)00
    PF00412LIMLIM domain containing proteins62 (129)33 (83)33 (79)4 (7)10 (16)
    PF00917MATHMATH domain11588 (161)161 (74)
    PF00249Myb_DNA-bindingMyb-like DNA-binding domain32 (43)18 (24)17 (24)15 (20)243 (401)
    PF02344Myc-LZMyc leucine zipper domain10000
    PF01753Zf-MYNDMYND finger1414917
    PF00628PHDPHD-finger68 (86)40 (53)32 (44)14 (15)96 (105)
    PF00157PouPou domain—N-terminal to homeobox domain155400
    PF02257RFX_DNA_bindingRFX DNA-binding domain72110
    PF00076RrmRNA recognition motif (a.k.a. RRM, RBD, or RNP domain)224 (324)127 (199)94 (145)43 (73)232 (369)
    PF02037SAPSAP domain 158556 (7)
    PF00622SPRYSPRY domain44 (51)10 (12)5 (7)36
    PF01852STARTSTART domain1026023
    PF00907T-boxT-box17 (19)82200
    PF02135Zf-TAZTAZ finger2 (3)1 (2)6 (7)010 (15)
    PF01285TEATEA domain41110
    PF02176Zf-TRAFTRAF-type zinc finger6 (9)1 (3)102
    PF00352TBPTranscription factor TFIID (or TATA-binding protein, TBP)2 (4)4 (8)2 (4)1 (2)2 (4)
    PF00567TUDORTUDOR domain9 (24)9 (19)4 (5)02
    PF00642Zf-CCCHZinc finger C-x8-C-x5-C-x3-H type (and similar)17 (22)6 (8)22 (42)3 (5)31 (46)
    PF00096Zf-C2H2** Zinc finger, C2H2 type564 (4500)234 (771)68 (155)34 (56)21 (24)
    PF00097Zf-C3HC4Zinc finger, C3HC4 type (RING finger)135 (137)5788 (89)18298 (304)
    PF00098Zf-CCHCZinc knuckle9 (17)6 (10)17 (33)7 (13)68 (91)
  • Table 19

    Number of proteins assigned to selected Panther families or subfamilies in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A).

    Panther family/subfamily*HFWYA
    Neural structure, function, development
    Ependymin10000
    Ion channels
    Acetylcholine receptor17125600
    Amiloride-sensitive/degenerin11242700
    CNG/EAG2299030
    IRK163300
    ITP/ryanodine102400
    Neurotransmitter-gated615159019
    P2X purinoceptor100000
    TASK12124815
    Transient receptor153310
    Voltage-gated Ca2+alpha224822
    Voltage-gated Ca2+alpha-2103200
    Voltage-gated Ca2+beta52200
    Voltage-gated Ca2+gamma10000
    Voltage-gated K+alpha3351100
    Voltage-gated KQT62300
    Voltage-gated Na+ 114491
    Myelin basic protein10000
    Myelin PO50000
    Myelin proteolipid31000
    Myelin-oligodendrocyte glycoprotein10000
    Neuropilin20000
    Plexin92000
    Semaphorin226200
    Synaptotagmin103300
    Immune response
    Defensin30000
    Cytokine 8614100
    GCSF10000
    GMCSF10000
    Intercrine alpha150000
    Intercrine beta50000
    Interferon80000
    Interleukin261100
    Leukemia inhibitory factor10000
    MCSF10000
    Peptidoglycan recognition protein213000
    Pre-B cell enhancing factor10000
    Small inducible cytokine A140000
    Sl cytokine20000
    TNF90000
    Cytokine receptor 621000
    Bradykinin/C-C chemokine receptor70000
    Fl cytokine receptor20000
    Interferon receptor30000
    Interleukin receptor320000
    Leukocyte tyrosine kinase receptor30000
    MCSF receptor10000
    TNF receptor30000
    Immunoglobulin receptor 590000
    T-cell receptor alpha chain160000
    T-cell receptor beta chain150000
    T-cell receptor gamma chain10000
    T-cell receptor delta chain10000
    Immunoglobulin FC receptor80000
    Killer cell receptor160000
    Polymeric-immunoglobulin receptor40000
    MHC class I220000
    MHC class II200000
    Other immunoglobulin 1140000
    Toll receptor–related106000
    Developmental and homeostatic regulators
    Signaling molecules
    Calcitonin30000
    Ephrin82400
    FGF241100
    Glucagon40000
    Glycoprotein hormone beta chain20000
    Insulin10000
    Insulin-like hormone30000
    Nerve growth factor30000
    Neuregulin/heregulin60000
    Neuropeptide Y40000
    PDGF11000
    Relaxin30000
    Stanniocalcin20000
    Thymopoietin20100
    Thymosin beta42000
    TGF-β296400
    VEGF40000
    Wnt186500
    Receptors
    Ephrin receptor122100
    FGF receptor44000
    Frizzled receptor126500
    Parathyroid hormone receptor20000
    VEGF receptor50000
    BDNF/NT-3 nerve growth factor receptor40000
    Kinases and phosphatases
    Dual-specificity protein phosphatase29810411
    S/T and dual-specificity protein kinase 3951983151141102
    S/T protein phosphatase1519511329
    Y protein kinase 10647100516
    Y protein phosphatase56229556
    Signal transduction
    ARF family5529271245
    Cyclic nucleotide phosphodiesterase258610
    G protein-coupled receptors†‡ 61614628401
    G-protein alpha27102225
    G-protein beta53211
    G-protein gamma132200
    Ras superfamily14164622686
    G-protein modulators
    ARF GTPase-activating2089515
    Neurofibromin72020
    Ras GTPase-activating93810
    Tuberin73200
    Vav proto-oncogene family35151330
    Transcription factors/chromatin organization
    C2H2 zinc finger–containing 60723279288
    COE71100
    CREB71200
    ETS-related2581000
    Forkhead-related34191540
    FOS82100
    Groucho132100
    Histone H150100
    Histone H2A24117313
    Histone H2B21117212
    Histone H328224216
    Histone H4911618
    Homeotic 16810474478
    ABD-B50000
    Bithoraxoid18100
    Iroquois class73100
    Distal-less52100
    Engrailed22100
    LIM-containing178300
    MEIS/KNOX class944226
    NK-3/NK-2 class94500
    Paired box38282302
    Six53400
    Leucine zipper60000
    Nuclear hormone receptor 592518314
    Pou-related155410
    Runt-related34200
    ECM adhesion
    Cadherin113171600
    Claudin200000
    Complement receptor-related228600
    Connexin140000
    Galectin1252200
    Glypican132100
    ICAM60000
    Integrin alpha247401
    Integrin beta92200
    LDL receptor family26192002
    Proteoglycans229705
    Apoptosis
    Bcl-2121000
    Calpain2241113
    Calpain inhibitor40001
    Caspase137300
    Hemostasis
    ADAM/ADAMTS5191200
    Fibronectin30000
    Globin102303
    Matrix metalloprotease192703
    Serum amyloid A40000
    Serum amyloid P (subfamily of Pentaxin)20000
    Serum paraoxonase/arylesterase40300
    Serum albumin40000
    Transglutaminase101000
    Other enzymes
    Cytochrome p4506089833256
    GAPDH463438
    Heparan sulfotransferase114200
    Splicing and translation
    EF-1alpha561310613
    Ribonucleoproteins 26913510460265
    Ribosomal proteins 81211180117256
    • * The table lists Panther families or subfamilies relevant to the text that either (i) are not specifically represented by Pfam (Table 18) or (ii) differ in counts from the corresponding Pfam models.

    • † This class represents a number of different families in the same Panther molecular function subcategory.

    • ‡ This count includes only rhodopsin-class, secretin-class, and metabotropic glutamate-class GPCRs.

Additional Files


    The Sequence of the Human Genome
    J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Huson, Jennifer Russo Wortman, Qing Zhang, Chinnappa D. Kodira, Xiangqun H. Zheng, Lin Chen, Marian Skupski, Gangadharan Subramanian, Paul D. Thomas, Jinghui Zhang, George L. Gabor Miklos, Catherine Nelson, Samuel Broder, Andrew G. Clark, Joe Nadeau, Victor A. McKusick, Norton Zinder, Arnold J. Levine, Richard J. Roberts, Mel Simon, Carolyn Slayman, Michael Hunkapiller, Randall Bolanos, Arthur Delcher, Ian Dew, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy, Clark Mobarry, Knut Reinert, Karin Remington, Jane Abu-Threideh, Ellen Beasley, Kendra Biddick, Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, Valentina Di Francesco, Patrick Dunn, Karen Eilbeck, Carlos Evangelista, Andrei E. Gabrielian, Weiniu Gan, Wangmao Ge, Fangcheng Gong, Zhiping Gu, Ping Guan, Thomas J. Heiman, Maureen E. Higgins, Rui-Ru Ji, Zhaoxi Ke, Karen A. Ketchum, Zhongwu Lai, Yiding Lei, Zhenya Li, Jiayin Li, Yong Liang, Xiaoying Lin, Fu Lu, Gennady V. Merkulov, Natalia Milshina, Helen M. Moore, Ashwinikumar K Naik, Vaibhav A. Narayan, Beena Neelam, Deborah Nusskern, Douglas B. Rusch, Steven Salzberg, Wei Shao, Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, Ron Wides, Chunlin Xiao, Chunhua Yan, Alison Yao, Jane Ye, Ming Zhan, Weiqing Zhang, Hongyu Zhang, Qi Zhao, Liansheng Zheng, Fei Zhong, Wenyan Zhong, Shiaoping C. Zhu, Shaying Zhao, Dennis Gilbert, Suzanna Baumhueter, Gene Spier, Christine Carter, Anibal Cravchik, Trevor Woodage, Feroze Ali, Huijin An, Aderonke Awe, Danita Baldwin, Holly Baden, Mary Barnstead, Ian Barrow, Karen Beeson, Dana Busam, Amy Carver, Angela Center, Ming Lai Cheng, Liz Curry, Steve Danaher, Lionel Davenport, Raymond Desilets, Susanne Dietz, Kristina Dodson, Lisa Doup, Steven Ferriera, Neha Garg, Andres Gluecksmann, Brit Hart, Jason Haynes, Charles Haynes, Cheryl Heiner, Suzanne Hladun, Damon Hostin, Jarrett Houck, Timothy Howland, Chinyere Ibegwam, Jeffery Johnson, Francis Kalush, Lesley Kline, Shashi Koduru, Amy Love, Felecia Mann, David May, Steven McCawley, Tina McIntosh, Ivy McMullen, Mee Moy, Linda Moy, Brian Murphy, Keith Nelson, Cynthia Pfannkoch, Eric Pratts, Vinita Puri, Hina Qureshi, Matthew Reardon, Robert Rodriguez, Yu-Hui Rogers, Deanna Romblad, Bob Ruhfel, Richard Scott, Cynthia Sitter, Michelle Smallwood, Erin Stewart, Renee Strong, Ellen Suh, Reginald Thomas, Ni Ni Tint, Sukyee Tse, Claire Vech, Gary Wang, Jeremy Wetter, Sherita Williams, Monica Williams, Sandra Windsor, Emily Winn-Deen, Keriellen Wolfe, Jayshree Zaveri, Karena Zaveri, Josep F. Abril, Roderic Guigó, Michael J. Campbell, Kimmen V. Sjolander, Brian Karlak, Anish Kejariwal, Huaiyu Mi, Betty Lazareva, Thomas Hatton, Apurva Narechania, Karen Diemer, Anushya Muruganujan, Nan Guo, Shinji Sato, Vineet Bafna, Sorin Istrail, Ross Lippert, Russell Schwartz, Brian Walenz, Shibu Yooseph, David Allen, Anand Basu, James Baxendale, Louis Blick, Marcelo Caminha, John Carnes-Stine, Parris Caulk, Yen-Hui Chiang, My Coyne, Carl Dahlke, Anne Deslattes Mays, Maria Dombroski, Michael Donnelly, Dale Ely, Shiva Esparham, Carl Fosler, Harold Gire, Stephen Glanowski, Kenneth Glasser, Anna Glodek, Mark Gorokhov, Ken Graham, Barry Gropman, Michael Harris, Jeremy Heil, Scott Henderson, Jeffrey Hoover, Donald Jennings, Catherine Jordan, James Jordan, John Kasha, Leonid Kagan, Cheryl Kraft, Alexander Levitsky, Mark Lewis, Xiangjun Liu, John Lopez, Daniel Ma, William Majoros, Joe McDaniel, Sean Murphy, Matthew Newman, Trung Nguyen, Ngoc Nguyen, Marc Nodell, Sue Pan, Jim Peck, William Rowe, Robert Sanders, John Scott, Michael Simpson, Thomas Smith, Arlan Sprague, Timothy Stockwell, Russell Turner, Eli Venter, Mei Wang, Meiyuan Wen, David Wu, Mitchell Wu, Ashley Xia, Ali Zandieh, Xiaohong Zhu

    Web Fig. 1: Annotation of the Celera Human Genome Assembly


    Supplementary Material

    Supplemental Table 1. Chromosomal distribution of intronless paralogs.
    IP | SOURCE | IP chr | SOURCE chr | DEFINITION/PROTEIN NAME | SOURCE GENBANK INDEX NUMBER
    hCP48693 | hCP37465 | chr20 | chr6 | BCL2-antagonist/killer 1; cell death inhibitor 1 | 4502363
    hCP45781 | hCP44963 | chr1 | chr13 | RNA polymerase I 16 kDa subunit | 7705740
    hCP45175 | hCP40330 | chr9 | chr6 | eukaryotic translation elongation factor 1 alpha 1 | 4503471
    hCP47770 | hCP40330 | chr5 | chr6 | eukaryotic translation elongation factor 1 alpha 1 | 4503471
    hCP48153 | hCP40330 | chr7 | chr6 | eukaryotic translation elongation factor 1 alpha 1 | 4503471
    hCP45176 | hCP40330 | chr9 | chr6 | eukaryotic translation elongation factor 1 alpha 1-like 14 | 4503473
    hCP47771 | hCP40330 | chr5 | chr6 | eukaryotic translation elongation factor 1 alpha 1-like 14 | 4503473
    hCP48154 | hCP40330 | chr7 | chr6 | eukaryotic translation elongation factor 1 alpha 1-like 14 | 4503473
    hCP43102 | hCP45627 | chr19 | chr8 | eukaryotic translation elongation factor 1 delta | 4503479
    hCP35783 | hCP52348 | chr2 | chr11 | eukaryotic translation initiation factor 3, subunit 5 | 4503519
    hCP34934 | hCP39451 | chr2 | chr12 | heterogeneous nuclear ribonucleoprotein A1 | 4504445
    hCP44529 | hCP39451 | chr13 | chr12 | heterogeneous nuclear ribonucleoprotein A1 | 4504445
    hCP43335 | hCP44582 | chr12 | chr17 | non-metastatic cells 2, protein (NM23B) | 4505409
    hCP50486 | hCP40136 | chr8 | chr14 | proteasome (prosome, macropain) 26S subunit, ATPase, 6 | 4506215
    hCP42158 | hCP46912 | chr12 | chr2 | prothymosin, alpha (gene sequence 28) | 4506277
    hCP45662 | hCP37866 | chr2 | chr18 | ras homolog gene family, member B; RhoB; | 4757764
    hCP36390 | hCP37104 | chr8 | chr7 | H2A histone family, member Z | 4504255
    hCP43150 | hCP37104 | chr15 | chr7 | H2A histone family, member Z | 4504255
    hCP46278 | hCP39295 | chr4 | chr3 | NADH dehydrogenase (ubiquinone) 1 beta | 6041669
    hCP36897 | hCP34863 | chr5 | chr12 | RAP1A, member of RAS oncogene | 4506413
    hCP46034 | hCP35342 | chr17 | chrX | adaptor-related protein complex 1, sigma | 4506957
    hCP35830 | hCP51363 | chr6 | chr4 | alcohol dehydrogenase 5 (class III), | 4501937
    hCP44848 | hCP40594 | chr9 | chr1 | cell division cycle 20 | 4557437
    hCP50246 | hCP49760 | chr4 | chr1 | cell division cycle 42 | 4757952
    hCP48857 | hCP48263 | chr15 | chr7 | chromobox homolog 3 | 6005780
    hCP46158 | hCP42461 | chr13 | chr11 | diacylglycerol kinase, zeta (104kD) | 4503317
    hCP37376 | hCP49621 | chr7 | chr11 | eukaryotic translation elongation factor 1 | 4503481
    hCP36628 | hCP39069 | chr15 | chr8 | fatty acid binding protein 5 | 4557581
    hCP43166 | hCP39069 | chr15 | chr8 | fatty acid binding protein 5 | 4557581
    hCP49159 | hCP35287 | chr20 | chr19 | ferritin, light polypeptide; hypothetical protein | 4503797
    hCP38447 | hCP34306 | chr19 | chr16 | glycine cleavage system protein H | 4758424
    hCP51879 | hCP34306 | chr1 | chr16 | glycine cleavage system protein H | 4758424
    hCP42685 | hCP43793 | chr20 | chr13 | high-mobility group (nonhistone chromosomal) protein | 4504425
    hCP42865 | hCP43793 | chr3 | chr13 | high-mobility group (nonhistone chromosomal) protein | 4504425
    hCP49883 | hCP43793 | chr20 | chr13 | high-mobility group (nonhistone chromosomal) protein | 4504425
    hCP50559 | hCP43793 | chr15 | chr13 | high-mobility group (nonhistone chromosomal) protein | 4504425
    hCP50984 | hCP43793 | chr22 | chr13 | high-mobility group (nonhistone chromosomal) protein | 4504425
    hCP37795 | hCP42486 | chrX | chr10 | phosphoglycerate mutase 1 (brain) | 4505753
    hCP49866 | hCP42486 | chr12 | chr10 | phosphoglycerate mutase 1 (brain) | 4505753
    hCP48871 | hCP33485 | chr4 | chr5 | pituitary tumor-transforming protein 1 | 4758980
    hCP43053 | hCP201144 | chr3 | chr6 | pre-B-cell leukemia transcription factor 2 | 4505625
    hCP47725 | hCP49987 | chr5 | chr14 | proteasome (prosome, macropain) activator subunit | 4506237
    hCP38333 | hCP35418 | chrX | chr7 | ras-related C3 botulinum toxin substrate | 5902042
    hCP40764 | hCP36922 | chr3 | chr6 | ribosomal protein L10a; neural precursor | 6325472
    hCP45150 | hCP34348 | chr17 | chr16 | ribosomal protein L13 | 4506599
    hCP42475 | hCP35262 | chr10 | chr19 | ribosomal protein L13a | 6912634
    hCP43258 | hCP35262 | chr12 | chr19 | ribosomal protein L13a | 6912634
    hCP43889 | hCP35262 | chr13 | chr19 | ribosomal protein L13a | 6912634
    hCP48078 | hCP35262 | chr12 | chr19 | ribosomal protein L13a | 6912634
    hCP51648 | hCP35262 | chr10 | chr19 | ribosomal protein L13a | 6912634
    hCP40196 | hCP41680 | chr3 | chr18 | ribosomal protein L17 | 4506617
    hCP35655 | hCP44971 | chr1 | chr13 | ribosomal protein L21 | 4506611
    hCP39305 | hCP44971 | chr14 | chr13 | ribosomal protein L21 | 4506611
    hCP41351 | hCP44971 | chr11 | chr13 | ribosomal protein L21 | 4506611
    hCP47660 | hCP44971 | chr4 | chr13 | ribosomal protein L21 | 4506611
    hCP47833 | hCP44971 | chr7 | chr13 | ribosomal protein L21 | 4506611
    hCP48726 | hCP44971 | chr4 | chr13 | ribosomal protein L21 | 4506611
    hCP49814 | hCP44971 | chr10 | chr13 | ribosomal protein L21 | 4506611
    hCP50215 | hCP44971 | chr10 | chr13 | ribosomal protein L21 | 4506611
    hCP34220 | hCP45368 | chr3 | chr17 | ribosomal protein L23a | 4506615
    hCP43068 | hCP45090 | chr12 | chr17 | ribosomal protein L26 | 4506621
    hCP38068 | hCP41270 | chr11 | chr11 | ribosomal protein L27a | 4506625
    hCP38885 | hCP41270 | chr6 | chr11 | ribosomal protein L27a | 4506625
    hCP34480 | hCP51948 | chr3 | chr3 | ribosomal protein L29 | 4506629
    hCP36437 | hCP43819 | chr7 | chr3 | ribosomal protein L32 | 4506635
    hCP39494 | hCP43819 | chr6 | chr3 | ribosomal protein L32 | 4506635
    hCP201498 | hCP44392 | chr1 | chr9 | ribosomal protein L35 | 6005860
    hCP51162 | hCP44392 | chr7 | chr9 | ribosomal protein L35 | 6005860
    hCP39467 | hCP43627 | chr14 | chr9 | ribosomal protein L7a | 4506661
    hCP41685 | hCP48439 | chr15 | chr4 | ribosomal protein L9 | 4506665
    hCP201561 | hCP35286 | chr12 | chr19 | ribosomal protein S11 | 4506681
    hCP42984 | hCP39006 | chr11 | chr6 | ribosomal protein S12 | 4506683
    hCP42446 | hCP47279 | chr1 | chr16 | ribosomal protein S15a | 4506689
    hCP201365 | hCP52071 | chr19 | chr19 | ribosomal protein S16 | 4506691
    hCP42269 | hCP52071 | chr1 | chr19 | ribosomal protein S16 | 4506691
    hCP40118 | hCP42669 | chr5 | chr15 | ribosomal protein S17 | 4506693
    hCP51382 | hCP42669 | chr22 | chr15 | ribosomal protein S17 | 4506693
    hCP35240 | hCP42007 | chr8 | chr12 | ribosomal protein S26 | 4506709
    hCP35551 | hCP42007 | chr2 | chr12 | ribosomal protein S26 | 4506709
    hCP36459 | hCP42007 | chrX | chr12 | ribosomal protein S26 | 4506709
    hCP38950 | hCP42007 | chr4 | chr12 | ribosomal protein S26 | 4506709
    hCP39533 | hCP42007 | chr8 | chr12 | ribosomal protein S26 | 4506709
    hCP41719 | hCP42007 | chr13 | chr12 | ribosomal protein S26 | 4506709
    hCP43701 | hCP42007 | chr9 | chr12 | ribosomal protein S26 | 4506709
    hCP44708 | hCP42007 | chr9 | chr12 | ribosomal protein S26 | 4506709
    hCP45758 | hCP42007 | chr15 | chr12 | ribosomal protein S26 | 4506709
    hCP46725 | hCP42007 | chr7 | chr12 | ribosomal protein S26 | 4506709
    hCP50862 | hCP42007 | chr8 | chr12 | ribosomal protein S26 | 4506709
    hCP51315 | hCP42007 | chr10 | chr12 | ribosomal protein S26 | 4506709
    hCP38914 | hCP34424 | chr16 | chr2 | ribosomal protein S27a | 4506713
    hCP50884 | hCP34424 | chr1 | chr2 | ribosomal protein S27a | 4506713
    hCP50962 | hCP38574 | chr22 | chr19 | ribosomal protein S9 | 4506745
    hCP37989 | hCP35777 | chrX | chr3 | teratocarcinoma-derived growth factor 1 | 4507425
    hCP40506 | hCP35239 | chr10 | chr8 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation | 4507953
    hCP50191 | hCP35239 | chr2 | chr8 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation | 4507953

    Supplemental Table 2. Examples of paralogs with disease associations on duplicated genome segments. (Both panels should be viewed in a linear format to appreciate the similarities in the diseases caused by the paralogous duplications)
    Gene name | Protein | Chr | Disease | OMIM
    complement component 8 | hCP40779 | 1 | C8 deficiency, type II | 120960
    nuclear receptor subfamily 3, group C | hCP48259 | 4 | Pseudohypoaldosteronism type I, | 600983
    complement component 9 | hCP47512 | 5 | C9 deficiency Immune | 120940
    glucocorticoid receptor | hCP47558 | 5 | Cortisol resistance Metabolic | 138040
    hexosaminidase B (beta polypeptide) | hCP48123 | 5 | Sandhoff disease, Neurological | 268800
    homeo box A13 | hCP48915 | 7 | Hand-foot-uterus syndrome | 142959
    tyrosine hydroxylase | hCP35709 | 11 | Segawa syndrome, | 191290
    phenylalanine hydroxylase | hCP39374 | 12 | Hyperphenylalaninemia, | 261600
    connexin 26 | hCP38170 | 13 | Deafness, autosomal dominant | 121011
    coagulation factor VII | hCP43441 | 13 | Factor VII deficiency | 227500
    coagulation factor X | hCP43442 | 13 | Factor X deficiency | 227600
    insulin promoter transcription factor 1 | hCP44966 | 13 | Pancreatic agenesis | 600733
    hexosaminidase A | hCP50228 | 15 | Hex A pseudodeficiency | 272800
    coagulation factor IX | hCP35448 | X | Hemophilia B Hematological | 306900
    coagulation factor IX | hCP35448 | X | Hemophilia B Hematological | 306900
    connexin 32 | hCP37674 | X | Charcot-Marie-Tooth neuropathy | 304040
    Duplicated gene name | Homolog | Chr | Disease | OMIM
    complement component 9 | hCP47512 | 5 | C9 deficiency | 120940
    glucocorticoid receptor | hCP47558 | 5 | Cortisol resistance Metabolic | 138040
    complement component 8 | hCP40779 | 1 | C8 deficiency, type II Immune | 120960
    nuclear receptor subfamily 3 | hCP48259 | 4 | Pseudohypoaldosteronism type I | 600983
    hexosaminidase A | hCP50228 | 15 | Hex A pseudodeficiency Neurological | 272800
    insulin promoter TF-1 | hCP44966 | 13 | Pancreatic agenesis Gastrointestinal | 600733
    phenylalanine hydroxylase | hCP39374 | 12 | Hyperphenylalaninemia, mild Neurological | 261600
    tyrosine hydroxylase | hCP35709 | 11 | Segawa syndrome, recessive Neurological | 191290
    connexin 32 | hCP37674 | 23 | Charcot-Marie-Tooth neuropathy | 304040
    coagulation factor IX | hCP35448 | 23 | Hemophilia B Hematological | 306900
    coagulation factor IX | hCP35448 | 23 | Hemophilia B Hematological | 306900
    homeo box A13 | hCP48915 | 7 | Hand-foot-uterus syndrome Renal | 142959
    hexosaminidase B | hCP48123 | 5 | Sandhoff disease, infantile | 268800
    coagulation factor VII | hCP43441 | 13 | Factor VII deficiency | 227500
    coagulation factor X | hCP43442 | 13 | Factor X deficiency | 227600
    connexin 26 | hCP38170 | 13 | Deafness, autosomal dominant 3 | 121011


    Supplemental Figure 2A. Karyotype analysis of donors.

    Supplemental Figure 2B.

    Supplemental Figure 2C.

    Supplemental Figure 2D.

    Supplemental Figure 2E.


    Supplemental Figure 3 -Chromosome 1. Comparison of the CSA and the PFP assembly. To generate the figure, Celera fragment sequences were mapped onto each assembly. The PFP assembly is indicated in the upper third of each panel; the Celera assembly is indicated in the lower third. In the center of the panel, green lines show Celera sequences that are in the same order and orientation in both assemblies and form the longest consistently ordered run of sequences. Yellow lines indicate sequence blocks that are in the same orientation, but out of order. Red lines indicate sequence blocks that are not in the same orientation. For clarity, in the latter two cases, lines are only drawn between segments of matching sequence that are at least 50 kbp long. The top and bottom thirds of each panel show the extent of Celera mate-pair violations (red, misoriented; yellow, incorrect distance between the mates) for each assembly grouped by library size. (Mate pairs that are within the correct distance, as expected from the mean library insert size, are omitted from the figure for clarity.) Predicted breakpoints, corresponding to stacks of violated mate pairs of the same type, are shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns are shown as cyan bars. Plots for each of the 24 chromosomes can be seen as separate files.
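
    The two kinds of mate-pair violation named in the caption above (misoriented mates, and mates at an incorrect distance) can be illustrated with a small sketch. This is not Celera's code: the `MatePair` fields, the function name, and the rule that a pair counts as correctly spaced when its span lies within three standard deviations of the mean library insert size are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Placement of two mated reads on one assembly (hypothetical layout)."""
    left_pos: int        # start coordinate of the leftmost read
    right_pos: int       # end coordinate of the rightmost read
    left_forward: bool   # True if the leftmost read lies on the forward strand
    right_forward: bool  # True if the rightmost read lies on the forward strand


def classify_mate_pair(pair: MatePair, mean_insert: float,
                       sd_insert: float, max_sd: float = 3.0) -> str:
    """Return 'misoriented', 'wrong_distance', or 'ok' for one mate pair.

    Assumption: correctly placed mates face each other, so the left read is
    forward and the right read is reverse. The max_sd cutoff is illustrative.
    """
    if not (pair.left_forward and not pair.right_forward):
        return "misoriented"          # drawn in red in the figure
    span = pair.right_pos - pair.left_pos
    if abs(span - mean_insert) > max_sd * sd_insert:
        return "wrong_distance"       # drawn in yellow in the figure
    return "ok"                       # omitted from the figure for clarity


# Example with a made-up library: mean insert 2000 bp, standard deviation 200 bp.
good = MatePair(1000, 3000, True, False)
flipped = MatePair(1000, 3000, True, True)
stretched = MatePair(1000, 13000, True, False)
print(classify_mate_pair(good, 2000, 200))
print(classify_mate_pair(flipped, 2000, 200))
print(classify_mate_pair(stretched, 2000, 200))
```

    Orientation is tested before distance in this sketch, so a pair that is both misoriented and badly spaced is reported only as misoriented. A stack of such violations of the same type at one position is what the caption calls a predicted breakpoint.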


    Supplemental Figure 3 -Chromosome 2.

    Supplemental Figure 3 -Chromosome 3.

    Supplemental Figure 3 -Chromosome 4.

    Supplemental Figure 3 -Chromosome 5.

    Supplemental Figure 3 -Chromosome 6.

    Supplemental Figure 3 -Chromosome 7.

    Supplemental Figure 3 -Chromosome 8.

    Supplemental Figure 3 -Chromosome 9.

    Supplemental Figure 3 -Chromosome 10.

    Supplemental Figure 3 -Chromosome 11.

    Supplemental Figure 3 -Chromosome 12.

    Supplemental Figure 3 -Chromosome 13.

    Supplemental Figure 3 -Chromosome 14.

    Supplemental Figure 3 -Chromosome 15.

    Supplemental Figure 3 -Chromosome 16.

    Supplemental Figure 3 -Chromosome 17.

    Supplemental Figure 3 -Chromosome 18.

    Supplemental Figure 3 -Chromosome 19.

    Supplemental Figure 3 -Chromosome 20.

    Supplemental Figure 3 -Chromosome 21.

    Supplemental Figure 3 -Chromosome 22.

    Supplemental Figure 3 -Chromosome 23.

    Supplemental Figure 3 -Chromosome 24.


  • Abstract
    Full Text
    The Sequence of the Human Genome
    J. Craig Venter,* Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Huson, Jennifer Russo Wortman, Qing Zhang, Chinnappa D. Kodira, Xiangqun H. Zheng, Lin Chen, Marian Skupski, Gangadharan Subramanian, Paul D. Thomas, Jinghui Zhang, George L. Gabor Miklos, Catherine Nelson, Samuel Broder, Andrew G. Clark, Joe Nadeau, Victor A. McKusick, Norton Zinder, Arnold J. Levine, Richard J. Roberts, Mel Simon, Carolyn Slayman, Michael Hunkapiller, Randall Bolanos, Arthur Delcher, Ian Dew, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy, Clark Mobarry, Knut Reinert, Karin Remington, Jane Abu-Threideh, Ellen Beasley, Kendra Biddick, Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, Valentina Di Francesco, Patrick Dunn, Karen Eilbeck, Carlos Evangelista, Andrei E. Gabrielian, Weiniu Gan, Wangmao Ge, Fangcheng Gong, Zhiping Gu, Ping Guan, Thomas J. Heiman, Maureen E. Higgins, Rui-Ru Ji, Zhaoxi Ke, Karen A. Ketchum, Zhongwu Lai, Yiding Lei, Zhenya Li, Jiayin Li, Yong Liang, Xiaoying Lin, Fu Lu, Gennady V. Merkulov, Natalia Milshina, Helen M. Moore, Ashwinikumar K. Naik, Vaibhav A. Narayan, Beena Neelam, Deborah Nusskern, Douglas B. Rusch, Steven Salzberg, Wei Shao, Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, Ron Wides, Chunlin Xiao, Chunhua Yan, Alison Yao, Jane Ye, Ming Zhan, Weiqing Zhang, Hongyu Zhang, Qi Zhao, Liansheng Zheng, Fei Zhong, Wenyan Zhong, Shiaoping C. Zhu, Shaying Zhao, Dennis Gilbert, Suzanna Baumhueter, Gene Spier, Christine Carter, Anibal Cravchik, Trevor Woodage, Feroze Ali, Huijin An, Aderonke Awe, Danita Baldwin, Holly Baden, Mary Barnstead, Ian Barrow, Karen Beeson, Dana Busam, Amy Carver, Angela Center, Ming Lai Cheng, Liz Curry, Steve Danaher, Lionel Davenport, Raymond Desilets, Susanne Dietz, Kristina Dodson, Lisa Doup, Steven Ferriera, Neha Garg, Andres Gluecksmann, Brit Hart, Jason Haynes, Charles Haynes, Cheryl Heiner, Suzanne Hladun, Damon Hostin, Jarrett Houck, Timothy Howland, Chinyere Ibegwam, Jeffery Johnson, Francis Kalush, Lesley Kline, Shashi Koduru, Amy Love, Felecia Mann, David May, Steven McCawley, Tina McIntosh, Ivy McMullen, Mee Moy, Linda Moy, Brian Murphy, Keith Nelson, Cynthia Pfannkoch, Eric Pratts, Vinita Puri, Hina Qureshi, Matthew Reardon, Robert Rodriguez, Yu-Hui Rogers, Deanna Romblad, Bob Ruhfel, Richard Scott, Cynthia Sitter, Michelle Smallwood, Erin Stewart, Renee Strong, Ellen Suh, Reginald Thomas, Ni Ni Tint, Sukyee Tse, Claire Vech, Gary Wang, Jeremy Wetter, Sherita Williams, Monica Williams, Sandra Windsor, Emily Winn-Deen, Keriellen Wolfe, Jayshree Zaveri, Karena Zaveri, Josep F. Abril, Roderic Guigó, Michael J. Campbell, Kimmen V. Sjolander, Brian Karlak, Anish Kejariwal, Huaiyu Mi, Betty Lazareva, Thomas Hatton, Apurva Narechania, Karen Diemer, Anushya Muruganujan, Nan Guo, Shinji Sato, Vineet Bafna, Sorin Istrail, Ross Lippert, Russell Schwartz, Brian Walenz, Shibu Yooseph, David Allen, Anand Basu, James Baxendale, Louis Blick, Marcelo Caminha, John Carnes-Stine, Parris Caulk, Yen-Hui Chiang, My Coyne, Carl Dahlke, Anne Deslattes Mays, Maria Dombroski, Michael Donnelly, Dale Ely, Shiva Esparham, Carl Fosler, Harold Gire, Stephen Glanowski, Kenneth Glasser, Anna Glodek, Mark Gorokhov, Ken Graham, Barry Gropman, Michael Harris, Jeremy Heil, Scott Henderson, Jeffrey Hoover, Donald Jennings, Catherine Jordan, James Jordan, John Kasha, Leonid Kagan, Cheryl Kraft, Alexander Levitsky, Mark Lewis, Xiangjun Liu, John Lopez, Daniel Ma, William Majoros, Joe McDaniel, Sean Murphy, Matthew Newman, Trung Nguyen, Ngoc Nguyen, Marc Nodell, Sue Pan, Jim Peck, William Rowe, Robert Sanders, John Scott, Michael Simpson, Thomas Smith, Arlan Sprague, Timothy Stockwell, Russell Turner, Eli Venter, Mei Wang, Meiyuan Wen, David Wu, Mitchell Wu, Ashley Xia, Ali Zandieh, Xiaohong Zhu

    Web Fig. 1: Annotation of the Celera Human Genome Assembly

    Initial annotations of the Celera compartmentalized shotgun assembly (CSA) of the human genome, including transcripts, sequence characteristics, polymorphisms, and molecular markers, are presented. Each track of the figure is divided into three areas: forward-strand transcripts, sequence analysis, and reverse-strand transcripts (from top to bottom, respectively). The end of each chromosome tier is depicted as white space because it is not yet clear that the CSA includes the telomeres. The genome sequence is displayed on a nucleotide scale of approximately 600 kbp/cm. Molecular genetic markers are shown above the nucleotide scale at the top of each track and are derived from the Marshfield map (http://research.marshfieldclinic.org/genetics/Map_Markers/maps/IndexMapFrames.html). Genes are adjacent to the sequence analysis tiers. They are color-coded by the algorithm used to define the transcript structure (see figure key) and are given a minimum length of 20 kbp for display purposes. The structure of transcripts with two or more exons is displayed in one of two expanded transcript tiers at 120 kbp/cm resolution above or below the genes for forward- and reverse-strand transcripts, respectively. Exons are depicted as black boxes and intronic regions are color-coded for transcripts assigned to the 14 largest Gene Ontology (GO, http://www.geneontology.org) categories. Single-exon transcripts are color-coded by GO classification and are displayed in a tier between the unexpanded transcripts and the sequence analysis tiers. Transcripts predicted by Celera's annotation algorithm (Otto) that correspond to RefSeq transcripts (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html) are assigned HUGO gene symbols (http://www.gene.ucl.ac.uk/nomenclature) if the RefSeq transcripts are associated with HUGO symbols by LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) and if the transcripts are longer than 25 kbp (to prevent overlap of gene symbols).

    There are three sequence analyses in the middle section of the tracks: G+C content, CpG islands, and SNP density. G+C content is depicted on a nonlinear scale described in the legend. A black box indicates the position of CpG islands. SNPs were identified by comparison of the Celera sequence with a genome assembly available at http://genome.ucsc.edu/. The range of SNP density is depicted above the color gradient in the legend. The natural log of the SNP density is used to color-code the SNP density analysis tier. Gaps within scaffolds are visible as white space in the G+C content tier if the gap is sufficiently large. Gaps between scaffolds are assigned a length of 2 kbp. Scaffold order along the chromosomes was determined by mate-pair information and alignment of scaffold sequence to the GeneMap'99 STS map (http://www.ncbi.nlm.nih.gov/genemap99/) and the Washington University BAC fingerprint map (http://www.genome.wustl.edu/gsc/mapping/). The centromere is depicted as a blue line crossing the annotation tiers and its position is approximated by the transition from p to q arms along the genome sequence, except for acrocentric chromosomes, for which the centromere is placed at the beginning of the sequence analysis tiers.

    The figure was generated with "gff2ps" (http://www1.imim.es/software/gfftools/GFF2PS.html), a genome annotation tool that converts General Feature Formatted records (http://www.sanger.ac.uk/Software/formats/GFF/) to a PostScript output [J. F. Abril, R. Guigó, Bioinformatics 16, 743 (2000)].
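    As a rough illustration of the kind of record such a tool consumes, the sketch below parses one tab-delimited GFF line and applies the 20 kbp minimum display length that the caption gives for genes. It does not reproduce gff2ps. The function names, the sample values, and the choice to widen short features by extending the end coordinate are assumptions made for illustration.

```python
from typing import NamedTuple, Tuple

MIN_DISPLAY_BP = 20_000  # minimum drawn gene length stated in the caption


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str  # optional trailing column; empty if absent


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF line into its standard fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, attributes)


def display_span(rec: GffRecord, min_len: int = MIN_DISPLAY_BP) -> Tuple[int, int]:
    """Return the (start, end) to draw, widening short features to min_len.

    Assumption: short features keep their start and have their end extended;
    the caption does not say how the figure places the padding.
    """
    length = rec.end - rec.start + 1
    if length >= min_len:
        return rec.start, rec.end
    return rec.start, rec.start + min_len - 1


if __name__ == "__main__":
    # Hypothetical record; the scaffold and gene names are made up.
    rec = parse_gff_line('scaffold_1\tOtto\tgene\t1000\t5999\t.\t+\t.\tgene_id "g1"')
    print(display_span(rec))
```

    A feature already longer than the minimum is returned unchanged, so only short genes are affected by the display rule.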

    Downloadable PDFs of Chromosome Maps
    Chromosome 1 (889K)    Chromosome 9 (555K)     Chromosome 17 (611K)
    Chromosome 2 (788K)    Chromosome 10 (557K)    Chromosome 18 (412K)
    Chromosome 3 (713K)    Chromosome 11 (672K)    Chromosome 19 (620K)
    Chromosome 4 (598K)    Chromosome 12 (648K)    Chromosome 20 (460K)
    Chromosome 5 (647K)    Chromosome 13 (444K)    Chromosome 21 (354K)
    Chromosome 6 (659K)    Chromosome 14 (499K)    Chromosome 22 (440K)
    Chromosome 7 (608K)    Chromosome 15 (485K)    Chromosome X (540K)
    Chromosome 8 (547K)    Chromosome 16 (504K)    Chromosome Y (296K)
    Legend Key
    PDF of Legend Key


    General Information

    The following article is not an official Japanese translation by the staff of Science, nor is it endorsed by Science as accurate. Rather, this work was created by the Alliance of Science and Communication Advancement (ASCA). Although it was prepared with best efforts by professionally qualified translators and editors, this translation may differ slightly from the original English-language version. In crucial matters, please refer to Science's official English-language version.

    Viewing the Japanese Language PDF

    This Japanese PDF file can be viewed by using version 5.0 of Adobe Acrobat along with the Japanese font pack (the latter is available at http://www.adobe.com/products/acrobat/acrrasianfontpack.html). If you have an earlier version of Acrobat or Acrobat Reader, you will need to have version 3.0 or higher running under a Japanese OS (Win or Mac).



