The Complete Genome Sequence of Escherichia coli K-12


Science  05 Sep 1997:
Vol. 277, Issue 5331, pp. 1453-1462
DOI: 10.1126/science.277.5331.1453


  • Figure 1

    The overall structure of the E. coli genome. The origin and terminus of replication are shown as green lines, with blue arrows indicating replichores 1 and 2. A scale indicates the coordinates both in base pairs and in minutes (actually centisomes, or 100 equal intervals of the DNA). The distribution of genes is depicted on two outer rings: The orange boxes are genes located on the presented strand, and the yellow boxes are genes on the opposite strand. Red arrows show the location and direction of transcription of rRNA genes, and tRNA genes are shown as green arrows. The next circle illustrates the positions of REP sequences around the genome as radial tick marks. The central orange sunburst is a histogram of inverse CAI (1 – CAI), in which long yellow rays represent clusters of low (<0.25) CAI. The CAI plot is enclosed by a ring indicating similarities between previously described bacteriophage proteins and the proteins encoded by the complete E. coli genome; the similarity is plotted as described in Fig. 3 for the complete genome comparisons.
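
The centisome scale in the caption (100 equal intervals of the DNA) can be illustrated with a short conversion. This is a minimal sketch, not the authors' code: the function names are hypothetical, the genome length of 4,639,221 bp is the published sequence length and is not stated in this caption, and the sketch assumes that coordinate 0 of the sequence coincides with 0 centisomes.

```python
# Assumed value: published length of the E. coli K-12 sequence (not given in the caption).
GENOME_LENGTH_BP = 4_639_221


def bp_to_centisome(position_bp, genome_length_bp=GENOME_LENGTH_BP):
    """Map a base-pair coordinate onto a 0-100 centisome scale."""
    if not 0 <= position_bp <= genome_length_bp:
        raise ValueError("position lies outside the genome")
    return 100.0 * position_bp / genome_length_bp


def centisome_to_bp(centisome, genome_length_bp=GENOME_LENGTH_BP):
    """Map a centisome value back to the nearest base-pair coordinate."""
    if not 0 <= centisome <= 100:
        raise ValueError("centisome must lie between 0 and 100")
    return round(centisome / 100.0 * genome_length_bp)


if __name__ == "__main__":
    print(bp_to_centisome(1_000_000))
    print(centisome_to_bp(50))
```

Passing the genome length as a parameter keeps the sketch usable for other circular genomes.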

  • Figure 2

    Base composition is not randomly distributed in the genome. G-C skew [(G – C)/(G + C)] is plotted as a 10-kb window average for one strand of the entire E. coli genome. Skew plots for the three codon positions are presented separately; leftward genes, rightward genes, and non–protein-coding regions are shown in lines 5, 6, and 7. The two horizontal lines below the skew plots show the distribution of two highly skewed octamer sequences, GCTGGTGG (Chi) and GCAGGGCG (8-mer). Tick marks indicate the position of each copy of a sequence in the complete genome and are vertically offset to indicate the strand containing the sequence. The next 18 horizontal lines correspond to distinct classes of repetitive elements. The penultimate line contains a histogram showing the similarity (the product of the percent of each protein in the pairwise alignment and the percent amino acid identity across the aligned region) of known phage proteins to the proteins encoded by the complete E. coli genome. The last line indicates the position and orientation of the EcoK restriction-modification site AACNNNNNNGTGC (N, any nucleotide). Two vertical lines through the plots show the location of the origin and terminus of replication.
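
The G-C skew defined in this caption, (G – C)/(G + C), can be sketched for one strand as follows. The function names are hypothetical, and the caption does not say whether the 10-kb windows overlap or how they were averaged, so non-overlapping windows are an assumption made here for simplicity.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one strand, or 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window=10_000, step=10_000):
    """Yield (window_start, skew) for consecutive windows along one strand.

    Assumption: non-overlapping windows (step == window) by default.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_skew(seq[start:start + window])


def codon_position_skew(cds, position):
    """Skew at codon position 1, 2, or 3 of an in-frame coding sequence."""
    if position not in (1, 2, 3):
        raise ValueError("codon position must be 1, 2, or 3")
    return gc_skew(cds[position - 1::3])


if __name__ == "__main__":
    toy = "GGCCGGGG"  # toy sequence, not genome data
    print(list(windowed_gc_skew(toy, window=4, step=4)))
```

The per-codon-position helper mirrors the separate skew plots for the three codon positions; it expects a coding sequence that starts in frame.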


  • Table 1

    G-C skew for each of the three codon positions, calculated separately for the coding strand of 2357 forward genes (whose coding strand is the leading strand) and 1929 backward genes (whose coding strand is the lagging strand). The net skew attributable to replication direction is the difference between the values for the forward and the backward genes divided by 2.

    Position  Forward genes  Backward genes  Average G-C skew  Net G-C skew attributable to replication direction
    1         19.41          16.08           17.74             1.66
    2         –9.34          –11.79          –10.57            1.22
    3         7.99           –0.48           3.75              4.23
    Average   6.02           1.27            3.64              2.37
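
The arithmetic behind the last two columns of Table 1 can be checked with a few lines. This is a sketch with a hypothetical function name; the forward and backward values are those read from Table 1, and the average column is assumed to be the plain mean of the two, which is consistent with the tabulated numbers.

```python
def average_and_net_skew(forward, backward):
    """Return (average skew, net skew attributable to replication direction)."""
    average = (forward + backward) / 2
    net = (forward - backward) / 2  # difference divided by 2, as in the caption
    return average, net


# Forward and backward G-C skew per codon position, as read from Table 1.
TABLE_1 = {1: (19.41, 16.08), 2: (-9.34, -11.79), 3: (7.99, -0.48)}

if __name__ == "__main__":
    for position, (fwd, bwd) in TABLE_1.items():
        avg, net = average_and_net_skew(fwd, bwd)
        print(f"position {position}: average {avg:.3f}, net {net:.3f}")
```

The results agree with the tabulated values to within rounding in the second decimal place.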
  • Table 2

    Frequent octamers and their skew. The 24 most frequent octamers are ranked by frequency of occurrence on the leading strand (octamers with the same frequency of occurrence are ordered alphabetically). Frequent octamers that are reverse complements of frequent octamers are identified by their rank (in parentheses) beside that of their complement. All primary sequences are aligned by the CTG trimer. The average spacing for nonoverlapping sequences from this list is on the order of 1.3 kb. The percent skew is 100 × (f – f′)/(f + f′), where f is the frequency of an octamer and f′ that of its reverse complement.

    Rank       Octamer         Skew (%)  Count
    1 (9)         cgCTGgcg     15.6      867
    2 (16)      ggcgCTGg       19.6      826
    3 (= Chi)      gCTGgtgg    50.8      761
    4 (17)         gCTGgcgg    13.1      719
    5 (11)        tgCTGgcg     9.4       719
    6            gcgCTGgc      17.2      691
    7          tggcgCTG        15.4      677
    8 (24)         gCTGgcgc    12.6      659
    10            cgCTGgtg     27.0      617
    12              CTGgcggc   16.2      589
    13              CTGgcgca   13.2      575
    14             gCTGgcga    9.4       570
    15               TGgcggcg  19.3      561
    18            aaCTGgcg     12.5      543
    19             gCTGgaag    11.0      538
    20              CTGgcgcg   15.4      524
    21           gcgCTGga      16.4      519
    22              CTGgcgaa   14.9      515
    23            tgCTGgtg     29.1      515
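
The percent skew formula from the Table 2 caption, 100 × (f – f′)/(f + f′), can be sketched as below. The function names are hypothetical. The sketch counts an octamer and its reverse complement along a single given strand; the table itself refers to the leading strand, which switches at the origin and terminus of replication, and that bookkeeping is deliberately left out here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlapping matches."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def percent_skew(seq, motif):
    """Return 100 * (f - f') / (f + f') for motif on the given strand."""
    seq, motif = seq.upper(), motif.upper()
    f = count_overlapping(seq, motif)
    f_prime = count_overlapping(seq, reverse_complement(motif))
    if f + f_prime == 0:
        return 0.0
    return 100.0 * (f - f_prime) / (f + f_prime)


if __name__ == "__main__":
    # Toy strand: three copies of Chi and one copy of its reverse complement.
    toy = "AAAA".join(["GCTGGTGG"] * 3 + ["CCACCAGC"])
    print(percent_skew(toy, "gCTGgtgg"))
```

Counting the reverse complement on the same strand is equivalent to counting the octamer itself on the opposite strand.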
  • Table 3

    Distribution of CTAG sequences.

    Category of DNA                   CTAG count  Average spacing
    All E. coli                       886         7161
    Protein-coding sequence           569         7159
    TAG terminators                   67
    REP sequences                     4           6144
    All non–protein-coding sequences  317         1782
    Regulatory regions                251         1999
    rRNA genes                        46          697
    tRNA genes                        13          514
    10Sa RNA (ssrA)                   2           233
    RNase P M1 RNA (rnpB)             1           377
    Expected from base composition    18,101      256
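
The last row of Table 3 compares the observed CTAG counts with an expectation from base composition. The table does not say how that expectation was computed; the sketch below assumes the simplest model, in which each base is drawn independently with its observed frequency. The function names are hypothetical. With equal base frequencies this model gives a per-position probability of (1/4)^4 = 1/256, which is in line with the expected spacing of 256 in the table.

```python
from collections import Counter


def expected_motif_count(seq, motif="CTAG"):
    """Expected occurrences of motif under an independent-base model.

    Assumption: bases are independent, each with its observed frequency in seq.
    """
    seq = seq.upper()
    n = len(seq)
    if n < len(motif):
        return 0.0
    counts = Counter(seq)
    probability = 1.0
    for base in motif.upper():
        probability *= counts[base] / n
    # Number of positions at which the motif could start.
    return probability * (n - len(motif) + 1)


def average_spacing(length_bp, count):
    """Average distance between occurrences in a region of length_bp."""
    return length_bp / count


if __name__ == "__main__":
    toy = "ACGT" * 250  # toy sequence with equal base frequencies
    print(expected_motif_count(toy))
    print(average_spacing(1000, 4))
```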
  • Table 4

    Distribution of E. coli proteins among 22 functional groups (simplified schema).

    Functional class                                            Number  Percent of total
    Regulatory function                                         45      1.05
    Putative regulatory proteins                                133     3.10
    Cell structure                                              182     4.24
    Putative membrane proteins                                  13      0.30
    Putative structural proteins                                42      0.98
    Phage, transposons, plasmids                                87      2.03
    Transport and binding proteins                              281     6.55
    Putative transport proteins                                 146     3.40
    Energy metabolism                                           243     5.67
    DNA replication, recombination, modification, and repair    115     2.68
    Transcription, RNA synthesis, metabolism, and modification  55      1.28
    Translation, posttranslational protein modification         182     4.24
    Cell processes (including adaptation, protection)           188     4.38
    Biosynthesis of cofactors, prosthetic groups, and carriers  103     2.40
    Putative chaperones                                         9       0.21
    Nucleotide biosynthesis and metabolism                      58      1.35
    Amino acid biosynthesis and metabolism                      131     3.06
    Fatty acid and phospholipid metabolism                      48      1.12
    Carbon compound catabolism                                  130     3.03
    Central intermediary metabolism                             188     4.38
    Putative enzymes                                            251     5.85
    Other known genes (gene product or phenotype known)         26      0.61
    Hypothetical, unclassified, unknown                         1632    38.06
    Total                                                       4288    100.00*

    * Total of these rounded values is 99.97%.
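
The footnote to Table 4 follows from rounding each percentage to two decimals before summing. A minimal sketch of that arithmetic, using the counts read from the table in row order (the variable names are hypothetical):

```python
# Protein counts per functional class, in the order listed in Table 4.
COUNTS = [45, 133, 182, 13, 42, 87, 281, 146, 243, 115, 55, 182, 188,
          103, 9, 58, 131, 48, 130, 188, 251, 26, 1632]

total = sum(COUNTS)

# Percent of total for each class, rounded to two decimals as in the table.
rounded_percents = [round(100 * n / total, 2) for n in COUNTS]

# Summing the rounded values accumulates the per-row rounding errors.
rounded_sum = round(sum(rounded_percents), 2)

if __name__ == "__main__":
    print(total, rounded_sum)
```

The unrounded percentages sum to exactly 100; the shortfall comes only from per-row rounding.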
