Research Article

Long-read sequence assembly of the gorilla genome


Science  01 Apr 2016:
Vol. 352, Issue 6281, aae0344
DOI: 10.1126/science.aae0344
  • Long-read sequence assembly of the gorilla genome.

    (A) Susie, a female Western lowland gorilla, was used as the reference sample for full-genome sequencing and assembly [photograph courtesy of Max Block]. (B and C) Treemaps representing the differences in fragmentation between the long-read and short-read gorilla genome assemblies. The rectangles are the largest contigs that cumulatively make up 300 Mbp (~10%) of the assembly.
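The rectangle selection described in the caption can be sketched as follows. This is a minimal illustration rather than the authors' plotting code: the function name `largest_contigs_up_to` and the toy lengths are hypothetical, and only the 300-Mbp target comes from the caption.

```python
def largest_contigs_up_to(lengths, target_bp=300_000_000):
    """Return the largest contigs whose cumulative length first reaches target_bp.

    `lengths` is an iterable of contig lengths in bp (hypothetical input;
    the real assemblies are not reproduced here).
    """
    selected = []
    total = 0
    for length in sorted(lengths, reverse=True):
        if total >= target_bp:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a contiguous assembly needs fewer rectangles than a
# fragmented one to cover the same cumulative length.
contiguous = [5, 4, 3, 2, 1]
fragmented = [1] * 15
print(len(largest_contigs_up_to(contiguous, target_bp=9)))   # 2
print(len(largest_contigs_up_to(fragmented, target_bp=9)))   # 9
```

The number of rectangles needed to reach the target is a simple proxy for fragmentation: fewer rectangles means larger contigs.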

  • Fig. 1 Gorilla genome assembly.

    (A) Schematic depicting assembly contig lengths (contig N50 = 9.6 Mbp) mapped to human GRCh38 chromosomes. The first two rows of black rectangles represent contigs >3 Mbp, the blue rectangles correspond to contigs ≤3 Mbp, and red rectangles correspond to blocks of human/gorilla segmental duplications >100 kbp. (B) Mappability and satellite content of Susie3 contigs. Satellite content was defined by using RepeatMasker (28) and Tandem Repeats Finder (29). Contigs that could not be mapped to GRCh38 with BLASR (30) (colored red) contain a high fraction of satellite sequence. (C) Length distribution of gaps in the published gorilla assembly gorGor3 that were closed by Susie3 and contain exons or regulatory regions. Of the gaps in gorGor3, 94% were closed in Susie3, with thousands corresponding to missing exons (red) and putative noncoding regulatory DNA (blue).

  • Fig. 2 Gorilla genome ideogram.

    Schematic depicting assembly contig lengths mapped to gorilla chromosomes. The first two rows of black rectangles represent contigs >3 Mbp, the green rectangles correspond to contigs >1 Mbp and ≤3 Mbp, and blue rectangles correspond to contigs ≤1 Mbp.

  • Fig. 3 Comparison of gorilla genome assemblies.

    The contig length distribution for the resulting long-read assembly (Susie3) is 2 to 3 orders of magnitude larger when compared with previous gorilla genome assemblies (gorGor3 and gorGor4) that were generated by using Illumina and Sanger sequencing technology.

  • Fig. 4 Gene annotation and structural variation.

    (A) Proportion of GENCODE transcripts with assembly errors when aligned with gorilla assemblies Susie3 and gorGor3, and three reference assemblies, including orangutan (ponAbe2), chimpanzee (panTro4), and squirrel monkey (saiBol1). Examples of assembly errors include transcript mappings extending off the end of contigs/scaffolds, containing unknown bases, or incomplete transcript mapping. (B) An example of a gene, otoancorin (OTOA), with complete exon representation (red ticks) resolved in the new assembly. Red bars on gorGor3 sequence indicate gaps in the assembly. Alignments between gorilla assemblies are based on Miropeats (31). (C) Alignment of MHC Class II locus in Susie3 against GRCh37 with Miropeats. Alignment identities of collinear blocks between assemblies are shown above the corresponding GRCh37 sequence. Repeats internal to Susie3 are shown in red along the coordinates. Alignment identity across the entire locus is shown below the Susie3 contigs in 5-kbp windows (1 kbp sliding). Support for the proper organization of the Susie3 sequence is shown by the tiling path of concordant BAC end sequences from the Kamilah BAC library (CHORI-277). (D) A sequence-resolved complex gorilla genome structural variation orthologous to human chromosome 19:38,867,213−39,866,620 (GRCh38). The dot-matrix plot shows a 125,375-bp inversion flanked by a proximal 16-kbp deletion and 8-kbp insertion, and a 23-kbp distal deletion. The deletions remove the entire sequences of the SELV and CLC genes in gorilla when compared with human.
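The windowed identity track in (C) (5-kbp windows sliding by 1 kbp) can be illustrated with a short sketch. It assumes the alignment has already been reduced to one boolean per aligned column (True for a match); that representation, the function name `windowed_identity`, and the treatment of gaps as mismatches are assumptions made for illustration, not details given in the caption.

```python
def windowed_identity(matches, window=5000, step=1000):
    """Compute alignment identity in sliding windows.

    `matches` is a sequence of booleans, one per alignment column
    (assumed representation). Returns (window_start, identity) pairs
    for every full window; a trailing partial window is skipped.
    """
    results = []
    for start in range(0, len(matches) - window + 1, step):
        chunk = matches[start:start + window]
        results.append((start, sum(chunk) / window))
    return results


# Toy example with a 4-column window sliding by 2 columns.
toy = [True] * 6 + [False] * 4
for start, identity in windowed_identity(toy, window=4, step=2):
    print(start, identity)
```

With the default arguments, consecutive windows overlap by 4 kbp, which smooths the identity track while still resolving changes at roughly 1-kbp resolution.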

  • Fig. 5 Improved mobile element resolution.

    (Left) PTERV1 and SVA insertion length and percent identity distributions in Susie3 (blue) and gorGor3 (red). The PTERV1 and SVA elements in gorGor3 are biased toward short but on average higher identity alignments to the consensus sequence because the more divergent long terminal repeat sequences are not resolved. (Right) The mean insertion lengths for gorGor3 and Susie3 are, respectively, 2194.93 and 7565.85 for PTERV1 (medians 1223 and 7725) and 1240.1 and 1965.63 for SVA (medians 1162 and 1909).

  • Fig. 6 Population genetic analyses.

    (A) Density of average divergence within 1-Mbp windows between human (GRCh38) and gorGor3, Susie3, or chimpanzee (panTro4) autosomes. (B) A comparison of human–gorGor3 and human–Susie3 divergence over 1-Mbp windows. The x axis is Alu coverage in each window, and the y axis is the difference in human-gorilla divergence between gorGor3 and Susie3. Positive y axis values indicate increased human–Susie3 divergence relative to human–gorGor3. The increased divergence of human–gorGor3 correlates with Alu content (slope, –0.0044094; intercept, 0.0001486; Pearson’s correlation, –0.60). (C) The effective population size (Ne) shown over time. A PSMC model was applied to the western lowland gorilla based on different genome assemblies. Illumina genome sequence data from western lowland gorillas (Abe, Amani, Coco, Tzambo) were mapped against gorGor3 (green) and Susie3 (orange), and PSMC was fit to the genome alignments (-N25 -t15 -r5 -b -p "4+25*2+4+6"; mutation rate = 1.25 × 10⁻⁸; generation time = 19 years). There are 100 bootstrap replicates for each gorilla and model. (D) The distribution of the bootstrap intervals that overlap 50 ka and 5 Ma. At 50 ka, Susie3 estimates of the effective population size are significantly higher than those for gorGor3; the inverse pattern is true for 5 Ma. All differences between Susie3 and gorGor3 are significant (***P ≤ 0.0001; Welch two-sample t test).
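The PSMC pattern string in (C) is compact, so a small sketch of how it expands may help. Following the convention described in the PSMC README, a term such as `25*2` denotes 25 free parameters that each span 2 atomic time intervals, and a bare number denotes one parameter spanning that many intervals. The helper `parse_psmc_pattern` is hypothetical and written only for illustration; it is not part of PSMC.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC -p pattern into the number of atomic intervals
    spanned by each free interval parameter.

    Hypothetical helper, not part of the PSMC software.
    """
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            count, width = term.split("*")
            spans.extend([int(width)] * int(count))
        else:
            spans.append(int(term))
    return spans


spans = parse_psmc_pattern("4+25*2+4+6")
print(len(spans))  # free interval parameters: 28
print(sum(spans))  # atomic time intervals: 64
```

Under this reading, the pattern used here fits 28 free interval parameters over 64 atomic time intervals, with wider spans at the most recent and most ancient ends of the time axis.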

  • Table 1 Gorilla assembly statistics.

    Statistic                     gorGor3         Susie3
    Total genome length (bp)      3,035,660,144   3,080,414,926
    Number of contigs             464,874         16,073
    Total sequence (bp)           2,828,866,575   3,080,414,926
    Placed contig length (bp)     2,718,960,062   2,790,620,487
    Unplaced contig length (bp)   109,906,513     289,794,439
    Maximum contig length (bp)    191,556         36,219,563
    Contig N50 (bp)               11,661          9,558,608
    Number of scaffolds           57,196          554
    Maximum scaffold length       10,247,101*     110,018,866
    Scaffold N50 (bp)             913,458         23,141,960

    *Values are taken from the previously published gorilla genome paper (4).
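For readers unfamiliar with the N50 statistic reported above, it is the length L such that contigs (or scaffolds) of length at least L contain half of the total assembled bases. A minimal sketch of the standard calculation follows; the function name `n50` and the toy lengths are illustrative only and are not taken from the assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (bp)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 32, so N50 is the length at which the running sum
# of the largest contigs first reaches 16.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

Because the statistic is weighted by length, a small number of very long contigs can raise N50 substantially even when many short contigs remain.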

    • Table 2 Gorilla genome structural variants.

      Contigs greater than 200 kbp were mapped to GRCh38 by using BLASR (30). Nonrepetitive sequences contained at most 70% of sequence annotated as repeat by RepeatMasker (3.3.0) (28) or Tandem Repeats Finder 4.07b (29). Mosaic repeats are defined as one or more different repeat annotations; however, mosaic structural variants composed solely of Alu are listed separately because of their frequency.

                                Count   Fixed   Average               Total bases   Count   Fixed   Average               Total bases
      Complex, not repetitive   15,383  13,190  709.31     1807.29    10,911,284    13,749  10,250  763.17     1,954      10,492,765
      Complex, repetitive       7,552   7,243   2716.96    2746.63    20,518,484    7,450   5,934   2757.39    2906.61    20,542,529
      Tandem Repeats            7,351   4,520   367.74     811.03     2,703,261     7,828   4,225   323.38     558.5      2,531,424
      Not base-pair resolved    41      N/A     49,496     23817.46   2,029,323     61      N/A     55,971     43,496     3,414,209

    Supplementary Materials

    • Long-read sequence assembly of the gorilla genome

      David Gordon, John Huddleston, Mark J. P. Chaisson, Christopher M. Hill, Zev N. Kronenberg, Katherine M. Munson, Maika Malig, Archana Raja, Ian Fiddes, LaDeana W. Hillier, Christopher Dunn, Carl Baker, Joel Armstrong, Mark Diekhans, Benedict Paten, Jay Shendure, Richard K. Wilson, David Haussler, Chen-Shan Chin, Evan E. Eichler

      Materials/Methods, Supplementary Text, Tables, Figures, and/or References

      • Supplementary Text
      • Figs. S1 to S63
      • Tables S1, S4-S6, S10-S13, S15, S21, S26, S27, S33
      • Full Reference List
      Tables S2, S3, S7 to S9, S14, S16 to S20, S22 to S25, S28 to S32, S34, and S35
      Table S36