Research Article

Quantitative analysis of population-scale family trees with millions of relatives

Science  13 Apr 2018:
Vol. 360, Issue 6385, pp. 171-175
DOI: 10.1126/science.aam9309
  • Fig. 1 Overview of the collected data.

    (A) The basic algorithmic steps to form valid pedigree structures from the input data available via the Geni API. Gray, profiles; red, marriages. See fig. S2 for a comprehensive overview. The last step shows an example of a real pedigree from the website with ~6000 individuals spanning about seven generations. (B) Size distribution of the largest 1000 family trees after data cleaning, sorted by size.

  • Fig. 2 Analysis and validation of demographic data.

    (A) Distribution of life expectancy per year. Colors correspond to the frequency of profiles of individuals who died at a certain age for each year. Asterisks indicate deaths at military age in the Civil War and First and Second World Wars. (B) Expected life span in Geni (black) and the Oeppen and Vaupel study [red (27)] as a function of year of death. (C) Comparison of the life-span distributions in Geni (black) and HMD (red). See also fig. S5A. (D) Geographic distribution of the annotated place-of-birth information. Every pixel corresponds to a profile in the data set. (E) Validation of geographical assignment by historical trends. Top: Cumulative distribution of profiles since 1500 for each city on a logarithmic scale as a function of time. Bottom: Year of first settlement in the city.

  • Fig. 3 The genetic architecture of longevity.

    (A) Regression (red) of child longevity on mid-parent longevity (longevity is defined as the difference between age at death and expected life span). Black squares, average longevity of children binned by the mid-parent value; gray bars, estimated 95% confidence interval (CI). (B) Estimated narrow-sense heritability (red) with 95% CIs (black bars) obtained by the mid-parent design, stratified by the average decade of birth of the parents. (C) Correlation of a trait as a function of IBD under strict additive (h², orange), squared (V_AA, purple), and cubic (V_AAA, green) epistasis architectures after dominance adjustments. (D) Average longevity correlation as a function of IBD (black circles) grouped in 5% increments (gray: 95% CI) after adjusting for dominance. A dashed line denotes the extrapolation of the models toward monozygotic twins from the Danish Twin Registry (red circle).
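
The mid-parent design in panels (A) and (B) can be illustrated with a minimal sketch. Under a purely additive model with random mating, the least-squares slope of child values on the average of the two parents estimates narrow-sense heritability. The function names `midparent_slope` and `simulate_trios`, the simulated data, and the value h² = 0.2 are illustrative assumptions, not the authors' pipeline or results.

```python
import numpy as np


def midparent_slope(father, mother, child):
    """Least-squares slope of child values on mid-parent values."""
    midparent = (np.asarray(father, dtype=float) + np.asarray(mother, dtype=float)) / 2.0
    child = np.asarray(child, dtype=float)
    cov = np.cov(midparent, child, ddof=1)
    # slope = cov(midparent, child) / var(midparent)
    return cov[0, 1] / cov[0, 0]


def simulate_trios(n, h2, seed=0):
    """Hypothetical additive model: phenotype = additive value + environment.

    Phenotypic variance is 1, so the additive variance equals h2.
    """
    rng = np.random.default_rng(seed)
    a_father = rng.normal(0.0, np.sqrt(h2), n)
    a_mother = rng.normal(0.0, np.sqrt(h2), n)
    # Child additive value: parental mean plus segregation noise (variance h2 / 2).
    a_child = (a_father + a_mother) / 2.0 + rng.normal(0.0, np.sqrt(h2 / 2.0), n)
    env_sd = np.sqrt(1.0 - h2)
    father = a_father + rng.normal(0.0, env_sd, n)
    mother = a_mother + rng.normal(0.0, env_sd, n)
    child = a_child + rng.normal(0.0, env_sd, n)
    return father, mother, child


# Arbitrary simulation value, chosen only to check that the slope recovers it.
father, mother, child = simulate_trios(n=200_000, h2=0.2)
h2_hat = midparent_slope(father, mother, child)
print(f"estimated h2 = {h2_hat:.3f}")
```

In real genealogical data the inputs would be longevity values already adjusted for expected life span, as the caption describes; the sketch skips that step.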

  • Fig. 4 Analysis of familial dispersion.

    (A) Median distance [log10(x + 1)] of father-offspring places of birth (cyan), mother-offspring (red), and marital radius (black) as a function of time (average year of birth). (B) Rate of change in the country of birth for father-offspring (cyan) or mother-offspring (red) stratified by major geographic areas. (C) Average IBD (log2) between couples as a function of average year of birth. Individual dots represent the measured average per year; the black line denotes the smooth trend using locally weighted regression. (D) IBD of couples as a function of marital radius. Each dot represents a year from 1650 to 1950. The blue line denotes the best linear regression line in log-log space.
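
Two of the transformations named in this caption can be sketched briefly: the log10(x + 1) scale used for distances in panel (A) and a linear fit in log-log space as in panel (D). The helper names and all numbers below are hypothetical and chosen only so the result is easy to verify; they are not values from the study.

```python
import numpy as np


def median_log_distance(distances_km):
    """Median of log10(x + 1); the +1 keeps zero distances finite."""
    return float(np.median(np.log10(np.asarray(distances_km, dtype=float) + 1.0)))


def fit_loglog(x, y):
    """Fit log10(y) = slope * log10(x) + intercept by least squares."""
    slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
    return float(slope), float(intercept)


# Hypothetical distances (km): log10(x + 1) gives 0, 1, 2, so the median is 1.
print(median_log_distance([0.0, 9.0, 99.0]))

# Hypothetical power law y = 0.01 * x**-0.5, which is a straight line in log-log space.
radius_km = np.array([5.0, 10.0, 20.0, 40.0, 80.0])
ibd = 0.01 * radius_km ** -0.5
slope, intercept = fit_loglog(radius_km, ibd)
print(slope, intercept)
```

A straight line in log-log space corresponds to a power-law relation, so the fitted slope is the exponent and the intercept is the log10 of the prefactor.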

    A time lapse of the birth places of the Geni data from 1400 to 1900 in jumps of five years. Each colored pixel corresponds to a genealogical profile and the intensity indicates the number of profiles. Prominent colonization events are noted.

