Research Article

Quantitative analysis of population-scale family trees with millions of relatives


Science  13 Apr 2018:
Vol. 360, Issue 6385, pp. 171-175
DOI: 10.1126/science.aam9309

Quantitative analysis of millions of relatives

Human relationships, as documented by family trees, can elucidate the heritability of a host of medical and biological parameters. Kaplanis et al. collected 86 million publicly available profiles from a crowd-sourced genealogy website and used them to examine the genetic architecture of human longevity and migration patterns (see the Perspective by Lussier and Keinan). Various models of inheritance suggested that life span is predominantly attributable to additive genetic effects, with a smaller component from dominant genetic inheritance. The data also suggested that relatedness between individuals is less attributable to advances in human transportation than to cultural changes.

Science, this issue p. 171; see also p. 153

Abstract

Family trees have vast applications in fields as diverse as genetics, anthropology, and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. We collected 86 million profiles from publicly available online data shared by genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of human longevity and to provide insights into the geographical dispersion of families. We also report a simple digital procedure to overlay other data sets with our resource.

Family trees are mathematical graph structures that can capture mating and parenthood among humans. As such, the edges of the trees represent potential transmission lines for a wide variety of genetic, cultural, sociodemographic, and economic factors. Quantitative genetics is built on dissecting the interplay of these factors by overlaying data on family trees and analyzing the correlation of various classes of relatives (1–3). In addition, family trees can serve as a multiplier for genetic information through study designs that leverage genotype or phenotype data from relatives (4–7), analyzing parent-of-origin effects (8), refining heritability measures (9, 10), or improving individual risk assessment (11, 12). Beyond classical genetic applications, large-scale family trees have played an important role across disciplines, including human evolution (13, 14), anthropology (15), and economics (16).

Despite the range of applications, constructing population-scale family trees has been a labor-intensive process. Previous approaches mainly relied on local data repositories such as churches or vital-records offices (14, 17, 18). But these approaches have limitations (19, 20): They require nontrivial resources to digitize the records and organize the data, the resulting trees are usually limited in geographical scope, and the data may be subject to strict usage protections. These challenges reduce demographic accessibility and complicate fusion with information such as genomic or health data.

Constructing and validating population-scale family trees

Here, we leveraged genealogy-driven social media data to construct population-scale family trees. To this end, we focused on Geni.com, a crowdsourcing website in the genealogy domain. Users can create individual profiles and upload family trees. The website automatically scans profiles to detect similarities and offers the option to merge the profiles when a match is detected. By merging, larger family trees are created that can be collaboratively comanaged to improve their accuracy. After obtaining relevant permissions, we downloaded approximately 86 million publicly available profiles (21). The input data consisted of millions of individual profiles, each of which describes a person; for 43 million of these profiles, the data also included any putative connections to other individuals in the data set. Similar to other crowdsourcing projects (22), a small group of participants contributed the majority of genealogy profiles (fig. S1).

We organized the profiles into graph topologies that preserve the genealogical relationships between individuals (Fig. 1A). Biology dictates that a family tree should form a directed acyclic graph, where each individual has an in-degree that is less than or equal to 2. However, 0.3% of the profiles resided in invalid biological topologies that included cycles (e.g., a person who is both the parent and child of another person) or an individual with more than two parents. We developed an automated pipeline to resolve local conflicts and prune invalid topologies (fig. S2) and benchmarked the performance of the pipeline against human genealogists (21). This resulted in >90% concordance between the pipeline and human decisions to resolve conflicts, thereby generating 5.3 million disjoint family trees.
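The two biological constraints described above (no cycles, at most two parents per individual) can be checked mechanically. The following is a minimal sketch only, not the authors' pipeline (which is described in fig. S2 and the supplementary materials): the function name `find_invalid_profiles` and the `(parent, child)` edge-list input are assumptions made for illustration, and the cycle check uses Kahn's topological-sort algorithm as a stand-in technique.

```python
from collections import defaultdict


def find_invalid_profiles(parent_child_edges):
    """Flag profiles that violate basic pedigree constraints.

    parent_child_edges: list of (parent, child) pairs (hypothetical input format).
    Returns (too_many_parents, cyclic_or_downstream):
      - too_many_parents: individuals with in-degree greater than 2
      - cyclic_or_downstream: individuals on a directed cycle, or descended from one
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for parent, child in parent_child_edges:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((parent, child))

    too_many_parents = {n for n in nodes if len(parents[n]) > 2}

    # Kahn's algorithm: repeatedly peel off individuals whose parents
    # have all been removed. Whatever cannot be peeled off sits on a
    # cycle or is a descendant of a cycle.
    remaining_parents = {n: len(parents[n]) for n in nodes}
    queue = [n for n in nodes if remaining_parents[n] == 0]
    removed = set()
    while queue:
        n = queue.pop()
        removed.add(n)
        for c in children[n]:
            remaining_parents[c] -= 1
            if remaining_parents[c] == 0:
                queue.append(c)

    cyclic_or_downstream = nodes - removed
    return too_many_parents, cyclic_or_downstream
```

A check like this only detects the conflicts. Deciding which edge to drop is the harder step, and it is the one the authors benchmarked against human genealogists.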

Fig. 1 Overview of the collected data.

(A) The basic algorithmic steps to form valid pedigree structures from the input data available via the Geni API. Gray, profiles; red, marriages. See fig. S2 for a comprehensive overview. The last step shows an example of a real pedigree from the website with ~6000 individuals spanning about seven generations. (B) Size distribution of the largest 1000 family trees after data cleaning, sorted by size.

The largest family tree in the processed data spanned 13 million individuals who were connected by shared ancestry and marriage (Fig. 1B). On average, the tree spanned 11 generations between each terminal descendant and their founders (fig. S3). The size of this pedigree fits what is expected as familial genealogies coalesce at a logarithmic rate compared to the size of the population (23).

We evaluated the structure of the tree by inspecting the genetic segregation of unilineal markers. We obtained mitochondrial DNA (mtDNA) and Y-chromosome short tandem repeat (Y-STR) haplotypes to compare multiple pairs of relatives in our graph (21). The mtDNA data were available for 211 lineages and spanned a total of 1768 transmission events (i.e., graph edges), whereas the Y-STR data were available for 27 lineages that spanned 324 total transmission events. Using a prior of no more than a single nonpaternity event per lineage, we estimated a nonmaternity rate of 0.3% per meiosis and a nonpaternity rate of 1.9% per meiosis. This nonpaternity rate matched the rates reported in previous Y-chromosome studies (24, 25), and the nonmaternity rate was close to historical rates of adoption of an unrelated member in the United States (26). Taken together, these results show that millions of genealogists can produce high-quality population-scale family trees.

Extracting demographic data

We found that life span in the Geni.com profiles was largely concordant with reports generated by traditional demographic approaches. First, we extracted demographic information from the collected profiles with exact birth and death dates, thereby avoiding the problems inherent in profiles with only year resolution for these events, such as heaping at round years (fig. S4). The data reflected historical events and trends, such as elevated death rates at military age during the American Civil War and First and Second World Wars and a reduction in child mortality during the 20th century (Fig. 2A). We compared the average life span in our collection to a worldwide historical analysis covering the years 1840 to 2000 (27). We found an R² value of 0.95 between the expected life span from historical data and the Geni data set (Fig. 2B) and a 98% concordance with historical distributions reported by the Human Mortality Database (HMD) (Fig. 2C and fig. S5).

Fig. 2 Analysis and validation of demographic data.

(A) Distribution of life expectancy per year. Colors correspond to the frequency of profiles of individuals who died at a certain age for each year. Asterisks indicate deaths at military age in the Civil War and First and Second World Wars. (B) Expected life span in Geni (black) and the Oeppen and Vaupel study [red (27)] as a function of year of death. (C) Comparison of the life-span distributions in Geni (black) and HMD (red). See also fig. S5A. (D) Geographic distribution of the annotated place-of-birth information. Every pixel corresponds to a profile in the data set. (E) Validation of geographical assignment by historical trends. Top: Cumulative distribution of profiles since 1500 for each city on a logarithmic scale as a function of time. Bottom: Year of first settlement in the city.

Next, we extracted the geographic locations of life events by two approaches: an automated geoparsing pipeline and structured text manually curated and approved by genealogists (21) (fig. S6A). Overall, we were able to place about 16 million profiles into longitude/latitude coordinates, typically at fine-scale geographic resolution, without major differences in quality between the automated geoparsing and manual curations for subsequent analyses (fig. S6B) (21). The profiles were distributed across a wide range of locations in the Western world (Fig. 2D and fig. S7), with 55% from Europe and 30% from North America. We analyzed profiles in 10 cities across the globe and found that the first appearance of profiles was only after the known first settlement date for nearly all of the cities, suggesting good spatiotemporal assignment of profiles (Fig. 2E). Movie S1 presents the place of birth of individuals in the Geni data set in 5-year intervals from 1400 to 1900 along with known migration events.

We were concerned that the Geni.com profiles might suffer from certain socioeconomic ascertainment biases and therefore would not reflect the local population. To evaluate this concern, we collected ~80,000 publicly available death certificates from the Vermont Department of Health for every death in that state between 1985 and 2010. These records have extensive information for each individual, including education level, place of birth, and a cause of death in an ICD-9 code. About 1000 individuals in Geni overlapped this death certificate collection. We compared the education level, birth state, and ICD-9 code between these ~1000 Geni profiles and the entire Vermont collection. For all three parameters, we found >98% concordance between the distribution of these key sociodemographic attributes in the Geni profiles in Vermont and the entire state of Vermont (tables S1 to S3). Overall, this high level of consistency argues against severe socioeconomic ascertainment. Table S4 reports key demographic and genetic attributes for various familial relationships from parent-child via great-great-grandparents to fourth cousins.

Characterizing the genetic architecture of longevity

We leveraged the Geni data set to characterize the genetic architecture of human longevity, which exhibits complex genetics likely to involve a range of physiological and behavioral endophenotypes (28, 29). Narrow-sense heritability (h²) of longevity has been estimated to be around 15 to 30% (table S5) (30–35). Genome-wide association studies have had limited success in identifying genetic variants associated with longevity (36–38). This relatively large proportion of missing heritability can be explained by the following: (i) Longevity has nonadditive components that create upward bias in estimates of heritability (39), (ii) estimators of heritability are biased as a result of unaccounted environmental effects (10), and (iii) the trait is highly polygenic and requires larger cohorts to identify the underlying variants (40). We thus sought to harness our resource and build a model for the sources of genetic variance in longevity that jointly evaluates additivity, dominance, epistasis, shared household effects, spatiotemporal trends, and random noise.

We adjusted longevity to be the difference between age of death and expected life span, using a model that we trained with 3 million individuals. Our model includes spatiotemporal and sex effects and was the best among 10 different models that adjusted various spatiotemporal attributes (fig. S8). We also validated this model by estimating h² according to the mid-parent design (41) with nearly 130,000 parent-child trios. This process yielded h²_mid-parent = 12.2% (SE = 0.4%) (Fig. 3A), which is on the lower end but in the range of previous heritability estimates (table S5). Consistent with previous studies, we did not observe any temporal trend in mid-parent heritability (Fig. 3B).
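In the mid-parent design, narrow-sense heritability is estimated as the slope of the regression of the child's trait value on the average of the two parents' values. The sketch below illustrates that textbook estimator with an ordinary least-squares slope. It is not the authors' code: the function name and the `(father, mother, child)` tuple format are assumptions, and the inputs are taken to be already-adjusted longevity values (age at death minus expected life span).

```python
def midparent_heritability(trios):
    """Estimate h^2 as the OLS slope of child on mid-parent value.

    trios: iterable of (father, mother, child) adjusted longevity values
           (hypothetical input format; units are years relative to expectation).
    """
    trios = list(trios)
    midparents = [(father + mother) / 2 for father, mother, _ in trios]
    offspring = [child for _, _, child in trios]

    n = len(trios)
    mean_x = sum(midparents) / n
    mean_y = sum(offspring) / n

    # Slope = Cov(mid-parent, child) / Var(mid-parent)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparents, offspring))
    var_x = sum((x - mean_x) ** 2 for x in midparents)
    return cov_xy / var_x
```

The slope is a valid heritability estimate only to the extent that parents and children do not share environmental effects, which is why the adjustment for spatiotemporal and sex effects precedes this step.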

Fig. 3 The genetic architecture of longevity.

(A) Regression (red) of child longevity on its mid-parent longevity (defined as difference between age of death and expected life span). Black squares, average longevity of children binned by the mid-parent value; gray bars, estimated 95% confidence interval (CI). (B) Estimated narrow-sense heritability (red) with 95% confidence intervals (black bars) obtained by the mid-parent design stratified by the average decade of birth of the parents. (C) Correlation of a trait as a function of IBD under strict additive (h², orange), squared (V_AA, purple), and cubic (V_AAA, green) epistasis architectures after dominancy adjustments. (D) Average longevity correlation as a function of IBD (black circles) grouped in 5% increments (gray: 95% CI) after adjusting for dominancy. A dashed line denotes the extrapolation of the models toward monozygotic twins from the Danish Twin Registry (red circle).

We partitioned the source of genetic variance of longevity using more than 3 million pairs of relatives from full sibling to fourth cousin (21). We measured the variance explained by an additive component, a pairwise epistatic model, three-way epistasis, and dominancy (Fig. 3C). These 3 million pairs were all sex-concordant to address residual sex differences not accounted for by our longevity adjustments (fig. S9) and do not include relatives who are likely to have died because of environmental catastrophes or in major wars (fig. S10); this mitigated correlations due to nongenetic factors. We also refined the genetic correlation of the relatives by considering multiple genealogical paths (figs. S11 to S13).
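Accounting for multiple genealogical paths matters because two relatives can be connected through more than one common ancestor. A standard way to sum over all paths is the recursive kinship coefficient, sketched below. This is an illustration of the general technique rather than the authors' implementation (see figs. S11 to S13 for theirs). The `parents` dictionary format and the name `make_kinship` are assumptions. For non-inbred individuals, the expected IBD sharing is twice the kinship coefficient.

```python
from functools import lru_cache


def make_kinship(parents):
    """Build a kinship function for a pedigree.

    parents: dict mapping individual -> (father, mother); founders are
             absent from the dict, and an unknown parent is None
             (hypothetical input format).
    """

    @lru_cache(maxsize=None)
    def depth(i):
        # Number of generations between i and its most distant known founder.
        known = [p for p in parents.get(i, (None, None)) if p is not None]
        return 0 if not known else 1 + max(depth(p) for p in known)

    @lru_cache(maxsize=None)
    def kinship(i, j):
        if i is None or j is None:
            return 0.0
        # Always recurse on the deeper individual, which cannot be an
        # ancestor of the other one.
        if depth(i) < depth(j):
            i, j = j, i
        father, mother = parents.get(i, (None, None))
        if i == j:
            # Self-kinship is 1/2, increased if the parents are related.
            return 0.5 * (1.0 + kinship(father, mother))
        if father is None and mother is None:
            return 0.0  # two distinct founders are assumed unrelated
        return 0.5 * (kinship(father, j) + kinship(mother, j))

    return kinship
```

For example, with `parents = {"s1": ("F", "M"), "s2": ("F", "M")}`, `make_kinship(parents)("s1", "s2")` gives a kinship of 0.25, i.e., an expected IBD sharing of 50% for full siblings. The plain recursion is only suitable for small pedigrees; trees with millions of individuals need sparse or iterative methods.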

The analysis of longevity in these 3 million pairs of relatives showed a robust additive genetic component, a small impact of dominance, and no detectable epistasis (Fig. 3D and table S6) (21). Additivity was highly significant (P_additive < 10⁻³¹⁸) with an estimated h²_sex-concordant/relatives = 16.1% (SE = 0.4%), similar to the heritability estimated from sex-concordant parent-child pairs, h²_concordant/parent-child = 15.0% (SE = 0.4%). The maximum-likelihood estimate for dominance was around 4%, but the epistatic terms converged to zero despite the substantial amount of data. Other model selection procedures, such as mean squared error analysis and Bayesian information criterion, argued against a pervasive epistatic contribution to longevity variance in the population (21).

We tested the ability of our model to predict the longevity correlation of an orthogonal data set of 810 monozygotic twin pairs collected by the Danish Twin Registry (Fig. 3D) (42). Our inferred model for longevity accurately predicted the observed correlation of this twin cohort with 1% difference, well within the sampling error for the mean twin correlation (SE = 3.2%). We also evaluated an extensive array of additional analyses that included various adjustments for environmental components and other confounders (figs. S14 and S15) (21). In all cases, additivity explained 15.8 to 16.9% of the longevity estimates, dominance explained 2 to 4%, and no evidence for epistatic interactions could be detected using our procedure.
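To see how variance components translate into an expected correlation between relatives, the sketch below uses the textbook decomposition in which the additive term scales with the expected IBD sharing r, the dominance term with the probability Δ of sharing both alleles IBD, and pairwise and three-way additive epistasis with r² and r³. This is an assumption-laden illustration, not the authors' fitting procedure: the function name is hypothetical, shared environment is ignored, and the default values h² = 0.161 and a dominance component of 0.04 are simply the point estimates quoted in the text.

```python
def expected_correlation(r, delta, h2=0.161, d2=0.04, v_aa=0.0, v_aaa=0.0):
    """Expected trait correlation for a pair of relatives.

    r:     expected fraction of the genome shared IBD (coefficient of relationship)
    delta: probability of sharing both alleles IBD at a locus
    h2:    additive variance fraction (narrow-sense heritability)
    d2:    dominance variance fraction
    v_aa:  pairwise additive-by-additive epistatic variance fraction
    v_aaa: three-way additive epistatic variance fraction
    """
    return r * h2 + delta * d2 + r ** 2 * v_aa + r ** 3 * v_aaa


# (r, delta) for a few standard relationships in a non-inbred population
RELATIONSHIPS = {
    "monozygotic twins": (1.0, 1.0),
    "full siblings": (0.5, 0.25),
    "parent-child": (0.5, 0.0),
    "first cousins": (0.125, 0.0),
}

if __name__ == "__main__":
    for name, (r, delta) in RELATIONSHIPS.items():
        print(f"{name}: {expected_correlation(r, delta):.4f}")
```

Because epistatic terms scale with powers of r, they would contribute far more to the twin correlation than to correlations among distant relatives. This is why extrapolating a model fitted on siblings through fourth cousins to monozygotic twins is a meaningful check.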

We also estimated the additive and epistatic components using a method that allows rapid estimation of variance components of extremely large relationship matrices, called sparse Cholesky factorization linear mixed models (Sci-LMM) (43). This method takes into account a kinship coefficient matrix of 250 million pairs of related individuals in the Geni data set and includes adjustments for population structure, sex, and year of birth. We observed an additivity of 17.8% (SE = 0.84%) and a pairwise epistatic component that was not significantly different from zero (21).

Taken together, our results across multiple study designs (fig. S16) indicate that the limited ability of genome-wide association studies so far to associate variants with longevity cannot be attributed to statistical epistasis. Note that this does not rule out the existence of molecular interactions between genes contributing to this trait (44–47). On the basis of a large number of data points and study designs, we measured an additive component (h² ≈ 16%) that is considerably smaller than the 25% figure that is generally cited in the literature. These results indicate that previous studies are likely to have overestimated the heritability of longevity. As such, we should lower our expectations about our ability to predict longevity from genomic data and presumably to identify causal genetic variants.

Assessment of theories of familial dispersion

Familial dispersion is a major driving force of various genetic, economic, and demographic processes (48). Previous work has primarily relied on vital records from a limited geographical scope (49, 50) or used indirect inference from genetic data sets that mainly illuminate distant historical events (51).

We harnessed our resource to evaluate patterns of human migration. First, we analyzed sex-specific migration patterns (21) to resolve conflicting results regarding sex bias in human migration (52). Our results indicate that in Western societies, females migrate more than males but over shorter distances. Median mother-child distances were significantly larger than median father-child distances by a factor of 1.6 (Wilcoxon, one-tailed, P < 10⁻⁹⁰) (Fig. 4A). This trend appeared throughout the 300 years of our analysis window, including in the most recent birth cohort, and was observed both in North American duos (Wilcoxon, one-tailed, P < 10⁻²³) and European duos (Wilcoxon, one-tailed, P < 10⁻⁸⁷). On the other hand, we found that average mother-child distances (fig. S17) were significantly shorter than average father-child distances (t test, P < 10⁻⁹⁰), which suggests that long-range migration events are biased toward males. Consistent with this pattern, fathers were significantly (P < 10⁻⁸³) more likely than mothers to have been born in a different country than their offspring (Fig. 4B). Again, this pattern was evident when restricting the data to North American or European duos. Taken together, males and females in Western societies show different migration distributions; patrilocality occurs only in relatively local migration events, and large-scale events that usually involve a change of country are more common in males than in females.

Fig. 4 Analysis of familial dispersion.

(A) Median distance [log₁₀(x + 1)] of father-offspring places of birth (cyan), mother-offspring (red), and marital radius (black) as a function of time (average year of birth). (B) Rate of change in the country of birth for father-offspring (cyan) or mother-offspring (red) stratified by major geographic areas. (C) Average IBD (log₂) between couples as a function of average year of birth. Individual dots represent the measured average per year; the black line denotes the smooth trend using locally weighted regression. (D) IBD of couples as a function of marital radius. Each dot represents a year between 1650 and 1950. The blue line denotes the best linear regression line in log-log space.

Next, we inspected the marital radius (the distance between mates’ places of birth) and its effect on the genetic relatedness of couples (21). The isolation-by-distance theory of Malécot predicts that increases in the marital radius should exponentially decrease the genetic relatedness of individuals (53). But the magnitude of these forces is also a function of factors such as taboos against cousin marriages (54).
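Since the marital radius is defined as the distance between the mates' places of birth, it can be computed directly from birth coordinates. The sketch below uses the haversine great-circle formula on a spherical Earth, which is an assumption made for illustration; the source does not state which distance formula the authors used. The function names and the `((lat, lon), (lat, lon))` input format are likewise hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def median_marital_radius(couples):
    """Median birthplace distance over couples.

    couples: iterable of ((lat, lon), (lat, lon)) birthplace pairs, one per couple.
    """
    return median(haversine_km(*a, *b) for a, b in couples)
```

Using the median rather than the mean keeps a handful of intercontinental marriages from dominating the summary for a given birth cohort.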

We started by analyzing temporal changes in the birth locations of couples in our cohort. Before the Industrial Revolution (earlier than 1750), most marriages occurred between people born only 10 km from each other (Fig. 4A, black line). Similar patterns were found when analyzing European-born individuals (fig. S18) or North American–born individuals (fig. S19). After the beginning of the second Industrial Revolution (1870), the marital radius rapidly increased and reached ~100 km for most marriages in the birth cohort in 1950. Next, we analyzed the expected identity-by-descent (IBD) of couples as measured by tracing their genealogical ties (Fig. 4C). Between 1650 and 1850, the average IBD of couples was relatively stable and on the order of fourth cousins, whereas IBD exhibited a rapid decrease after 1850. Overall, the median marital radius for each year showed a strong correlation (R² = 72%) with the expected IBD between couples. Every 70-km increase in the marital radius correlated with a decrease in the genetic relatedness of couples by one meiosis event (Fig. 4D). This correlation matches previous isolation-by-distance forces in continental regions (55). However, this trend is not consistent over time and exhibits three phases. For the pre-1800 birth cohorts, the correlation between marital distance and IBD was insignificant (P > 0.2) and weak (R² = 0.7%) (fig. S20A). Couples born around 1800 to 1850 showed a doubling of their marital distance, from 8 km in 1800 to 19 km in 1850. Marriages usually occur about 20 to 25 years after birth, and around this time (1820 to 1875) rapid transportation changes took place, such as the advent of railroad travel in most of Europe and the United States. However, the increase in marital distance was significantly (P < 10⁻¹³) coupled with an increase in genetic relatedness, contrary to the isolation-by-distance theory (fig. S20B). Only for the cohorts born after 1850 did the data match (R² = 80%) the theoretical model of isolation by distance (fig. S20C).

Taken together, the data show a 50-year lag between the advent of increased familial dispersion and the decline of genetic relatedness between couples. During this time, individuals continued to marry relatives despite the increased distance. From these results, we hypothesize that changes in 19th-century transportation were not the primary cause for decreased consanguinity. Rather, our results suggest that shifting cultural factors played a more important role in the recent reduction of genetic relatedness of couples in Western societies.

Discussion

In this work, we leveraged genealogy-driven media to build a data set of human pedigrees of massive scale that covers nearly every country in the Western world. Multiple validation procedures indicated that it is possible to obtain a data set that has similar quality to traditionally collected studies, but at much greater scale and lower cost.

We envision that this and similar large data sets can address quantitative aspects of human families, including genetics, anthropology, public health, and economics. Our tree and demographic data are available in a de-identified format, enabling static analysis of the Geni data set. We also offer a dynamic method that enables fusing other data sets with our data, based on digital consent of participants using the Geni application programming interface (API) (fig. S21) (21). We have been using this one-click mechanism to overlay thousands of genomes with family trees on DNA.Land (56). Other projects can use a similar strategy to add large pedigrees to their existing data collection.

More generally, similar to previous studies (57, 58), our work demonstrates the synergistic power of a collaboration between basic research and consumer genetic genealogy data sets. With ever-growing digitization of humanity and the rise of consumer genetics (59), we believe that such collaborative efforts can be a valuable path to reach the scale of information needed to address fundamental questions in biomedical research.

Supplementary Materials

www.sciencemag.org/content/360/6385/171/suppl/DC1

Materials and Methods

Figs. S1 to S21

Tables S1 to S6

Movie S1

References (60–79)

References and Notes

  1. See supplementary materials.
Acknowledgments: We thank D. Zielinski, G. Japhet, and J. Novembre for valuable comments, the Erlich lab members for constant support in pursuing this project, and the Vermont Health Department for providing all death certificates. This study was supported by a generous gift from Andria and Paul Heafy (Y.E.), the Burroughs Wellcome Fund Career Awards at the Scientific Interface (Y.E.), the Broad Institute’s SPARC: Catalytic Funding for Novel Collaborative Projects award (Y.E. and D.G.M.), NIH grants R01 MH101244 and R03 HG006731 (A.L.P.), and Israeli Science Foundation grant 1678/12 (D.G.). Author contributions: A.G. and Y.E. conducted the downloading, indexing, and organizing of the data; J.K., A.G., M.W., B.M., M.Ge., M.S., and Y.E. developed the procedures to clean the family trees and extract demographic information; J.K., T.S., O.W., D.G., M.Gy., G.B., D.G.M., A.L.P., and Y.E. were involved in analyzing the genetic architecture of longevity; J.K., M.W., and Y.E. conducted the analysis of human migration; and J.K., T.S., O.W., D.G.M., A.L.P., and Y.E. wrote the manuscript. T.S. and Y.E. became employees of MyHeritage.com, the parent company of Geni.com, during the course of this study. The other authors do not declare relevant competing interests. The Geni data set without names is available from Y.E. under the terms described on FamiLinx.org. The code for the API integration is available at https://github.com/TeamErlich/geni-integration-example, the code for Sci-LMM is available at https://github.com/TalShor/SciLMM, and the code to download Geni profiles is available at https://github.com/erlichya/geni-download. The Human Mortality Database (HMD) is available at www.mortality.org. The Danish Twin Registry (DTR) data are available upon request from the University of Southern Denmark (www.sdu.dk/en/om_sdu/institutter_centre/ist_sundhedstjenesteforsk/centre/dtr). 
The findings, opinions, and recommendations expressed herein are those of the authors and are not necessarily those of the DTR. The Vermont Death Certificate collection was obtained upon request from the Chief of Public Health Statistics, Vermont Department of Health (www.healthvermont.gov/stats).