Mapping the Origins and Expansion of the Indo-European Language Family

Science  24 Aug 2012:
Vol. 337, Issue 6097, pp. 957-960
DOI: 10.1126/science.1219669

This article has a correction.


There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes about 6000 years ago. An alternative hypothesis claims that the languages spread from Anatolia with the expansion of farming 8000 to 9500 years ago. We used Bayesian phylogeographic approaches, together with basic vocabulary data from 103 ancient and contemporary Indo-European languages, to explicitly model the expansion of the family and test these hypotheses. We found decisive support for an Anatolian origin over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning 8000 to 9500 years ago. These results highlight the critical role that phylogeographic inference can play in resolving debates about human prehistory.

Model-based methods for Bayesian inference of phylogeny have been applied to comparative basic vocabulary data to infer ancestral relationships between languages (1–3). Such studies have focused on the use of subgrouping and time-depth estimates to test competing hypotheses, but they lack explicit geographic models of language expansion. Here, we used two novel quantitative phylogeographic inference tools derived from stochastic models in evolutionary biology to tackle the “most recalcitrant problem in historical linguistics” (4): the origin of the Indo-European languages. The “steppe hypothesis” posits an origin in the Pontic steppe region north of the Caspian Sea. Although the archaeological record provides a number of candidate expansions from this area (5), a steppe homeland is most commonly linked to evidence of an expansion into Europe and the Near East by Kurgan seminomadic pastoralists beginning 5000 to 6000 years ago (5–7). Evidence from “linguistic paleontology” (an approach in which terms reconstructed in the ancestral “proto-language” are used to make inferences about its speakers’ culture and environment) and putative early borrowings between Indo-European and the Uralic language family of northern Eurasia (8) are cited as possible evidence for a steppe homeland (9). However, the reliability of inferences derived from linguistic paleontology and claimed borrowings remains uncertain (5, 10). The alternative “Anatolian hypothesis” holds that Indo-European languages spread with the expansion of agriculture from Anatolia (in present-day Turkey), beginning 8000 to 9500 years ago (11). Estimates of the age of the Indo-European family derived from models of vocabulary evolution support the chronology implied by the Anatolian hypothesis, but the inferred dates remain controversial (5, 10, 12), and the implied models of geographic expansion under each hypothesis remain untested.

To test these two hypotheses, we adapted and extended a Bayesian phylogeographic inference framework developed to investigate the origin of virus outbreaks from molecular sequence data (13, 14). We used this approach to analyze a data set of basic vocabulary terms and geographic range assignments for 103 ancient and contemporary Indo-European languages (15–17). Following previous work that applied Bayesian phylogenetic methods to linguistic data (1–3), we modeled language evolution as the gain and loss of “cognates” (homologous words) through time (18–20). We combined phylogenetic inference with a relaxed random walk (RRW) (14) model of continuous spatial diffusion along the branches of an unknown, yet estimable, phylogeny to jointly infer the Indo-European language phylogeny and the most probable geographic ranges at the root and internal nodes. This phylogeographic approach treats language location as a continuous vector (longitude and latitude) that evolves through time along the branches of a tree and seeks to infer ancestral locations at internal nodes on the tree while simultaneously accounting for uncertainty in the tree.
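The idea of a location vector evolving along the branches of a tree can be illustrated with a small forward simulation. The sketch below is not the inference procedure used in the analysis (which was run in BEAST); it only shows what a relaxed random walk generates once a tree is given. The toy tree, the diffusion scale `sigma`, and the lognormal distribution for the branch-specific rate multipliers are all illustrative assumptions.

```python
import math
import random

# Hypothetical toy tree: parent -> list of (child, branch length in arbitrary time units).
toy_tree = {
    "root": [("A", 2.0), ("n1", 1.0)],
    "n1": [("B", 1.0), ("C", 1.0)],
}

def simulate_rrw(tree, root_location, sigma=1.0, rate_sd=0.5, rng=None):
    """Simulate (longitude, latitude) at every node under a toy relaxed random walk.

    Each branch gets its own rate multiplier, so some lineages move faster
    than others. A strict Brownian walk would use a multiplier of 1 everywhere.
    """
    rng = rng or random.Random()
    locations = {"root": root_location}
    stack = ["root"]
    while stack:
        parent = stack.pop()
        for child, branch_length in tree.get(parent, []):
            # Assumed lognormal multiplier with mean 1 (an illustrative choice).
            rate = rng.lognormvariate(-0.5 * rate_sd ** 2, rate_sd)
            # Displacement variance grows with branch length times the branch rate.
            sd = sigma * math.sqrt(rate * branch_length)
            lon, lat = locations[parent]
            locations[child] = (lon + rng.gauss(0.0, sd), lat + rng.gauss(0.0, sd))
            stack.append(child)
    return locations

example_locations = simulate_rrw(toy_tree, (35.0, 39.0), rng=random.Random(1))
```

Inference runs this logic in reverse: given observed tip locations, it asks which root and internal-node locations, trees, and branch rates make those observations probable.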

To increase the realism of the spatial diffusion, our method extends the RRW process in two ways. First, to reduce potential bias associated with assigning point locations to sampled languages, we use geographic ranges of the languages to specify uncertainty in the location assignments. Second, to account for geographic heterogeneity, we accommodate spatial prior distributions on the root and internal node locations. By assigning zero probability to node locations over water, we can incorporate into the analysis prior information about the shape of the Eurasian landmass.
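Both extensions can be sketched in a few lines. The code below is a minimal illustration, assuming that land areas and language ranges are available as simple polygons of (longitude, latitude) vertices and that a tip location is drawn uniformly within its range; the function names and the square test polygon are hypothetical and do not come from the analysis itself.

```python
import math
import random

def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def log_land_prior(point, land_polygons):
    """Log prior for a node location: zero probability (log = -inf) over water."""
    on_land = any(point_in_polygon(point, poly) for poly in land_polygons)
    return 0.0 if on_land else -math.inf

def sample_in_range(range_polygon, rng, max_tries=10000):
    """Draw a tip location uniformly inside a language's range by rejection sampling."""
    xs = [p[0] for p in range_polygon]
    ys = [p[1] for p in range_polygon]
    for _ in range(max_tries):
        candidate = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(candidate, range_polygon):
            return candidate
    raise RuntimeError("no point found inside the range polygon")

# Hypothetical square standing in for both a landmass and a language range.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
tip = sample_in_range(square, random.Random(0))
```

In an MCMC setting, a proposed node location with a log prior of negative infinity is always rejected, which is how the sampler is kept off the water.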

The estimated posterior distribution for the location of the root of the Indo-European tree under the RRW model is shown in Fig. 1A. The distribution for the root location lies in the region of Anatolia in present-day Turkey. To quantify the strength of support for an Anatolian origin, we calculated the Bayes factors (21) comparing the posterior to prior odds ratio of a root location within the hypothesized Anatolian homeland (11) (Fig. 1, yellow polygon) with two versions of the steppe hypothesis: the initially proposed Kurgan steppe homeland (6) and a later refined hypothesis (7) (Table 1). Bayes factors show strong support for the Anatolian hypothesis under an RRW model. This model allows large variation in rates of expansion and so is sufficiently flexible to fit the alternative hypothesis if the data support it. Further, the geographic centroid of the languages considered here falls within the broader steppe hypothesis (Fig. 1, green star), indicating that our model is not simply returning the center of mass of the sampled locations, as would be predicted under a simple diffusion process that ignores phylogenetic information and geographic barriers.
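The Bayes factor calculation can be sketched as a ratio of odds estimated from sampled root locations. The example below assumes that root locations sampled from the posterior and from the prior are available as (longitude, latitude) pairs, and it uses rectangular boxes as stand-ins for the homeland polygons. The box coordinates and sample values are made up for illustration; the support labels follow the thresholds given in the Table 1 caption.

```python
def make_box(lon_min, lon_max, lat_min, lat_max):
    """Return an indicator function for a rectangular region (a stand-in for a polygon)."""
    return lambda p: lon_min <= p[0] <= lon_max and lat_min <= p[1] <= lat_max

def count_in(samples, in_region):
    """Number of sampled root locations for which the indicator is true."""
    return sum(1 for s in samples if in_region(s))

def bayes_factor(posterior, prior, in_a, in_b):
    """Posterior odds of region A over region B, divided by the prior odds."""
    post_a, post_b = count_in(posterior, in_a), count_in(posterior, in_b)
    prior_a, prior_b = count_in(prior, in_a), count_in(prior, in_b)
    if 0 in (post_a, post_b, prior_a, prior_b):
        # With no samples in a region the ratio is undefined; more samples are needed.
        raise ValueError("a region received no samples")
    # Sample sizes cancel within each odds ratio, so raw counts suffice.
    return (post_a / post_b) / (prior_a / prior_b)

def support_label(bf):
    """Thresholds as stated in the Table 1 caption; values below 5 are not labelled there."""
    if bf > 100:
        return "decisive"
    if bf > 20:
        return "strong"
    if bf >= 5:
        return "substantial"
    return "below the substantial threshold"

# Made-up regions and samples, for illustration only.
region_a = make_box(0, 10, 0, 10)
region_b = make_box(20, 30, 0, 10)
posterior = [(5, 5)] * 8 + [(25, 5)] * 2
prior = [(5, 5)] * 5 + [(25, 5)] * 5
bf = bayes_factor(posterior, prior, region_a, region_b)
```

Dividing by the prior odds matters because the two candidate regions differ in size and position, so the prior alone may already favor one of them.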

Fig. 1

Inferred geographic origin of the Indo-European language family. (A) Map showing the estimated posterior distribution for the location of the root of the Indo-European language tree under the RRW analysis. Markov chain Monte Carlo (MCMC) sampled locations are plotted in translucent red such that darker areas correspond to increased probability mass. (B) The same distribution under a landscape-based analysis in which movement into water is less likely than movement into land by a factor of 100 (see fig. S5 for results under the other landscape-based models). The blue polygons delineate the proposed origin area under the steppe hypothesis; dark blue represents the initial suggested Kurgan homeland (6) (steppe I), and light blue denotes a later version of the steppe hypothesis (7) (steppe II). The yellow polygon delineates the proposed origin under the Anatolian hypothesis (11). A green star in the steppe region shows the location of the centroid of the sampled languages.

Table 1

Bayes factors comparing support for the Anatolian and steppe hypotheses. We estimated Bayes factors directly, using expectations of a root model indicator function taken over the MCMC samples drawn from the posterior and prior of each hypothesis. Bayes factors greater than 1 favor an Anatolian origin. A Bayes factor of 5 to 20 is taken as substantial support, greater than 20 as strong support, and greater than 100 as decisive (30).

Our results incorporate phylogenetic uncertainty given our data and model and so are not contingent on any single phylogeny. However, phonological and morphological data have been interpreted to support an Indo-European branching structure that differs slightly from the pattern we find, particularly near the base of the tree (16). If we constrain our analysis to fit with this alternative pattern of diversification, we find even stronger support for an Anatolian origin (in terms of Bayes factors, BF for steppe I = 216; BF for steppe II = 227) (15).

As the earliest representatives of the main Indo-European lineages, our 20 ancient languages might provide more reliable location information. Conversely, the position of the ancient languages in the tree, particularly the three Anatolian varieties, might have unduly biased our results in favor of an Anatolian origin. We investigated both possibilities by repeating the above analyses separately on only the ancient languages and only the contemporary languages (which excludes Anatolian). Consistent with the analysis of the full data set, both analyses still supported an Anatolian origin (Table 1).

The RRW approach avoids internal node assignments over water, but it does assume, along the unknown tree branches, the same underlying migration rate across water as across land. To investigate the robustness of our results to heterogeneity in rates of spatial diffusion, we developed a second inference procedure that allows migration rates to vary over land and water (15). This landscape-based model allows for the inclusion of a more complex diffusion process in which rates of migration are a function of geography. We examined the effect of varying relative rate parameters to represent a range of different migration patterns (15). Figure 1B shows the inferred Indo-European homeland under a model in which migration from land into water is less likely than from land to land by a factor of 100. At the other extreme, we fit a “sailor” model with no reluctance to move into water and rapid movement across water. Consistent with the findings based on the RRW model, each of the landscape-based models supports the Anatolian farming theory of Indo-European origin (Table 1).
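A landscape-dependent migration process of this kind can be sketched on a small grid. The example below is an illustrative assumption about how such a model could be set up, not the procedure described in the supplementary materials: cells are labelled land or water, and the rate of moving between neighbouring cells is a base rate scaled by a multiplier that depends on the two cell types. Only the factor of 100 for land-to-water moves comes from the text; the grid, the other multipliers, and the water-to-water speed-up in the "sailor" setting are hypothetical values.

```python
LAND, WATER = "L", "W"

def neighbours(r, c, n_rows, n_cols):
    """Yield the up/down/left/right neighbours that fall inside the grid."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n_rows and 0 <= cc < n_cols:
            yield rr, cc

def build_rates(grid, multipliers, base_rate=1.0):
    """Map (source cell, destination cell) to a migration rate."""
    rates = {}
    for r, row in enumerate(grid):
        for c, src_type in enumerate(row):
            for rr, cc in neighbours(r, c, len(grid), len(row)):
                dst_type = grid[rr][cc]
                rates[(r, c), (rr, cc)] = base_rate * multipliers[src_type, dst_type]
    return rates

def jump_probabilities(rates, cell):
    """Probability that the next move out of `cell` goes to each neighbour."""
    out = {dst: rate for (src, dst), rate in rates.items() if src == cell}
    total = sum(out.values())
    return {dst: rate / total for dst, rate in out.items()}

# Hypothetical 2 x 3 grid: two columns of land with water to the east.
grid = ["LLW", "LLW"]

# Land-to-water moves are 100 times less likely than land-to-land moves.
reluctant = {(LAND, LAND): 1.0, (LAND, WATER): 0.01,
             (WATER, LAND): 1.0, (WATER, WATER): 1.0}
# "Sailor" setting: no reluctance to enter water, faster movement across it (assumed x10).
sailor = {(LAND, LAND): 1.0, (LAND, WATER): 1.0,
          (WATER, LAND): 1.0, (WATER, WATER): 10.0}

p_reluctant = jump_probabilities(build_rates(grid, reluctant), (0, 1))
p_sailor = jump_probabilities(build_rates(grid, sailor), (0, 1))
```

Comparing `p_reluctant[(0, 2)]` with `p_sailor[(0, 2)]` shows how the same coastal cell sends lineages into the water rarely under one setting and as often as in any other direction under the other.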

Our results strongly support an Anatolian homeland for the Indo-European language family. The inferred location (Fig. 1) and timing [95% highest posterior density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin are congruent with the proposal that the family began to diverge with the spread of agriculture from Anatolia 8000 to 9500 years ago (11). In addition, the basal relationships in the tree (Fig. 2, inset, and figs. S1 and S2) and the geographic movements these imply are also consistent with archaeological evidence for an expansion of agriculture into Europe via the Balkans, reaching the edge of western Europe by 5000 years ago (22). This scenario fits with genetic (23–25) and craniometric (26) evidence for a Neolithic, Anatolian contribution to the European gene pool. An expansion of Indo-European languages with agriculture is also in line with similar explanations for language expansion in the Pacific (2), Southeast Asia (27), and sub-Saharan Africa (28), adding weight to arguments for the key role of agriculture in shaping global linguistic diversity (4).
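A highest posterior density interval of the kind quoted above is the narrowest interval that contains a given share of the posterior samples. The sketch below shows one common way to estimate it from a list of sampled values; the function name and the toy sample are illustrative assumptions and are unrelated to the actual posterior of root ages.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of consecutive sorted samples the interval must cover.
    k = max(1, math.ceil(mass * n))
    # Slide a window of k samples and keep the one with the smallest width.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Made-up sample with a long right tail, for illustration only.
toy_sample = [1, 2, 3, 4, 5, 50, 60, 70, 80, 90]
low, high = hpd_interval(toy_sample, mass=0.5)
```

Unlike an equal-tailed interval, the HPD interval shifts toward the region where samples are densest, which is why it is often preferred for skewed posteriors such as divergence times.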

Fig. 2

Map and maximum clade credibility tree showing the diversification of the major Indo-European subfamilies. The tree shows the timing of the emergence of the major branches and their subsequent diversification. The inferred location at the root of each subfamily is shown on the map, colored to match the corresponding branches on the tree. Albanian, Armenian, and Greek subfamilies are shown separately for clarity (inset). Contours represent the 95% (largest), 75%, and 50% HPD regions, based on kernel density estimates (15).

Despite support for an Anatolian Indo-European origin, we think it unlikely that agriculture serves as the sole driver of language expansion on the continent. The five major Indo-European subfamilies (Celtic, Germanic, Italic, Balto-Slavic, and Indo-Iranian) all emerged as distinct lineages between 4000 and 6000 years ago (Fig. 2 and fig. S1), contemporaneous with a number of later cultural expansions evident in the archaeological record, including the Kurgan expansion (5–7). Our inferred tree also shows that within each subfamily, the languages we sampled began to diversify between 2000 and 4500 years ago, well after the agricultural expansion had run its course. Figure 2 plots the inferred geographic origin of languages sampled from each subfamily under the RRW model. The interpretation of these results is straightforward when all the main branches of a subfamily are represented in the sample. In cases where there are branches not represented, such as Continental Celtic, the inferred time depths and locations may not correspond to the origin of all known languages in a subfamily. Because we know that the Romance languages in our sample are descended from Latin, this group presents a useful test case of our methodology. Our model correctly assigns high posterior support to the most recent common ancestor of contemporary Romance languages around Rome (fig. S3). Using this approach, we may therefore be able to test between more recent origin hypotheses pertaining to individual subgroups. Moreover, by combining the time-depth and location estimates across all internal nodes, we can generate a picture of the expansion of all Indo-European languages across the landscape (fig. S4 and movie S1).

Language phylogenies provide insights into the cultural history of their speakers (1–3, 28, 29). Our analysis of ancient and contemporary Indo-European languages shows that these insights can be made even more powerful by explicitly incorporating spatial information. Linguistic phylogeography enables us to locate cultural histories in space and time and thus provides a rigorous analytic framework for the synthesis of archaeological, genetic, and cultural data.

Supplementary Materials

Materials and Methods

Figs. S1 to S12

Tables S1 to S5

References (31–62)

Movie S1

BEAST input file

NEXUS tree file

References and Notes

  1. See supplementary materials on Science Online.
  2. Acknowledgments: We thank the New Zealand Phylogenetics Meeting and the National Evolutionary Synthesis Center (NESCent; NSF grant EF-0423641) for fostering collaboration on this project. Supported by the Marsden Fund (R.B., R.D.G., S.J.G., and A.J.D.), Rutherford Discovery Fellowships (Q.D.A., A.J.D.), administered by the Royal Society of New Zealand, and by NIH grants R01 GM086887 and R01 HG006139 (M.A.S.). The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement 278433-PREDEMICS and European Research Council (ERC) grant agreement 260864.