Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa


Science  15 Apr 2011:
Vol. 332, Issue 6027, pp. 346-349
DOI: 10.1126/science.1199295


Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a serial founder effect in which successive population bottlenecks during range expansion progressively reduce diversity, underpinning support for an African origin of modern humans. Recent work suggests that a similar founder effect may operate on human culture and language. Here I show that the number of phonemes used in a global sample of 504 languages is also clinal and fits a serial founder–effect model of expansion from an inferred origin in Africa. This result, which is not explained by more recent demographic history, local language diversity, or statistical non-independence within language families, points to parallel mechanisms shaping genetic and linguistic diversity and supports an African origin of modern human languages.

The number of phonemes (perceptually distinct units of sound that differentiate words) in a language is positively correlated with the size of its speaker population (1) in such a way that small populations have fewer phonemes. Languages continually gain and lose phonemes because of stochastic processes (2, 3). If phoneme distinctions are more likely to be lost in small founder populations, then a succession of founder events during range expansion should progressively reduce phonemic diversity with increasing distance from the point of origin, paralleling the serial founder effect observed in population genetics (4–9). A founder effect has already been used to explain patterns of variation in other cultural replicators, including human material culture (10–13) and birdsong (14). A range of possible mechanisms (15) predicts similar dynamics governing the evolution of phonemes (11, 16) and language generally (17–20). This raises the possibility that the serial founder–effect model used to trace our genetic origins to a recent expansion from Africa (4–9) could also be applied to global phonemic diversity to investigate the origin and expansion of modern human languages. Here I examine geographic variation in phoneme inventory size using data on vowel, consonant, and tone inventories taken from 504 languages in the World Atlas of Language Structures (WALS) (21), together with information on language location, taxonomic affiliation, and speaker demography (Fig. 1 and table S1) (15).

Fig. 1

Language locations and regional variation in phonemic diversity. (A) Map showing the location of the 504 sampled languages for which phoneme data were compiled from the WALS database. (B) Box plots of overall phonemic diversity by region reveal substantial regional variation (χ² = 188.7, df = 5, P < 0.001), with the highest diversity in Africa and the lowest diversity in Oceania and South America. The same regional pattern also applies at the language family level (fig. S2).

Consistent with previous work (1), speaker population size is a significant predictor of phonemic diversity (Pearson’s correlation r = 0.385, df = 503, P < 0.001), with smaller population size predicting smaller overall phoneme inventories (fig. S1A). The same relationship holds for vowel (r = 0.378, df = 503, P < 0.001) and tone (r = 0.230, df = 503, P < 0.001) inventories separately, with a weaker, though still significant, effect of population size on consonant diversity (r = 0.131, df = 503, P = 0.003). To account for any non-independence within language families, the analysis was repeated, first using mean values at the language family level (table S2) and then using a hierarchical linear regression framework to model nested dependencies in variation at the family, subfamily, and genus levels (15). These analyses confirm that, consistent with a founder effect model, smaller population size predicts reduced phoneme inventory size both between families (family-level analysis r = 0.468, df = 49, P < 0.001; fig. S1B) and within families, controlling for taxonomic affiliation {hierarchical linear model: fixed-effect coefficient (β) = 0.0338 to 0.0985 [95% highest posterior density (HPD)], P = 0.009}.
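As an illustration only, the comparison between an individual-language correlation and a family-level correlation can be sketched as follows. The records, family names, and scores are hypothetical toy values, not WALS data, and averaging log10 population within each family is an assumption; the supporting material may define the family-level means differently. The hierarchical linear model is not reproduced here.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def family_means(records):
    """Mean log10 population and mean phoneme score for each family (assumed definition)."""
    groups = defaultdict(list)
    for family, population, score in records:
        groups[family].append((math.log10(population), score))
    return {
        family: (sum(r[0] for r in rows) / len(rows), sum(r[1] for r in rows) / len(rows))
        for family, rows in groups.items()
    }

# Hypothetical toy records: (family, speaker population, phoneme diversity score)
toy = [
    ("FamA", 1_000_000, 0.9), ("FamA", 200_000, 0.6),
    ("FamB", 5_000, -0.4), ("FamB", 800, -0.7),
    ("FamC", 50_000, 0.1), ("FamC", 20_000, 0.2),
]

# Correlation across individual languages
log_pop = [math.log10(pop) for _, pop, _ in toy]
scores = [score for _, _, score in toy]
r_individual = pearson_r(log_pop, scores)

# Correlation across family means, which removes within-family replication
fam = family_means(toy)
r_family = pearson_r([v[0] for v in fam.values()], [v[1] for v in fam.values()])
```

Collapsing to one point per family trades sample size for independence, which is why the family-level test has far fewer degrees of freedom than the individual-language test.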

Figure 1B shows clear regional differences in phonemic diversity, with the largest phoneme inventories in Africa and the smallest in South America and Oceania. A series of linear regressions was used to predict phoneme inventory size from the log of speaker population size and distance from 2560 potential origin locations around the world (15). Incorporating modern speaker population size into the model controls for geographic patterning in population size and means that the analysis is conservative about the amount of variation attributed to ancient demography. Model fit was evaluated with the Bayesian information criterion (BIC) (22). Following previous work (5, 6), the set of origin locations within four BIC units of the best-fit location was taken to be the most likely area of origin under a serial founder–effect model.
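The origin scan described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure in the supporting material: distance is taken as the direct great-circle (haversine) distance, the BIC is the usual Gaussian form n ln(RSS/n) + k ln(n), and the language records and the three candidate origins are hypothetical toy values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def regression_bic(y, predictors):
    """Ordinary least squares with an intercept; returns the Gaussian BIC (assumed form)."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * col[i] for b, col in zip(beta, cols))) ** 2
              for i, yi in enumerate(y))
    return n * math.log(rss / n) + len(cols) * math.log(n)

# Hypothetical toy sample: (latitude, longitude, speakers, phoneme diversity score)
languages = [
    (-22.0, 20.0, 4_000, 1.4), (9.0, 8.0, 2_000_000, 0.9),
    (48.0, 2.0, 60_000_000, 0.3), (28.0, 77.0, 300_000_000, 0.5),
    (35.0, 105.0, 800_000_000, 0.4), (-6.0, 145.0, 3_000, -0.6),
    (60.0, -150.0, 1_000, -0.2), (-10.0, -60.0, 500, -0.9),
    (-40.0, 175.0, 50_000, -1.1), (19.0, -99.0, 1_500_000, -0.4),
]
# Hypothetical candidate origins (the article scans 2560 locations)
candidates = {"southern Africa": (-20.0, 22.0), "central Asia": (45.0, 70.0),
              "South America": (-15.0, -60.0)}

y = [score for _, _, _, score in languages]
log_pop = [math.log10(pop) for _, _, pop, _ in languages]
bic = {}
for name, (lat0, lon0) in candidates.items():
    # Distance in thousands of km keeps the normal equations well scaled
    dist = [haversine_km(lat0, lon0, lat, lon) / 1000.0 for lat, lon, _, _ in languages]
    bic[name] = regression_bic(y, [log_pop, dist])

best = min(bic, key=bic.get)
# Candidate origins within four BIC units of the best fit form the credible set
credible = [name for name, value in bic.items() if value - bic[best] <= 4.0]
```

Because every candidate model has the same number of parameters and the same sample, ranking by BIC here is equivalent to ranking by residual sum of squares; the four-unit threshold only determines how wide the credible region is.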

The origin locations producing the strongest decline in phonemic diversity and the best-fit model lie across central and southern Africa (Fig. 2A). This region could represent either a single origin for modern languages or the main origin under a polygenesis scenario. The best-fit model incorporating population size and distance from the origin explains 31% of the variance in phoneme inventory size [correlation coefficient (R) = 0.558, F(2, 501) = 113.463, P < 0.001] (Fig. 3). Both population size (r_population = 0.146, P = 0.002) and distance from origin (r_distance = –0.438, P < 0.001) are significant predictors in the model. Controlling for population size, distance from origin accounts for 19% of the variance in phonemic diversity. A model using only distance as a predictor gives a broadly equivalent origin area (fig. S3) and explains 30% of the variation in phonemic diversity (r = –0.545, P < 0.001). The relationship also holds for vowel (r = –0.394, P < 0.001), consonant (r = –0.260, P < 0.001), and tone diversity (r = –0.391, P < 0.001) separately.
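The percentages quoted above follow from squaring the reported coefficients, and the F statistic can be approximately recovered from R. The short check below uses only figures stated in the text; the helper names are hypothetical, and small discrepancies are expected because the published coefficients are rounded to three decimals.

```python
def variance_explained(r):
    """Proportion of variance associated with a correlation coefficient."""
    return r * r

def f_statistic(multiple_r, n, k):
    """F statistic for a regression with k predictors and n observations."""
    r2 = multiple_r ** 2
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

print(round(100 * variance_explained(0.558), 1))   # 31.1 (reported: 31%)
print(round(100 * variance_explained(-0.438), 1))  # 19.2 (reported: 19%)
print(round(100 * variance_explained(0.545), 1))   # 29.7 (reported: 30%)
print(round(f_statistic(0.558, 504, 2), 1))        # 113.3 (reported: 113.463)
```

With n = 504 languages and k = 2 predictors, the denominator degrees of freedom are n - k - 1 = 501, matching the degrees of freedom given for the F test.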

Fig. 2

Likely area of language origin. Maps show the likely location of a single language origin under a founder effect model of phonemic diversity (controlling for population size) inferred from (A) individual languages and (B) mean diversity across language families. Lighter shading implies a stronger inverse relationship between phonemic diversity and distance from the origin and better fit of the model, as measured by the BIC. The most likely region of origin, comprising those locations within four BIC units of the best-fit origin location, is the area of lightest shading outlined in bold.

Fig. 3

Phonemic diversity versus distance from the best-fit origin in Africa. A plot of distance from the best-fit origin location in Africa against overall phoneme inventory size is shown. Distance from the origin alone explains 30% of the variation in phonemic diversity (fitted line; r = –0.545, n = 504 languages, P < 0.001) and 19.2% of the variation after controlling for modern speaker population size [r_distance = –0.438, P < 0.001; r_population = 0.146, P < 0.001; R = 0.558, F(2, 501) = 113.463, P < 0.001].

To account for relatedness within families, I repeated the above regressions using mean values across language families (table S2) and under a hierarchical linear model comprising the three taxonomic levels recorded in WALS (15). The hierarchical model results closely matched those of the individual language analysis (fig. S4). Adding an interaction effect did not significantly improve model fit, indicating that the patterns reported here reflect a consistent trend that holds across the globe. The family-level analysis was consistent with the individual language analyses, although the credible region of origin is expanded to include all of Africa (Fig. 2B). Distance from the best-fit origin (r_distance = –0.401, P = 0.004) and population size (r_population = 0.300, P = 0.036) are both significant predictors and account for 39% of the variance in phonemic diversity between families [R = 0.627, F(2, 47) = 15.190, P < 0.001; fig. S5]. As a further test of the robustness of these findings, individual regressions were repeated using partial Mantel tests, which allow for non-independence between data points and avoid assumptions about the statistical distributions underlying the variables of interest (15). The results of this analysis matched the findings reported above (table S3).
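For readers unfamiliar with the technique, a partial Mantel test can be sketched as below. This is one common variant (partial correlation between the upper triangles of three distance matrices, with significance from permuting the rows and columns of the first matrix), not necessarily the exact procedure in the supporting material. The six-item toy values and the use of absolute pairwise differences to build each matrix are assumptions for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def upper(m):
    """Flatten the upper triangle (excluding the diagonal) of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def partial_r(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def permute(m, order):
    """Reorder rows and columns of a square matrix with the same permutation."""
    return [[m[i][j] for j in order] for i in order]

def partial_mantel(mx, my, mz, permutations=999, seed=0):
    """Return (observed partial r, two-sided permutation p-value)."""
    rng = random.Random(seed)
    y, z = upper(my), upper(mz)
    observed = partial_r(upper(mx), y, z)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(len(mx)))
        rng.shuffle(order)
        if abs(partial_r(upper(permute(mx, order)), y, z)) >= abs(observed):
            hits += 1
    return observed, hits / (permutations + 1)

def difference_matrix(values):
    """Pairwise absolute differences (an assumed way to build each matrix)."""
    return [[abs(a - b) for b in values] for a in values]

# Hypothetical toy values for six languages
score = [0.9, 0.5, 0.2, -0.1, -0.6, -0.8]   # phoneme diversity
dist = [1.0, 4.0, 7.0, 9.0, 14.0, 17.0]     # distance from origin, thousands of km
log_pop = [6.0, 4.2, 5.1, 3.3, 4.8, 2.9]    # log10 speaker population

observed, p_value = partial_mantel(
    difference_matrix(score), difference_matrix(dist), difference_matrix(log_pop)
)
```

Permuting whole rows and columns together keeps each language's set of pairwise values intact, which is what lets the test tolerate the non-independence among matrix entries.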

To examine the possibility of language polygenesis, distance from a second origin location was added as a predictor to a model incorporating population size and distance from the best-fit origin in Africa. The best-fit models in this analysis did not show a significant negative correlation between distance from a second origin and phonemic diversity. When the analysis is restricted to second origin locations that do show an inverse relationship, a region of best fit can be identified in South America (fig. S6). However, this pattern does not appear under the hierarchical linear model or language family–level analysis; adding a second origin does not improve the fit of either model as measured by the BIC, and all putative second origin locations are within four BIC units. The area identified in the individual-level analysis may be an artefact of slightly higher diversity levels in the Americas than in Oceania, despite the two being comparable distances from Africa, possibly because of a stronger founder effect across the remote Pacific (18–20, 23). When the languages of Oceania are removed from the individual analysis, the effect of distance from an African origin remains (r_distance1 = –0.447, P < 0.001) but there is no significant effect of distance from the secondary origin (r_distance2 = –0.065, P = 0.192). As expected if language spread south with the initial colonization of the Americas, distance from the Bering Strait is inversely correlated with phoneme inventory size within the Americas after controlling for population size (r_distance = –0.173, P = 0.043).

An ostensibly global cline in phonemic diversity supporting an expansion from Africa could also arise as an artefact of a series of more recent expansions after the Last Glacial Maximum (LGM) into northern Eurasia, the Americas, and the remote Pacific. Languages in these regions show lower average phonemic diversity than in the rest of the world (t = –6.597, df = 503, P < 0.001), which is consistent with a more recent colonization. However, expansion after the LGM does not account for the global cline in phonemic diversity. Distance from Africa remains a significant predictor of phonemic diversity after controlling for colonization since the LGM (r_distance = –0.401, P < 0.001; r_population = 0.152, P = 0.001; r_LGM = 0.032, P = 0.419), as well as when these more recently colonized areas are excluded from the analysis altogether (r_distance = –0.511, P < 0.001; r_population = 0.253, P < 0.001) (19).

Demographic factors other than population size may also influence phonemic diversity, particularly those affecting levels of contact and borrowing between groups of speakers. Because neighboring populations at similar points in an expansion are more likely to have similar phonemes and levels of diversity, moderate horizontal transfer between populations can maintain a cline, as has been the case for human genetic diversity (24). However, geographic variation in language diversity (the number of languages per unit of area), population density (the number of speakers per unit of area), or language area (the total area over which a language is spoken) could affect regional phonemic diversity by increasing contact within and between groups and creating more opportunities to borrow new phonemes. To test whether any such effect could explain the observed global cline in phonemic diversity, these additional measures were included, together with population size and distance from the best-fit origin in Africa, in a regression model predicting phoneme inventory size (15). Controlling for other demographic variables in this way, sub-Saharan Africa remains the most likely area of origin (fig. S7). Distance from the best-fit origin location is a significant predictor at the individual language level (r_distance = –0.413, P < 0.001), family level (r_distance = –0.384, P < 0.008), and in the hierarchical linear model [β = –3.419 × 10⁻⁵ to –2.223 × 10⁻⁵ (95% HPD), P < 0.001]. The demographic variables are highly correlated and did not show significant independent effects on phonemic diversity. Stepwise regressions indicated that a model incorporating distance from Africa, population size, and (at the individual language level only) language area best explained phoneme inventory size (15).

The single major cline in phonemic diversity is consistent with a linguistic founder effect operating under conditions of rapid expansion from a most likely origin in Africa. This supports a picture of language spread that is congruent with similar analyses of human genetic (4, 5, 7, 8) and phenotypic (6, 9) diversity. Phonemic diversity appears to be highly stable within major language families (15), indicating that, despite the many sociolinguistic processes at work (2, 3), robust statistical patterns in global variation can persist for many millennia and could plausibly reflect a time scale on the order of the African exodus. Outside Africa, the highest levels of phonemic diversity are found in language families thought to be autochthonous to Southeast Asia. This also fits with genetic evidence (25, 26), indicating that Southeast Asia experienced particularly pronounced population growth immediately after the African exodus, meaning that languages in this region should have been least affected by population bottlenecks and would have had the most time to recover diversity.

Although distance from Africa explains much less of the variation in phonemic diversity (19%) than in neutral genetic markers (80 to 85%) (5, 7), the effect is comparable to that obtained from analysis of human mitochondrial DNA (18%) (8) or phenotypic data (14 to 28%) (6). To the extent that language can be taken as an example of cultural evolution more generally, these findings support the proposal that a cultural founder effect operated during our colonization of the globe, potentially limiting the size and cultural complexity of societies at the vanguard of the human expansion (10, 11). An origin of modern languages predating the African exodus 50,000 to 70,000 years ago puts complex language alongside the earliest archaeological evidence of symbolic culture in Africa 80,000 to 160,000 years ago (27, 28). Truly modern language, akin to languages spoken today, may thus have been the key cultural innovation that allowed the emergence of these and other hallmarks of behavioral modernity and ultimately led to our colonization of the globe (29).

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 to S8

Tables S1 to S4


References and Notes

  1. Materials and methods are available as supporting material on Science Online.
  2. Thanks to M. Pagel, O. Curry, R. Dunbar, M. Dunn, R. Gray, S. Greenhill, M. Grove, S. Roberts, R. Ross, and S. Schultz for useful advice and/or comments on the manuscript. I declare no competing financial interest.