Structural Phylogenetics and the Reconstruction of Ancient Language History

Science  23 Sep 2005:
Vol. 309, Issue 5743, pp. 2072-2075
DOI: 10.1126/science.1114615


The contribution of language history to the study of the early dispersals of modern humans throughout the Old World has been limited by the shallow time depth (about 8000 ± 2000 years) of current linguistic methods. Here it is shown that the application of biological cladistic methods, not to vocabulary (as has been previously tried) but to language structure (sound systems and grammar), may extend the time depths at which language data can be used. The method was tested against well-understood families of Oceanic Austronesian languages, then applied to the Papuan languages of Island Melanesia, a group of hitherto unrelatable isolates. Papuan languages show an archipelago-based phylogenetic signal that is consistent with the current geographical distribution of languages. The most plausible hypothesis to explain this result is the divergence of the Papuan languages from a common ancestral stock, as part of late Pleistocene dispersals.

The linguistic comparative method used to construct language family trees relies on recognizing “cognate sets”: words in different languages that are related in meaning and form because they can be shown to have the same ultimate source in an ancestor language. The comparative method has helped define the major linguistic family groups that are recognized today. Unfortunately, because of the continual process of linguistic change, the method is limited to a time depth of approximately 8000 ± 2000 years (1). However, it is probable that a considerable portion of linguistic diversification occurred at earlier dates, associated with later Pleistocene human dispersals. Alternative attempts to reach further back and link the world's ∼300 language families (2) into larger taxonomic units are controversial (3–5).

One example of this older diversification may be found in Island Melanesia. Radiocarbon dating for Island Melanesia has demonstrated Pleistocene occupation more than 35,000 years ago (6, 7) (Fig. 1). Evidence suggests high levels of inter- and intrapopulational genetic variation (8, 9), with no simple relationship with linguistic patterns. The languages spoken in the area are of two groups: (i) over 100 languages belonging to four groups of the well-established Austronesian family, which probably originated in the area close to Taiwan and spread to this region about 4000 years ago (10); and (ii) 23 “Papuan” languages, which are not known to have any phylogenetic relation to one another and are of much greater antiquity in the region.

Fig. 1.

Island Melanesia, showing the distribution of the Western Oceanic (Austronesian) (triangles) and Papuan (diamonds) languages used in the sample.

The lexical evidence for relationships between Papuan languages is minimal. Apart from shared Austronesian loans, there are few plausible cognate candidates found in comparisons of pairs of words from Papuan vocabularies (Fig. 2) [see, however, (11)]. Assuming that the rate of vocabulary loss in the Papuan languages is similar to rates observed elsewhere, these languages are either unrelated or have been separated at least since the early Holocene or late Pleistocene. These languages do, however, show a high degree of structural similarity, distinguishing them as a group from their Austronesian neighbors, which has led scholars to propose genealogical (or near-genealogical) groupings (12, 13). In the absence of identifiable lexical cognates, we have used computational cladistic analysis of these features of linguistic structure to test whether a phylogenetic signal can be identified beyond the resolution of lexical form-based methods [for other cladistic methods using lexicons, see (14–21)]. The structural features of a language, like the lexicon, are subject to processes of decay over time and can also be borrowed or exchanged across languages. However, such exchange usually occurs only under special conditions of prolonged and intensive contact, and it is at least plausible that where the lexical signal has been lost, a faint structural signal might still be discernible. Linguistic structure (that is, grammar rather than vocabulary) has previously been used in historical linguistics to show statistical evidence for ancient links between languages from different parts of the world (1, 2, 22, 23) but not directly to reconstruct phylogenetic relationships.

Fig. 2.

The transparency of cognates in three dispersed Austronesian versus four close Papuan languages (Austronesian cognates/loanwords are shown in italics). The four Papuan languages have an apparent level of 3 to 5% shared vocabulary in a standard 200-word list (29). Using a scrambling test, the word list for each language was randomly reordered, and apparent lexeme correspondences were recounted. The level of apparent cognacy on this random list was exactly the same as on the correctly sorted list, demonstrating that the amount of apparently shared lexicon between any pair of Papuan languages is not greater than chance.
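
The logic of the scrambling test can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the crude "same first two letters" similarity rule (a stand-in for a linguist's judgement of apparent correspondence), and the invented word forms, which are not data from any language.

```python
import random

def looks_alike(a: str, b: str) -> bool:
    """Hypothetical similarity rule: the two forms share their first two letters."""
    return len(a) >= 2 and len(b) >= 2 and a[:2] == b[:2]

def apparent_cognates(list_a, list_b, match=looks_alike) -> int:
    """Count meaning slots whose forms look alike in the two word lists."""
    return sum(match(a, b) for a, b in zip(list_a, list_b))

def scrambling_test(list_a, list_b, trials=1000, seed=0):
    """Compare the aligned count with counts after randomly reordering list_b.

    Returns (observed count, mean scrambled count,
             fraction of scrambled lists scoring at least the observed count).
    """
    rng = random.Random(seed)
    observed = apparent_cognates(list_a, list_b)
    shuffled = list(list_b)
    total = 0
    at_least = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # destroys the meaning-by-meaning alignment
        count = apparent_cognates(list_a, shuffled)
        total += count
        at_least += count >= observed
    return observed, total / trials, at_least / trials

# Invented toy forms, aligned by meaning slot.
lang_x = ["kapo", "selu", "tiri", "mone", "wagu", "duha"]
lang_y = ["kapa", "sela", "tiro", "mona", "wago", "duhi"]
print(scrambling_test(lang_x, lang_y))
```

If the aligned count is no higher than the typical scrambled count, as reported for the Papuan lists, the apparent resemblances are what chance alone would produce.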

A questionnaire-based database was constructed, in which linguistic structural features were coded for their presence or absence in each of the target languages. These characters were abstract (coded without respect to their formal expression) and were selected to provide broad typological coverage, reflecting the known linguistic variation of the region (24), as well as to be features that would typically be described in a published sketch grammar. Traits invariant in the region (either entirely absent, such as polysynthesis or proximate/obviative case distinctions; or present in all the languages, such as the existence of a word class of verbs) were not coded. Characters that show strong implicational correlations were excluded, although characters with weaker tendencies to covariance were not excluded where the current state of linguistic typological knowledge does not allow us to systematically distinguish functionally motivated covariance from phylogenetic or areal patterns. The completed data matrix contained 125 binary features coded for 15 Papuan and 16 Austronesian languages spoken in an overlapping region. The Papuan database was mostly compiled by linguists with field experience in the language and was supplemented from published and unpublished sources where available. The Austronesian database was constructed from published sources (25). All sets of data were checked by a second coder to ensure consistency.
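
As a rough illustration of this coding scheme, the sketch below stores presence/absence values per language and filters out traits that do not vary across the sample. The language labels, feature names, and values are hypothetical and are not taken from the study's database.

```python
# Hypothetical feature names; 1 = present, 0 = absent.
FEATURES = ["has_verb_class", "polysynthesis", "gender_on_nouns", "verb_final_order"]

# Hypothetical toy matrix: one row of binary codes per language.
MATRIX = {
    "LangA": [1, 0, 1, 1],
    "LangB": [1, 0, 0, 1],
    "LangC": [1, 0, 1, 0],
}

def variable_columns(matrix):
    """Return indexes of characters that take more than one value in the sample."""
    rows = list(matrix.values())
    n_chars = len(rows[0])
    return [j for j in range(n_chars) if len({row[j] for row in rows}) > 1]

def drop_invariant(matrix):
    """Keep only characters that vary, since invariant ones carry no grouping information."""
    keep = variable_columns(matrix)
    return {lang: [row[j] for j in keep] for lang, row in matrix.items()}

kept = variable_columns(MATRIX)
print([FEATURES[j] for j in kept])
print(drop_invariant(MATRIX))
```

In this toy case the always-present verb class and the always-absent polysynthesis column are dropped, mirroring the exclusion of regionally invariant traits described above.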

The binary-coded linguistic features allowed us to treat these as character traits distributed among taxonomic units (languages) and thus to apply cladistic algorithms (maximum parsimony or NeighborNet) to determine potential phylogenetic relationships among them (26).
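
To make the parsimony criterion concrete, the following sketch counts the minimum number of state changes a tree requires, using Fitch's algorithm for unordered characters. It is an illustration under stated assumptions, not the software pipeline used in the study: the tree encoding (nested tuples), the function names, and the four-language toy matrix are all hypothetical.

```python
def fitch_steps(tree, states):
    """Fitch pass for one character on a rooted binary tree.

    tree: a language name (leaf) or a pair (left_subtree, right_subtree).
    states: dict mapping language name -> 0 or 1.
    Returns (possible ancestral states, number of changes required).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: a change is needed on one of the two branches.
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix):
    """Total number of steps over all characters (the tree length)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {lang: row[j] for lang, row in matrix.items()})[1]
        for j in range(n_chars)
    )

# Hypothetical binary codes for four languages and three characters.
toy = {"A": [1, 1, 0], "B": [1, 0, 0], "C": [0, 1, 1], "D": [0, 0, 1]}
grouping_ab_cd = (("A", "B"), ("C", "D"))
grouping_ac_bd = (("A", "C"), ("B", "D"))
print(tree_length(grouping_ab_cd, toy), tree_length(grouping_ac_bd, toy))
```

Maximum parsimony prefers the tree with the smaller length. The trees are written here as rooted pairs only for convenience; the Fitch length does not depend on where the root is placed.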

The hypothesis that grammatical structure retained a phylogenetic signature was first tested among 16 languages belonging to the Meso-Melanesian, Papuan Tip, and North New Guinea linkages, three sister clades within the Western Oceanic subgroup of Austronesian, the relationship of which has been established by the comparative method (10, 27) {although not completely unambiguously, because there is lexical evidence in particular that the Papuan Tip and the North New Guinea linkages had a period of shared history after their separation from Meso-Melanesian [(10), p. 101]}. We carried out a parsimony analysis on the structural data from these languages, from which we obtained a consensus tree [tree length, 224 steps; consistency index (CI) = 0.42; rescaled consistency index (RC) = 0.19; retention index (RI) = 0.46]. When this tree (Fig. 3, right) is compared with the classification based on the comparative method (Fig. 3, left), there is a close match. In the consensus tree, the Meso-Melanesian group forms a major branch. Papuan Tip and North New Guinea together form a clade, with the North New Guinea linkage nested as a subclade within it. This is consistent with uncertainties in the linguistic reconstruction. The internal structure of the Meso-Melanesian group is quite flat, but all except one of the clades posited by the comparative method are congruently represented in the consensus tree. These results show that cladistically analyzed grammatical structure can preserve a signal that is consistent with a known phylogeny derived by traditional lexical techniques.

Fig. 3.

Phylogenetic relationships among two taxa of the Western Oceanic subgroup of the Austronesian language family. (Left) Reconstructed phylogeny of the languages of the Meso-Melanesian, Papuan Tip, and North New Guinea groups based on the linguistic comparative method (10, 27). (Right) Unrooted parsimony tree showing relationships among the Meso-Melanesian and Papuan Tip groups based on grammatical traits only (that is, discarding abundant lexical evidence) (the figure shows reweighted and raw bootstrap values). The two trees show a high degree of concordance, with monophyly in both major taxa and the similar geographical structuring of within-taxon diversity.

On the basis of this result, we applied the same method to a set of languages in which lexical similarities are not present. Taking 15 Papuan languages for which we have full structural data and applying the same methods, we obtained a consensus tree of the most parsimonious cladograms for the bootstrapped data set (Fig. 4). This tree has a tree length of 349 steps, CI = 0.35, RC = 0.14, and RI = 0.39. The results show a remarkably geographically consistent pattern: The major clades represent archipelagos, and within each archipelago nearest neighbors tend to form sister clades, despite a nearly complete absence of lexical relatedness.
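
For readers unfamiliar with the fit statistics, the sketch below uses the standard definitions, with m the minimum possible number of steps, s the observed tree length, and g the maximum possible number of steps: CI = m/s, RI = (g − s)/(g − m), and RC = CI × RI. The function names and the toy step counts are assumptions for illustration; only the CI, RI, and RC values in the final check come from the text.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s; equals 1 when the tree requires no extra (homoplastic) changes."""
    return min_steps / observed_steps

def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m); share of potential similarity retained as shared derived states."""
    return (max_steps - observed_steps) / (max_steps - min_steps)

def rescaled_consistency(ci, ri):
    """RC = CI * RI."""
    return ci * ri

# Hypothetical toy tree: m = 3, s = 4, g = 6.
ci = consistency_index(3, 4)
ri = retention_index(3, 4, 6)
print(ci, ri, rescaled_consistency(ci, ri))

# The reported RC values agree with CI * RI after rounding to two decimals.
reported = {"Austronesian": (0.42, 0.46, 0.19), "Papuan": (0.35, 0.39, 0.14)}
for name, (ci_r, ri_r, rc_r) in reported.items():
    print(name, round(rescaled_consistency(ci_r, ri_r), 2) == rc_r)
```

On both measures the Papuan tree fits its data less tightly than the Austronesian tree, which is the expected direction if the Papuan relationships are much older.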

Fig. 4.

Maximum parsimony tree of Island Melanesian Papuan languages with reweighted and raw bootstrap values. The tree shows a high level of geographic patterning by island group. Solomon Island languages are intermediate between Bougainville and Bismarck Archipelago languages, which violates the expected geographic progression.
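
The idea behind a bootstrap value can be sketched as follows: characters are resampled with replacement, the parsimony comparison is repeated on each pseudo-replicate, and support is the fraction of replicates in which a grouping is recovered. This is a deliberately small illustration restricted to four hypothetical languages, for which only three unrooted groupings exist; the matrix, names, and replicate count are assumptions, and the reweighting step mentioned in the caption is not modeled.

```python
import random

def fitch_steps(tree, states):
    """Fitch pass for one character; returns (state set, changes required)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    (lset, lsteps), (rset, rsteps) = (fitch_steps(sub, states) for sub in tree)
    common = lset & rset
    if common:
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1

def tree_length(tree, matrix, columns):
    """Sum of Fitch steps over the given (possibly repeated) character columns."""
    return sum(
        fitch_steps(tree, {lang: row[j] for lang, row in matrix.items()})[1]
        for j in columns
    )

def bootstrap_support(matrix, target, rivals, replicates=500, seed=0):
    """Fraction of resampled data sets in which `target` is strictly shortest."""
    rng = random.Random(seed)
    n_chars = len(next(iter(matrix.values())))
    wins = 0
    for _ in range(replicates):
        # Draw n_chars character columns with replacement.
        columns = [rng.randrange(n_chars) for _ in range(n_chars)]
        target_len = tree_length(target, matrix, columns)
        if all(target_len < tree_length(r, matrix, columns) for r in rivals):
            wins += 1
    return wins / replicates

# Hypothetical data: four characters group A with B, two conflict with that grouping.
toy = {
    "A": [1, 1, 1, 1, 1, 1],
    "B": [1, 1, 1, 1, 0, 0],
    "C": [0, 0, 0, 0, 1, 0],
    "D": [0, 0, 0, 0, 0, 1],
}
ab_cd = (("A", "B"), ("C", "D"))
ac_bd = (("A", "C"), ("B", "D"))
ad_bc = (("A", "D"), ("B", "C"))
print(bootstrap_support(toy, ab_cd, [ac_bd, ad_bc]))
```

Conflicting characters lower the support even when the grouping is the best one on the full data, which is one way homoplasy can depress bootstrap values on deep branches.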

Interpretation is problematic, because there are no generally accepted independent linguistic criteria for assessing the Papuan trees. One possibility is that these trees reflect contact with local Austronesian neighbors, providing an areal rather than phylogenetic signal. In experiments, combined Austronesian-Papuan consensus trees were in some cases intermeshed, but the result was statistically weak (28). Because Papuan and Austronesian are very unlikely to be genuine sister clades, a high degree of homoplasy can be the result of either contact or chance convergence, and combined trees of very remotely related families are likely to be less robust than those where there are good grounds for assuming monophyly. A second possibility is the null hypothesis of no relatedness between the Papuan languages. In that case, we would not expect the orderly and geographically consistent phylogenetic signal that does emerge from the data. This signal is consistent with migration followed by divergence through local isolation. A further possibility is that the geographically consistent tree reflects recent areal contact among Papuan speakers, but most of these languages are not currently spoken in contiguous regions. Because these languages may have been contiguous in the past, regional diffusion also may account for the phylogenetic signal observed, a possibility that we cannot test without more detailed archaeological information.

We therefore suggest that this method reveals evidence of large-scale genealogical clustering of the Island Melanesian languages; the lack of putative lexical cognates dates these relationships considerably before the Austronesian arrival, in line with the radiocarbon dates from the later Pleistocene, when humans entered Island Melanesia from mainland Papua New Guinea.

There remain important issues to resolve. The first is methodological; bootstrap values, especially in the deeper branches, are low by comparison with biological systems, and further work is required to determine whether this reflects rates of convergence, trait covariation, or processes other than phylogenesis alone. Second, the branching sequence does not fit the generally expected dispersal path. A priori, Island Melanesian Papuan languages should show a general west-to-east pattern of diversification, with the center of diversity in the west. The results of our data are more complex. In particular, the position of the Solomons languages is anomalous, located in the tree between the Bismarcks clade and the Bougainville clade, in violation of geographic expectation [because Bougainville is the natural way-station on the route from mainland New Guinea to the Solomons (Fig. 1)]. During the late Pleistocene, Bougainville and the Solomons were united into a single island, from which the Bismarcks were always separate. A plausible interpretation of the Papuan language tree is thus that the two language groups now located on the Solomons and Bougainville separated from a common ancestor. This could have happened while they could still freely migrate on a common landmass, a time depth (∼10,000 years) in accord with that required to erode traces of common vocabulary. This population history hypothesis will require further testing with both linguistic and genetic data.

If grammatical structures can retain a phylogenetic signal beyond the current temporal ceiling on the reconstruction of language history, then the possibility is opened up of finding relationships between others of the world's 300 or so existing language families and isolates.

Supporting Online Material

Materials and Methods

Figs. S1 and S2


Sources of language data

Linguistic characters

Data file

References and Notes
