{"title": "A Probabilistic Approach to Language Change", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 176, "abstract": "We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. Our framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes using the reconstruction of ancient word forms in Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies.", "full_text": "A Probabilistic Approach to Language Change\n\nAlexandre Bouchard-C\u02c6ot\u00b4e\u2217\n\nPercy Liang\u2217\n\n\u2217Computer Science Division\n\nUniversity of California at Berkeley\n\nBerkeley, CA 94720\n\nThomas L. Grif\ufb01ths\u2020\n\u2020Department of Psychology\n\nDan Klein\u2217\n\nAbstract\n\nWe present a probabilistic approach to language change in which word forms\nare represented by phoneme sequences that undergo stochastic edits along the\nbranches of a phylogenetic tree. This framework combines the advantages of\nthe classical comparative method with the robustness of corpus-based probabilis-\ntic models. We use this framework to explore the consequences of two differ-\nent schemes for de\ufb01ning probabilistic models of phonological change, evaluating\nthese schemes by reconstructing ancient word forms of Romance languages. 
The\nresult is an ef\ufb01cient inference procedure for automatically inferring ancient word\nforms from modern languages, which can be generalized to support inferences\nabout linguistic phylogenies.\n\n1 Introduction\n\nLanguages evolve over time, with words changing in form, meaning, and the ways in which they can\nbe combined into sentences. Several centuries of linguistic analysis have shed light on some of the\nkey properties of this evolutionary process, but many open questions remain. A classical example is\nthe hypothetical Proto-Indo-European language, the reconstructed common ancestor of the modern\nIndo-European languages. While the existence and general characteristics of this proto-language are\nwidely accepted, there is still debate regarding its precise phonology, the original homeland of its\nspeakers, and the date of various events in its evolution. The study of how languages change over\ntime is known as diachronic (or historical) linguistics (e.g., [4]).\nMost of what we know about language change comes from the comparative method, in which words\nfrom different languages are compared in order to identify their relationships. The goal is to identify\nregular sound correspondences between languages and use these correspondences to infer the forms\nof proto-languages and the phylogenetic relationships between languages. The motivation for basing\nthe analysis on sounds is that phonological changes are generally more systematic than syntactic or\nmorphological changes. Comparisons of words from different languages are traditionally carried\nout by hand, introducing an element of subjectivity into diachronic linguistics. 
Early attempts to\nquantify the similarity between languages (e.g., [15]) made drastic simplifying assumptions that\ndrew strong criticism from diachronic linguists.\nIn particular, many of these approaches simply\nrepresent the appearance of a word in two languages with a single bit, rather than allowing for\ngradations based on correspondences between sequences of phonemes.\nWe take a quantitative approach to diachronic linguistics that alleviates this problem by operating\nat the phoneme level. Our approach combines the advantages of the classical, phoneme-based,\ncomparative method with the robustness of corpus-based probabilistic models. We focus on the\ncase where the words are etymological cognates across languages, e.g. French faire and Spanish\nhacer from Latin facere (to do). Following [3], we use this information to estimate a contextualized\nmodel of phonological change expressed as a probability distribution over rules applied to individual\nphonemes. The model is fully generative, and thus can be used to solve a variety of problems. For\nexample, we can reconstruct ancestral word forms or inspect the rules learned along each branch of\n\n1\n\n\fa phylogeny to identify sound laws. Alternatively, we can observe a word in one or more modern\nlanguages, say French and Spanish, and query the corresponding word form in another language,\nsay Italian. Finally, models of this kind can potentially be used as a building block in a system for\ninferring the topology of phylogenetic trees [3].\nIn this paper, we use this general approach to evaluate the performance of two different schemes for\nde\ufb01ning probability distributions over rules. The \ufb01rst scheme, used in [3], treats these distributions\nas simple multinomials and uses a Dirichlet prior on these multinomials. 
This approach makes it\ndif\ufb01cult to capture rules that apply at different levels of granularity.\nInspired by the prevalence\nof multi-scale rules in diachronic phonology and modern phonological theory, we develop a new\nscheme in which rules possess a set of features, and a distribution over rules is de\ufb01ned using a log-\nlinear model. We evaluate both schemes in reconstructing ancient word forms, showing that the new\nlinguistically-motivated change can improve performance signi\ufb01cantly.\n\n2 Background and previous work\n\nMost previous computational approaches to diachronic linguistics have focused on the reconstruc-\ntion of phylogenetic trees from a Boolean matrix indicating the properties of words in different\nlanguages [10, 6, 14, 13]. These approaches descend from glottochronology [15], which measures\nthe similarity between languages (and the time since they diverged) using the number of words in\nthose languages that belong to the same cognate set. This information is obtained from manually\ncurated cognate lists such as the data of [5]. The modern instantiations of this approach rely on so-\nphisticated techniques for inferring phylogenies borrowed from evolutionary biology (e.g., [11, 7]).\nHowever, they still generally use cognate sets as the basic data for evaluating the similarity between\nlanguages (although some approaches incorporate additional manually constructed features [14]).\nAs an example of a cognate set encoding, consider the meaning \u201ceat\u201d. There would be one column\nfor the cognate set which appears in French as manger and Italian as mangiare since both descend\nfrom the Latin mandere (to chew). There would be another column for the cognate set which appears\nin both Spanish and Portuguese as comer, descending from the Latin comedere (to consume). 
If\nthese were the only data, algorithms based on this data would tend to conclude that French and Italian\nwere closely related and that Spanish and Portuguese were equally related. However, the cognate\nset representation has several disadvantages: it does not capture the fact that the cognate is closer\nbetween Spanish and Portuguese than between French and Spanish, nor do the resulting models let\nus conclude anything about the regular processes which caused these languages to diverge. Also,\ncurating cognate data can be expensive. In contrast, each word in our work is tracked using an\nautomatically obtained cognate list. While these cognates may be noisier, we compensate for this\nby modeling phonological changes rather than Boolean mutations in cognate sets.\nAnother line of computational work has explored using phonological models as a way to capture\nthe differences between languages. [16] describes an information theoretic measure of the distance\nbetween two dialects of Chinese. They use a probabilistic edit model, but do not consider the recon-\nstruction of ancient word forms, nor do they present a learning algorithm for such models. There\nhave also been several approaches to the problem of cognate prediction in machine translation (es-\nsentially transliteration), e.g., [12]. Compared to our work, the phenomena of interest, and therefore\nthe models, are different. [12] presents a model for learning \u201csound laws,\u201d general phonological\nchanges governing two completely observed aligned cognate lists. This model can be viewed as a\nspecial case of ours using a simple two-node topology.\n\n3 A generative model of phonological change\n\nIn this section, we outline the framework for modeling phonological change that we will use through-\nout the paper. Assume we have a \ufb01xed set of word types (cognate sets) in our vocabulary V and a set\nof languages L. 
Each word type i has a word form wil in each language l ∈ L, which is represented as a sequence of phonemes which might or might not be observed. The languages are arranged according to some tree topology T (see Figure 2(a) for examples). It is possible to also induce the topology or cognate set assignments, but in this paper we assume that the topology is fixed and cognates have already been identified.\n\nFigure 1: (a) A description of the generative model. (b) An example of edits that were used to transform the Latin word focus (/fokus/) into the Italian word fuoco (/fwOko/) (fire) along with the context-specific rules that were applied. (c) The graphical model representation of our model: θ are the parameters specifying the stochastic edits e, which govern how the words w evolve.\n\nThe probabilistic model specifies a distribution over the word forms {wil} for each word type i ∈ V and each language l ∈ L via a simple generative process (Figure 1(a)). The generative process starts at the root language and generates all the word forms in each language in a top-down manner. The w ∼ LanguageModel distribution is a simple bigram phoneme model. A root word form w consisting of n phonemes x1 ··· xn is generated with probability p(w) = plm(x1) ∏_{j=2}^{n} plm(xj | xj−1), where plm is the distribution of the language model. The stochastic edit model w′ ∼ Edit(w, θ) describes how a single old word form w = x1 ··· xn changes along one branch of the phylogeny with parameters θ to produce a new word form w′. This process is parametrized by rule probabilities θk→l, which are specific to branch (k → l).\n\nThe generative process used in the edit model is as follows: for each phoneme xi in the old word form, walking from left to right, choose a rule to apply. 
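As a concrete illustration of the root language model above (a minimal sketch with invented names and toy numbers, not the authors' code), the bigram probability plm(x1) ∏_{j=2}^{n} plm(xj | xj−1) can be computed as:

```python
import math

def bigram_log_prob(phonemes, p_start, p_trans):
    """log [ p_lm(x1) * prod_{j=2}^n p_lm(x_j | x_{j-1}) ] for a root word form."""
    logp = math.log(p_start[phonemes[0]])
    for prev, cur in zip(phonemes, phonemes[1:]):
        logp += math.log(p_trans[(prev, cur)])
    return logp

# Toy distributions over a two-phoneme inventory (illustrative numbers only).
p_start = {"f": 0.6, "o": 0.4}
p_trans = {("f", "o"): 0.9, ("f", "f"): 0.1, ("o", "f"): 0.5, ("o", "o"): 0.5}
assert abs(bigram_log_prob(["f", "o"], p_start, p_trans) - math.log(0.6 * 0.9)) < 1e-12
```

Working in log space avoids underflow for longer phoneme sequences.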
There are three types of rules: (1) deletion of the phoneme, (2) substitution with some phoneme (possibly the same one), or (3) insertion of another phoneme, either before or after the existing one. The probability of applying a rule depends on the context (xi−1, xi+1). Context-dependent rules are often used to characterize phonological changes in diachronic linguistics [4]. Figure 1(b) shows an example of the rules being applied. The context-dependent form of these rules allows us to represent phenomena such as the likely deletion of s in word-final positions.\n\n4 Defining distributions over rules\n\nIn the model defined in the previous section, each branch (k → l) ∈ T has a collection of context-dependent rule probabilities θk→l. Specifically, θk→l specifies a collection of multinomial distributions, one for each C = (cl, x, cr), where cl is the left phoneme, x is the old phoneme, and cr is the right phoneme. Each multinomial distribution is over possible right-hand sides α of the rule, which could consist of 0, 1, or 2 phonemes. We write θk→l(C, α) for the probability of rule x → α / cl _ cr.\n\nPrevious work using this probabilistic framework simply placed independent Dirichlet priors on each of the multinomial distributions [3]. While this choice results in a simple estimation procedure, it has some severe limitations. Sound changes happen at many granularities. For example, from Latin to Vulgar Latin, u → o occurs in many contexts while s → ∅ occurs only in word-final contexts. Using independent Dirichlets forces us to commit to a single context granularity for C. Since the different multinomial distributions are not tied together, generalization becomes very difficult, especially as data is limited. 
It is also difficult to interpret the learned rules, since the evidence for a coarse phenomenon such as u → o would be unnecessarily fragmented across many different context-dependent rules. We would like to ideally capture a phenomenon using a single rule or feature. We could relate the rule probabilities via a simple hierarchical Bayesian model, but we would still have to define a single hierarchy of contexts. This restriction might be inappropriate given that sound changes often depend on different contexts that are not necessarily nested.\n\nFor these reasons, we propose using a feature-based distribution over the rule probabilities. Let F(C, α) be a feature vector that depends on the context-dependent rule (C, α), and λk→l be the log-linear weights for branch (k → l). We use a Normal prior on the log-linear weights, λk→l ∼ N(0, σ²I). The rule probabilities are then deterministically related to the weights via the softmax function:\n\nθk→l(C, α; λk→l) = exp(λk→lᵀ F(C, α)) / Σ_{α′} exp(λk→lᵀ F(C, α′)).   (1)\n\nFor each rule x → α / cl _ cr, we defined features based on whether x = α (i.e. 
self-substitution),\nand whether |\u03b1| = n for each n = 0, 1, 2 (corresponding to deletion, substitution, and insertion).\nWe also de\ufb01ned sets of features using three partitions of phonemes c into \u201cnatural classes\u201d. These\ncorrespond to looking at the place of articulation (denoted A2(c)), testing whether c is a vowel,\nconsonant, or boundary symbol (A1(c)), and the trivial wildcard partition (A0(c)), which allows\nrules to be insensitive to c. Using these partitions, the \ufb01nal set of features corresponded to whether\nAkl(cl) = al and Akr(cr) = ar for each type of partitioning kl, kr \u2208 {0, 1, 2} and natural classes\nal, ar.\nThe move towards using a feature-based scheme for de\ufb01ning rule probabilities is not just motivated\nby the greater expressive capacity of this scheme. It also provides a connection with contemporary\nphonological theory. Recent work in computational linguistics on probabilistic forms of optimality\ntheory has begun to use a similar approach, characterizing the distribution over word forms within a\nlanguage using a log-linear model applied to features of the words [17, 9]. Using similar features to\nde\ufb01ne a distribution over phonological changes thus provides a connection between synchronic and\ndiachronic linguistics in addition to a linguistically-motivated method for improving reconstruction.\n\n5 Learning and inference\n\nWe use a Monte Carlo EM algorithm to \ufb01t the parameters of both models. The algorithm iterates\nbetween a stochastic E-step, which computes reconstructions based on the current edit parameters,\nand an M-step, which updates the edit parameters based on the reconstructions.\n\n5.1 Monte Carlo E-step: sampling the edits\n\nThe E-step computes the expected suf\ufb01cient statistics required for the M-step, which in our case is\nthe expected number of times each edit (such as o \u2192 O) was used in each context. 
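The feature-based rule distribution of Eq. (1) in Section 4 can be sketched as follows (a toy illustration with invented feature names, not the authors' implementation):

```python
import math

def rule_distribution(weights, feats_by_alpha):
    """theta(C, alpha; lambda): softmax of lambda . F(C, alpha) over the
    candidate right-hand sides alpha, as in Eq. (1)."""
    scores = {a: sum(weights.get(f, 0.0) for f in fs)
              for a, fs in feats_by_alpha.items()}
    m = max(scores.values())                    # stabilize the exponentials
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# Toy candidates for a rule s -> alpha in word-final context (feature names invented).
feats_by_alpha = {
    "": ["deletion", "right-is-boundary"],      # deletion of s before #
    "s": ["self-substitution"],
    "z": ["substitution"],
}
weights = {"deletion": 1.0, "right-is-boundary": 0.5, "self-substitution": 0.8}
theta = rule_distribution(weights, feats_by_alpha)
```

Because several features can fire on one rule, a single weight such as the invented "deletion" feature here can shift probability mass across many contexts at once, which is the multi-scale sharing the Dirichlet scheme lacks.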
Note that the sufficient statistics do not depend on the prior over rule probabilities; in particular, both the model based on independent Dirichlet priors and the one based on a log-linear prior require the same E-step computation.\n\nAn exact E-step would require summing over all possible edits involving all languages in the phylogeny (all unobserved {e}, {w} variables in Figure 1(c)), which does not permit a tractable dynamic program. Therefore, we resort to a Monte Carlo E-step, where many samples of the edit variables are collected, and counts are computed based on these samples. Samples are drawn using Gibbs sampling [8]: for each word form of a particular language wil, we fix all other variables in the model and sample wil along with its corresponding edits.\n\nConsider the simple four-language topology in Figure 1(c). Suppose that the words in languages A, C and D are fixed, and we wish to sample the word at language B along with the three corresponding sets of edits (remember that the edits fully determine the words). While there are an exponential number of possible words/edits, we can exploit the Markov structure in the edit model to consider all such words/edits using dynamic programming, in a way broadly similar to the forward-backward algorithm for HMMs. See [3] for details of the dynamic program.\n\nExperiment | Topology | Model | Heldout\nLatin reconstruction (6.1) | 1 | Dirichlet | la:293\nLatin reconstruction (6.1) | 1 | Log-linear | la:293\nSound changes (6.2) | 2 | Log-linear | None\n\n(a) Topologies   (b) Experimental conditions\n\nFigure 2: Conditions under which each of the experiments presented in this section were performed. The topology indices correspond to those displayed at the left. The heldout column indicates how many words, if any, were held out for edit distance evaluation, and from which language. 
All the experiments were run on a data set of 582 cognates from [3].\n\n5.2 M-step: updating the parameters\n\nIn the M-step, we estimate the distribution over rules for each branch (k → l). In the Dirichlet model, this can be done in closed form [3]. In the log-linear model, we need to optimize the feature weights λk→l. Let us fix a single branch and drop the subscript. Let N(C, α) be the expected number of times the rule (C, α) was used in the E-step. Given these sufficient statistics, the estimate of λ is given by optimizing the expected complete log-likelihood plus the regularization penalty from the prior on λ,\n\nO(λ) = Σ_{C,α} N(C, α) [ λᵀF(C, α) − log Σ_{α′} exp(λᵀF(C, α′)) ] − ||λ||² / (2σ²).   (2)\n\nWe use L-BFGS to optimize this convex objective, which only requires the partial derivatives:\n\n∂O(λ)/∂λj = Σ_{C,α} N(C, α) [ Fj(C, α) − Σ_{α′} θ(C, α′; λ) Fj(C, α′) ] − λj/σ²   (3)\n= F̂j − Σ_{C,α′} N(C, ·) θ(C, α′; λ) Fj(C, α′) − λj/σ²,   (4)\n\nwhere F̂j := Σ_{C,α} N(C, α) Fj(C, α) is the empirical feature vector and N(C, ·) := Σα N(C, α) is the number of times context C was used. F̂j and N(C, ·) do not depend on λ and thus can be precomputed at the beginning of the M-step, thereby speeding up each L-BFGS iteration.\n\n6 Experiments\n\nIn this section, we summarize the results of the experiments testing our different probabilistic models of phonological change. The experimental conditions are summarized in Figure 2(b). 
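The M-step objective and gradient of Section 5.2 can be sanity-checked with a minimal pure-Python sketch (invented names and toy data; a real implementation would hand the objective and gradient to L-BFGS as described above):

```python
import math

def m_step_objective_and_grad(lam, counts, feats, sigma2):
    """Expected complete log-likelihood plus L2 penalty (Eq. 2) and its
    gradient (Eqs. 3-4).  counts[c][a] = N(C, alpha); feats[c][a] = F(C, alpha)."""
    J = len(lam)
    obj = -sum(l * l for l in lam) / (2 * sigma2)
    grad = [-l / sigma2 for l in lam]
    for N_c, F_c in zip(counts, feats):
        scores = [sum(l * f for l, f in zip(lam, F)) for F in F_c]
        log_z = math.log(sum(math.exp(s) for s in scores))
        theta = [math.exp(s - log_z) for s in scores]  # softmax over alphas
        n_total = sum(N_c)                             # N(C, .)
        for n, s, F in zip(N_c, scores, F_c):
            obj += n * (s - log_z)
            for j in range(J):
                grad[j] += n * F[j]                    # empirical counts F_hat
        for t, F in zip(theta, F_c):
            for j in range(J):
                grad[j] -= n_total * t * F[j]          # expected counts
    return obj, grad

# Finite-difference check of the gradient on a tiny instance.
counts = [[2.0, 1.0], [0.5, 3.0]]
feats = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.5]]]
lam = [0.3, -0.2]
obj, grad = m_step_objective_and_grad(lam, counts, feats, 1.0)
eps = 1e-6
for j in range(2):
    bumped = list(lam)
    bumped[j] += eps
    fd = (m_step_objective_and_grad(bumped, counts, feats, 1.0)[0] - obj) / eps
    assert abs(fd - grad[j]) < 1e-4
```

The finite-difference loop at the end is exactly the kind of check one would run before trusting Eq. (4) inside an optimizer.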
Training and test data sets were taken from [3].\n\n6.1 Reconstruction of ancient word forms\n\nWe ran the two models using Topology 1 in Figure 2 to assess the relative performance of Dirichlet-parametrized versus log-linear-parametrized models. Half of the Latin words at the root of the tree were held out, and the (uniform cost) Levenshtein edit distance from the predicted reconstruction to the truth was computed. While the uniform-cost edit distance misses important aspects of phonology (all phoneme substitutions are not equal, for instance), it is parameter-free and still seems to correlate to a large extent with linguistic quality of reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness of each reconstructed word.\n\nModel | Baseline | Model | Improvement\nDirichlet | 3.59 | 3.33 | 7%\nLog-linear (0) | 3.59 | 3.21 | 11%\nLog-linear (0,1) | 3.59 | 3.14 | 12%\nLog-linear (0,1,2) | 3.59 | 3.10 | 14%\n\nTable 1: Results of the edit distance experiment. We show the mean edit distance across the evaluation examples. Improvement rate is computed by comparing the score of the algorithm against the baseline described in Section 6.1. The numbers in parentheses for the log-linear model indicate which levels of granularity were used to construct the features (see Section 4).\n\nFigure 3: An example of the proper Latin reconstruction given the Spanish and Italian word forms. Our model produces /dEntes/, which is nearly correct, capturing two out of three of the phenomena.\n\nWe ran EM for 10 iterations for each model, and evaluated performance via a Viterbi derivation produced using these parameters. 
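The uniform-cost Levenshtein distance used for this evaluation is a standard dynamic program; a minimal sketch (not the paper's code):

```python
def levenshtein(a, b):
    """Uniform-cost edit distance between two phoneme strings."""
    prev = list(range(len(b) + 1))        # distances from a[:0] to prefixes of b
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

# Latin /dEntis/ vs. the reconstruction /dEntes/: one substitution (i -> e).
assert levenshtein("dEntis", "dEntes") == 1
```

With all edit costs equal to 1 this is exactly the parameter-free metric described above; weighting substitutions by phonetic similarity would address the caveat that not all phoneme substitutions are equal.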
Our baseline for comparison was picking randomly, for each heldout node in the tree, an observed neighboring word (i.e., copying one of the modern forms). Both models outperformed this baseline (see Table 1), and the log-linear model outperformed the Dirichlet model, suggesting that the featurized system better captures the phonological changes. Moreover, adding more features further improved the performance, indicating that being able to express rules at multiple levels of granularity allows the model to capture the underlying phonological changes more accurately.\n\nTo give a qualitative feel for the operation of the system (good and bad), consider the example in Figure 3, taken from the Dirichlet-parametrized experiment. The Latin dentis /dEntis/ (teeth) is nearly correctly reconstructed as /dEntes/, reconciling the appearance of the /j/ in the Spanish and the disappearance of the final /s/ in the Italian. Note that the /is/ vs. /es/ ending is difficult to predict in this context (indeed, it was one of the early distinctions to be eroded in Vulgar Latin).\n\n6.2 Inference of phonological changes\n\nAnother use of this model is to automatically recover the phonological drift processes between known or partially-known languages. To facilitate evaluation, we continued in the well-studied Romance evolutionary tree. Again, the root is Latin, but we now add an additional modern language, Portuguese, and two additional hidden nodes. One of the nodes characterizes the least common ancestor of modern Spanish and Portuguese; the other, the least common ancestor of all three modern languages. In Figure 2, Topology 2, these two nodes are labeled vl (Vulgar Latin) and ib (Proto-Ibero Romance), respectively. Since we are omitting many other branches, these names should not be understood as referring to actual historical proto-languages, but, at best, to collapsed points representing several centuries of evolution. 
Nonetheless, the major reconstructed rules still correspond to well-known phenomena and the learned model generally places them on reasonable branches.\n\nFigure 4 shows the top four general rules for each of the evolutionary branches recovered by the log-linear model. The rules are ranked by the number of times they were used in the derivations during the last iteration of EM. The la, es, pt, and it forms are fully observed while the vl and ib forms are automatically reconstructed. Figure 4 also shows a specific example of the evolution of the Latin VERBUM (word), along with the specific edits employed by the model.\n\nFigure 4: The tree shows the system's hypothesized transformation of a selected Latin word form, VERBUM (word), into the modern Spanish, Italian, and Portuguese pronunciations. The Latin root and modern leaves were observed while the hidden nodes as well as all the derivations were obtained using the parameters computed by our model after 10 iterations of EM. Nontrivial rules (i.e. rules that are not identities) used at each stage are shown along the corresponding edge. The boxes display the top four nontrivial rules corresponding to each of these evolutionary branches, ordered by the number of times they were applied during the last E step. These are grouped and labeled by their active feature of highest weight. ALV stands for alveolar consonant.\n\nFor this particular example, both the Dirichlet and the log-linear models produced the same reconstruction in the internal nodes. However, the log-linear parametrization makes inspection of sound laws easier. Indeed, with the Dirichlet model, since the natural classes are of fixed granularity, some rules must be redundantly discovered, which tends to flood the top of the rule lists with duplicates. In contrast, the log-linear model groups rules with features of the appropriate degree of generality.\n\nWhile quantitative evaluation such as measuring edit distance is helpful for comparing results, it is also illuminating to consider the plausibility of the learned parameters in a historical light, which we do here briefly. In particular, we consider rules on the branch between la and vl, for which we have historical evidence. For example, documents such as the Appendix Probi [2] provide indications of orthographic confusions which resulted from the growing gap between Classical Latin and Vulgar Latin phonology around the 3rd and 4th centuries AD. The Appendix lists common misspellings of Latin words, from which phonological changes can be inferred.\n\nOn the la to vl branch, rules for word-final deletion of classical case markers dominate the list. It is indeed likely that these were generally eliminated in Vulgar Latin. For the deletion of the /m/, the Appendix Probi contains pairs such as PASSIM NON PASSI and OLIM NON OLI. For the deletion of final /s/, this was observed in early inscriptions, e.g. CORNELIO for CORNELIOS [1]. The frequent leveling of the distinction between /o/ and /u/ (which was ranked 5, but was not included for space reasons) can also be found in the Appendix Probi: COLUBER NON COLOBER. Note that in the specific example shown, the model lowers the original /u/ and then re-raises it in the pt branch due to a later process along that branch.\n\nSimilarly, major canonical rules were discovered in other branches as well, for example, /v/ to /b/ fortition in Spanish, palatalization along several branches, and so on. Of course, the recovered words and rules are not perfect. 
For example, the reconstructed Ibero /trinta/ to Spanish /treinta/ (thirty) is generated in an odd fashion using rules /e/ to /i/ and /n/ to /in/. In the Dirichlet model, even when otherwise reasonable systematic sound changes are captured, the crudeness of the fixed-granularity contexts can prevent the true context from being captured, resulting in either rules applying with low probability in overly coarse environments or rules being learned redundantly in overly fine environments. The featurized model alleviates this problem.\n\n7 Conclusion\n\nProbabilistic models have the potential to replace traditional methods used for comparing languages in diachronic linguistics with quantitative methods for reconstructing word forms and inferring phylogenies. In this paper, we presented a novel probabilistic model of phonological change, in which the rules governing changes in the sound of words are parametrized using the features of the phonemes involved. This model goes beyond previous work in this area, providing more accurate reconstructions of ancient word forms and connections to current work on phonology in synchronic linguistics. Using a log-linear model to define the probability of a rule being applied results in a straightforward inference procedure which can be used to both produce accurate reconstructions as measured by edit distance and identify linguistically plausible rules that account for phonological changes. 
We believe that this probabilistic approach has the potential to support quantitative analysis of the history of languages in a way that can scale to large datasets while remaining sensitive to the concerns that have traditionally motivated diachronic linguistics.\n\nAcknowledgments We would like to thank Bonnie Chantarotwong for her help with the IPA converter and our reviewers for their comments. This work was supported by a FQRNT fellowship to the first author, an NDSEG fellowship to the second author, NSF grant number BCS-0631518 to the third author, and a Microsoft Research New Faculty Fellowship to the fourth author.\n\nReferences\n[1] W. Sidney Allen. Vox Latina: The Pronunciation of Classical Latin. Cambridge University Press, 1989.\n[2] W.A. Baehrens. Sprachlicher Kommentar zur vulgärlateinischen Appendix Probi. Halle (Saale) M. Niemeyer, 1922.\n[3] A. Bouchard-Côté, P. Liang, T. Griffiths, and D. Klein. A Probabilistic Approach to Diachronic Phonology. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), 2007.\n[4] L. Campbell. Historical Linguistics. The MIT Press, 1998.\n[5] I. Dyen, J.B. Kruskal, and P. Black. FILE IE-DATA1. Available at http://www.ntu.edu.au/education/langs/ielex/IE-DATA1, 1997.\n[6] S. N. Evans, D. Ringe, and T. Warnow. Inference of divergence times as a statistical inverse problem. In P. Forster and C. Renfrew, editors, Phylogenetic Methods and the Prehistory of Languages. McDonald Institute Monographs, 2004.\n[7] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.\n[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.\n[9] S. Goldwater and M. Johnson. Learning OT constraint rankings using a maximum entropy model. 
Proceedings of the Workshop on Variation within Optimality Theory, 2003.\n[10] R. D. Gray and Q. Atkinson. Language-tree divergence times support the Anatolian theory of Indo-European origins. Nature, 2003.\n[11] J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 2001.\n[12] G. Kondrak. Algorithms for Language Reconstruction. PhD thesis, University of Toronto, 2002.\n[13] L. Nakhleh, D. Ringe, and T. Warnow. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language, 81:382–420, 2005.\n[14] D. Ringe, T. Warnow, and A. Taylor. Indo-European and computational cladistics. Transactions of the Philological Society, 100:59–129, 2002.\n[15] M. Swadesh. Towards greater accuracy in lexicostatistic dating. Journal of American Linguistics, 21:121–137, 1955.\n[16] A. Venkataraman, J. Newman, and J.D. Patrick. A complexity measure for diachronic Chinese phonology. In J. Coleman, editor, Computational Phonology. Association for Computational Linguistics, 1997.\n[17] C. Wilson and B. Hayes. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 2007.\n", "award": [], "sourceid": 887, "authors": [{"given_name": "Alexandre", "family_name": "Bouchard-Côté", "institution": null}, {"given_name": "Percy", "family_name": "Liang", "institution": null}, {"given_name": "Dan", "family_name": "Klein", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}