{"title": "Synergies in learning words and their referents", "book": "Advances in Neural Information Processing Systems", "page_first": 1018, "page_last": 1026, "abstract": "This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these.", "full_text": "Synergies in learning words and their referents\n\nMark Johnson\n\nDepartment of Computing\n\nMacquarie University\nSydney, NSW 2109\n\nMark.Johnson@mq.edu.au\n\nKatherine Demuth\n\nDepartment of Linguistics\n\nMacquarie University\nSydney, NSW 2109\n\nKatherine.Demuth@mq.edu.au\n\nMichael Frank\n\nDepartment of Psychology\n\nStanford University\nPalo Alto, CA 94305\nmcfrank@mit.edu\n\nBevan K. 
Jones\n\nSchool of Informatics\nUniversity of Edinburgh\n\n10 Crichton Street, Edinburgh EH8 9AB, UK\n\nB.K.Jones@sms.ed.ac.uk\n\nAbstract\n\nThis paper presents Bayesian non-parametric models that simultaneously learn to\nsegment words from phoneme strings and learn the referents of some of those\nwords, and shows that there is a synergistic interaction in the acquisition of these\ntwo kinds of linguistic information. The models themselves are novel kinds of\nAdaptor Grammars that are an extension of an embedding of topic models into\nPCFGs. These models simultaneously segment phoneme sequences into words\nand learn the relationship between non-linguistic objects and the words that refer to\nthem. We show (i) that modelling inter-word dependencies not only improves the\naccuracy of the word segmentation but also of word-object relationships, and (ii)\nthat a model that simultaneously learns word-object relationships and word\nsegmentation segments more accurately than one that just learns word segmentation\non its own. We argue that these results support an interactive view of language\nacquisition that can take advantage of synergies such as these.\n\n1 Introduction\n\nConventional views of language acquisition often assume that human language learners initially\nuse a single source of information to acquire one component of language, which they then use to\nleverage the acquisition of other linguistic components. For example, Kuhl [1] presents a standard\n\u201cbootstrapping\u201d view of early language acquisition in which successively more difficult tasks are\naddressed by learners, beginning with phoneme inventory and progressing to word segmentation\nand word learning. This view is also taken implicitly by, e.g., Graf Estes et al [2], who showed that\ninfants were more successful in mapping novel objects to novel words after those words had been\nsuccessfully segmented from the speech stream. 
We contrast this view with an \u201cinteractive\u201d view of\nlanguage acquisition in which learners do not move from problem to problem, but instead attempt to\nlearn all of the components of language at once. Computationally speaking, an interactive account\nviews language acquisition as a joint inference problem for all components of language\nsimultaneously, rather than a discrete sequence of inference problems for individual language components.\n(We are thus using \u201cinteractive\u201d to refer to the way that language acquisition is formulated as an\ninference problem, rather than a specific mechanism or architecture as in [3]).\n\n[Figure 1 (photograph omitted). Model input: PIG|DOG i z D & t D e p I g. Correct analysis, not given to the model: i z / D & t / D e / p I g, with p I g mapped to PIG.]\n\nFigure 1: The photograph indicates non-linguistic context containing the (toy) pig and dog for the\nutterance Is that the pig?. Below that, we show the input provided to our models representing\nthis utterance [8]. The objects in the non-linguistic context are indicated by the prefix \u201cPIG|DOG\u201d,\nwhich is followed by the unsegmented phonemicised input. The possible word segmentation points\nare indicated by separators between the phonemes. The correct analysis of this input (which is not\nprovided to the model) is depicted by blue annotations to this input. The correct word segmentation\nis indicated by the filled blue word separators, and the mapping between words and non-linguistic\nobjects is indicated by the underbrace subscript.\n\nOne advantage of an interactive approach is that it can take advantage of synergies in acquisition,\ni.e., situations where partial knowledge of several different aspects of language mutually aids their\nacquisition, i.e., where improvements in the acquisition of component A also improve the\nacquisition of component B, and improvements in the acquisition of component B also improve the\nacquisition of component A. 
An interactive approach can take advantage of both of these, while a\nstaged approach to acquisition in which A is learned before B forgoes the ability to use knowledge of\nB to help learn A.\nIn this paper we focus on the acquisition of two of the simpler aspects of language: (i) segmenting\nsentences into words (thereby identifying their pronunciations), and (ii) the relationship between\nwords and the objects they refer to. We present a sequence of models for inferring (i) and (ii), and\ndemonstrate synergistic interactions in learning. Specifically, we show that (i) modifying the model\nin a way that improves its word segmentation ability also improves its ability to identify the intended\nreferents of utterances, and that (ii) incorporating a more sophisticated model of the relationship\nbetween words and the objects they refer to also improves the model\u2019s ability to segment words.\nThe acquisition of word pronunciations is viewed as a segmentation problem as follows. Following\nElman [4] and Brent [5, 6], a corpus of child-directed speech is \u201cphonemicised\u201d by looking each\nword up in a pronouncing dictionary and concatenating those pronunciations. For example, the\nmother\u2019s utterance Is that the pig is mapped to the broad phonemic representation Iz D&t D6 pIg (in\nan ASCII-based broad phonemic encoding), which is then concatenated to form IzD&tD6pIg. The\nword segmentation task is to segment a corpus of such unsegmented utterance representations into\nwords, thus identifying the pronunciations of the words in the corpus.\nWe study the acquisition of the relationship between words and the objects they refer to using the\nframework proposed by Frank et al [7]. Here each utterance in the corpus is labelled with the\ncontextually-relevant objects that the speaker might be referring to. These are determined by\ninspecting videos of the utterance context. 
For example, in the context of Figure 1, the utterance\nwould be labelled with the two contextually-relevant objects PIG and DOG. The learner\u2019s task is to\nidentify which words, if any, in the utterance refer to each of these objects.\nJones et al [8] combined the word segmentation and word reference tasks into a single inference\ntask, where the goal is to simultaneously segment the utterance into words, and to map a subset of\nthe words of each utterance to the utterance\u2019s contextually-relevant objects. This is the task that we\ninvestigate in this paper.\nThe rest of this paper is structured as follows. The next section summarises previous work on word\nsegmentation and learning the relationship between words and their referents. Section 3 introduces\nAdaptor Grammars, explains how they can be used for word segmentation and topic modelling, and\npresents the Adaptor Grammars that will be used in this paper. Section 4 presents experimental\nresults showing synergistic interactions between word segmentation and learning the relationship\nbetween words and the objects they refer to, while section 5 summarises and concludes the paper.\n\n2 Previous work\n\nWord segmentation has been studied from a wide variety of computational perspectives. Elman [4]\nand Brent [5, 6] introduced the basic word segmentation paradigm investigated here. Goldwater\net al [9] introduced a non-parametric model of word segmentation based on Hierarchical Dirichlet\nProcesses (HDPs) [10], and demonstrated that a bigram model, which captures dependencies\nbetween adjacent words, produces significantly more accurate segmentations than a unigram model,\nwhich assumes each word in a sentence is generated independently. Because the unigram model\nmakes the \u201cbag of words\u201d assumption it has no way to capture inter-word dependencies. 
Because\nthere are strong inter-word dependencies in real language, e.g., a noun like ball is very likely to be\npreceded by determiners the or a, a unigram model tends to undersegment, e.g., misanalyse the ball\nas a single word. The bigram model, because it explicitly models and hence can \u201cexplain away\u201d the\ndependency between the and ball, is more likely to correctly segment this example.\nJohnson et al [11] introduced a generalisation of Probabilistic Context-Free Grammars (PCFGs)\ncalled Adaptor Grammars (AGs) as a framework for specifying HDPs for linguistic applications\n(because this paper relies heavily on AGs we describe them in more detail in section 3 below).\nJohnson [12] investigated AGs for word segmentation that capture a range of different kinds of\ngeneralisations. The unigram AG replicates the unigram segmentation model of Goldwater et al,\nand suffers from the same undersegmentation problems. It turns out that it is not possible to express\nGoldwater et al\u2019s bigram model as an AG, but a collocation AG, which is an HDP that generates\na sentence as a sequence of collocations where each collocation is a sequence of words, captures\nsimilar inter-word dependencies and produces very similar word segmentation results.\nThe acquisition of the mapping between words and the objects they refer to was studied by Frank\net al [7]. They used a modified version of the LDA topic model [13] where the \u201ctopics\u201d are\ncontextually-relevant objects that words in the utterance can refer to, so the mapping from\n\u201ctopics\u201d to words effectively specifies which words refer to these contextually-salient objects. 
Jones\net al [8] integrated the Frank et al \u201ctopic\u201d model of the word-object relationship with the unigram\nmodel of Goldwater et al to obtain a joint model that both performs word segmentation and also\nlearns which words refer to which contextually-salient objects.\nJohnson [14] explains how LDA topic models can be expressed as PCFGs. We use this reduction\nto express the Frank et al [7] models of the word-object relationship as AGs which also incorporate\nJohnson\u2019s [12] models of word segmentation. The resulting AGs can express a wide range of joint\nHDP models of word segmentation and the word-object relationship, including the model proposed\nby Jones et al [8], as well as several generalisations.\n\n3 Adaptor grammars for segmentation and word-object acquisition\n\nThis section provides an informal introduction to Adaptor Grammars (AGs) and how they can be\nused to express word segmentation and topic models, and presents the AGs for joint segmentation\nand acquisition of the word-object relationship. For more detail on the formal properties of AGs see\n[11], and for information on AG inference procedures see [15, 16].\n\n3.1 Probabilistic Context-Free Grammars\n\nAdaptor Grammars (AGs) are an extension of Probabilistic Context-Free Grammars (PCFGs), which\nwe describe first. A Context-Free Grammar (CFG) G = (N, W, R, S) consists of disjoint finite sets\nof nonterminal symbols N and terminal symbols W, a finite set of rules R of the form A \u2192 \u03b1 where\nA \u2208 N and \u03b1 \u2208 (N \u222a W)*, and a start symbol S \u2208 N. (We assume there are no \u201c\u03b5-rules\u201d in R, i.e.,\nwe require that |\u03b1| \u2265 1 for each A \u2192 \u03b1 \u2208 R).\nA CFG G generates a set of finite, labelled, ordered trees TX for each X \u2208 N \u222a W. If X \u2208 W (i.e.,\nX is a terminal) then TX = {X}, i.e., the singleton set consisting of a one-node tree labelled X. 
If\nX \u2208 N then TX consists of all trees t whose root node is labelled X, each leaf node\u2019s label is in\nW, each non-leaf node\u2019s label is in N, and for each non-leaf node x in t with label A \u2208 N there is\na rule A \u2192 \u03b1 \u2208 R such that the sequence of labels of x\u2019s children is \u03b1. The set of strings generated\nby G is the set of yields of TS, where the yield of a tree is the sequence of its leaf nodes\u2019 labels.\nA Probabilistic Context-Free Grammar (PCFG) is a quintuple (N, W, R, S, \u03b8) where (N, W, R, S) is\na CFG and \u03b8 is a vector of non-negative reals indexed by R that satisfy\n\n\\sum_{A \\to \\alpha \\in R_A} \\theta_{A \\to \\alpha} = 1\n\nfor each A \u2208 N, where RA = {A \u2192 \u03b1 : A \u2192 \u03b1 \u2208 R} is the set of rules expanding A.\nInformally, \u03b8A\u2192\u03b1 is the probability of a node labelled A expanding to a sequence of nodes labelled\n\u03b1, and the probability of a tree is the product of the probabilities of the rules used to construct each\nnon-leaf node in it. More precisely, for each X \u2208 N \u222a W a PCFG associates distributions GX over\nthe trees TX as follows:\nIf X \u2208 W (i.e., if X is a terminal) then GX is the distribution that puts probability 1 on the\nsingle-node tree labelled X. If X \u2208 N (i.e., if X is a nonterminal) then:\n\nG_X = \\sum_{X \\to B_1 \\ldots B_n \\in R_X} \\theta_{X \\to B_1 \\ldots B_n} \\, \\mathrm{TD}_X(G_{B_1}, \\ldots, G_{B_n}) \\qquad (1)\n\nwhere:\n\n\\mathrm{TD}_A(G_1, \\ldots, G_n)\\bigl( A(t_1, \\ldots, t_n) \\bigr) = \\prod_{i=1}^{n} G_i(t_i)\n\nand A(t_1, \\ldots, t_n) denotes the tree whose root is labelled A and whose immediate subtrees are t1, . . . , tn.\n\nThat is, TDA(G1, . . . , Gn) is a distribution over TA where each subtree ti is generated\nindependently from Gi. 
The PCFG generates the distribution GS over the trees TS, where S is the start\nsymbol; the distribution over the strings it generates is obtained by marginalising over the trees.\nIn a Bayesian PCFG one puts Dirichlet priors Dir(\u03b1) on the rule probability vector \u03b8, such that\nthere is one Dirichlet parameter \u03b1A\u2192\u03b1 for each rule A \u2192 \u03b1 \u2208 R. In the \u201cunsupervised\u201d inference\nproblem for a PCFG one is given a CFG, parameters \u03b1 for Dirichlet priors over the rule\nprobabilities, and a corpus of strings. The task is to infer the corresponding posterior distribution over rule\nprobabilities \u03b8. Recently Bayesian inference algorithms for PCFGs have been described. Kurihara\net al [17] describe a Variational Bayes algorithm for inferring PCFGs using a mean-field\napproximation, while Johnson et al [18] describe a Markov Chain Monte Carlo algorithm based on Gibbs\nsampling.\n\n3.2 Modelling word-object reference using PCFGs\n\nThis section presents a novel encoding of a Frank et al [7] model for identifying word-object\nrelationships as a PCFG. It is an adaptation of the reduction of LDA topic models to PCFGs given\nby Johnson [14]. That paper showed how to construct a PCFG that generates the same distribution\nover a collection of documents as an LDA model, and where Bayesian inference for the PCFG\u2019s\nrule probabilities yields the corresponding distributions as Bayesian inference of the corresponding\nLDA models. Because the Frank et al [7] model of the word-object relationship is very similar to an\nLDA topic model, we can use the same techniques to design Bayesian PCFGs that infer word-object\nrelationships.\nThe models we investigate in this paper assume that the words in a single sentence refer to at most\none non-linguistic object (although it would be easy to relax this restriction). 
In this subsection\nwe assume that the vocabulary V (i.e., a set of words) is given, as is the set O of objects that they\ncan refer to. Let O\u2032 = O \u222a {\u2205}, where \u2205 is a distinguished \u201cnull object\u201d not in O, and let the\nnonterminals N = {S} \u222a {Ao, Bo : o \u2208 O\u2032}, where Ao and Bo are nonterminals indexed by the\nobjects o \u2208 O\u2032. Informally, a nonterminal Bo expanding to word w \u2208 V indicates that w refers to object o,\nwhile a B\u2205 expanding to w indicates that w is non-referential.\nThe set of objects in the non-linguistic context of an utterance is indicated by prefixing the utterance\nwith a context identifier associated with those objects, such as \u201cPIG|DOG\u201d in Figure 1. A context\nidentifier c is a subset of O\u2032 that contains \u2205 (i.e., the null object is always in context). We assume\nwe are given a (non-empty) set C of context identifiers disjoint from V. Then the terminals of the\nPCFG are W = V \u222a C, and the rules R of the PCFG are all instances of the following schemata:\n\nS \u2192 Ao          o \u2208 O\u2032\nAo \u2192 c          c \u2208 C, o \u2208 c\nAo \u2192 Ao Bo      o \u2208 O\u2032\nAo \u2192 Ao B\u2205      o \u2208 O\u2032\nBo \u2192 w          o \u2208 O\u2032, w \u2208 V          (2)\n\n[Figure 2 tree, in bracketed form: (S (Apig (Apig (Apig (Apig (Apig PIG|DOG) (B\u2205 is)) (B\u2205 that)) (B\u2205 the)) (Bpig pig)))]\n\nFigure 2: A tree generated by the reference PCFG encoding a Frank et al [7] model of the\nword-object relationship. The yield of this tree corresponds to the sentence Is that the pig, and the context\nidentifier is \u201cPIG|DOG\u201d.\n\nWe call this the reference PCFG because it generates word-object reference pairs. An example of a\ntree generated by this grammar is shown in Figure 2. This grammar generates sentences consisting\nof a context identifier followed by a sequence of words; e.g. 
PIG|DOG is that the pig. Informally, the\nrule expanding S picks an object o that the words in the sentence can refer to (if o = \u2205 then all words\nin the sentence are non-referential). The first rule expanding Ao ensures that o is a member of that\nsentence\u2019s non-linguistic context, the second rule generates a Bo that will ultimately generate a word\nw (which we take to indicate that w refers to o), while the third rule generates a word associated\nwith the null object \u2205.\nA slightly more complicated PCFG, which we call the reference1 grammar, can enforce the\nrequirement that there is at most one referential word in each sentence. This constraint often holds in\nthe simple sentences that appear in infant-directed speech (e.g., in Is that the pig?, the pig is only\nmentioned once).\n\nS \u2192 S B\u2205\nS \u2192 c           c \u2208 C\nS \u2192 Ao Bo       o \u2208 O\nAo \u2192 c          c \u2208 C, o \u2208 c\nAo \u2192 Ao B\u2205      o \u2208 O\nBo \u2192 w          o \u2208 O\u2032, w \u2208 V          (3)\n\nIn this grammar the nonterminal labels function as states that record not just which object a\nreferential word refers to, but also whether that referential word has been generated or not. Viewed\ntop-down, the switch from S to Ao indicates that a word from Bo has just been generated\n(which we interpret as referring to object o). This object o is passed down the Ao chain generating\nwords from B\u2205; the final expansion of Ao \u2192 c checks that o is compatible with the context indicator\nc.\n\n3.3 Adaptor grammars\n\nThis subsection briefly reviews adaptor grammars; for more detail see [11]. An Adaptor Grammar\n(AG) is a septuple (N, W, R, S, \u03b8, A, C) consisting of a PCFG (N, W, R, S, \u03b8) in which a subset\nA \u2286 N of the nonterminals are identified as adapted, and where each adapted nonterminal X \u2208 A\nhas an associated adaptor CX. 
An adaptor CX for X is a function that maps a distribution over\ntrees TX to a distribution over distributions over TX. In this paper we use two-parameter\nPoisson-Dirichlet distributions as adaptors, so the corresponding predictive distributions are Pitman-Yor\nProcesses (PYPs).\nJust as for a PCFG, an AG defines distributions GX over trees TX for each X \u2208 N \u222a W. If X \u2208 W\nor X \u2209 A then GX is defined just as for a PCFG above, i.e., using (1). However, if X \u2208 A then GX\nis defined in terms of an additional distribution HX as follows:\n\nG_X \\sim C_X(H_X)\n\nH_X = \\sum_{X \\to Y_1 \\ldots Y_m \\in R_X} \\theta_{X \\to Y_1 \\ldots Y_m} \\, \\mathrm{TD}_X(G_{Y_1}, \\ldots, G_{Y_m})\n\nThat is, the distribution GX associated with an adapted nonterminal X \u2208 A is a sample from\n\u201cadapting\u201d (i.e., applying CX to) its \u201cordinary\u201d PCFG distribution HX.\nJust as with the PCFG, an AG generates the distribution over trees GS, where S \u2208 N is the start\nsymbol. However, while GS in a PCFG is a fixed distribution (given the rule probabilities \u03b8), in an\nAG the distribution GS is itself a random variable (because each GX for X \u2208 A is random).\nInformally, an AG can be understood as caching the trees associated with adapted nonterminals.\nGenerating a tree associated with an adapted nonterminal involves either reusing an already\ngenerated tree from the cache, or else generating a \u201cfresh\u201d tree as in a PCFG.\n\n3.4 Word segmentation with adaptor grammars\n\nAGs can be used as models of word segmentation, which we briefly review here; see Johnson [12]\nfor more details. The input to the AG consists of a corpus of phoneme strings. For example, the\nphoneme string corresponding to Is that the pig? 
(with its correct segmentation indicated by \u201c/\u201d) is\nas follows:\n\ni z / D & t / D e / p I g\n\nWe can represent any possible segmentation of any possible sentence as a tree generated by the\nfollowing unigram AG.\n\nSentence \u2192 Word+\nWord \u2192 Phoneme+\nPhoneme \u2192 a | b | . . .          (4)\n\nThe trees generated by this adaptor grammar are the same as the trees generated by the CFG rules.\n(In this and following grammars, the Kleene \u201c+\u201d is expanded into a set of left-recursive rules). For\nexample, the following skeletal parse in which all but the Word nonterminals are suppressed (the\nothers are deterministically inferrable) shows the parse that corresponds to the correct segmentation\nof the string above.\n\n(Word i z) (Word D & t) (Word D e) (Word p I g)\n\nBecause the Word nonterminal in the AG is adapted (indicated here by underlining) the adaptor\ngrammar learns the probability of the entire Word subtrees (e.g., the probability that pIg is a Word);\nsee [12] for further details. This AG implements the unigram segmentation model of Goldwater\net al [9], and as explained in section 2, it has the same tendency to undersegment as the original\nunigram model.\nThe collocation AG (5) produces a more accurate segmentation because it models (and therefore\n\u201cexplains away\u201d) some of the inter-word dependencies.\n\nSentence \u2192 Colloc+\nColloc \u2192 Word+\nWord \u2192 Phoneme+\nPhoneme \u2192 a | b | . . .          (5)\n\nThe collocation AG is a hierarchical process, where the base distribution for the Colloc\n(collocation) nonterminal adaptor is generated from the Word distribution. The collocation AG generates a\nsentence as a sequence of Colloc (collocation) nonterminals, each of which is a sequence of Word\nnonterminals. 
It generates skeletal parses such as the following:\n\n(Colloc (Word i z)) (Colloc (Word D & t)) (Colloc (Word D e) (Word p I g))\n\nIn this parse, iz and D&t are analysed as both Words and Collocations, while De pIg is analysed\nas a Collocation consisting of two Words. Given training corpora like the ones we use here, the\ncollocations this AG finds are often noun phrases.\n\n3.5 Adaptor grammars for joint segmentation and word-object acquisition\n\nThis section explains how to combine the word-object reference PCFGs presented in section 3.2\nwith the word segmentation AGs presented in section 3.4. Combining the word-object reference\nPCFGs (2) or (3) with the unigram AG (4) is relatively straight-forward; all we need to do is replace\nthe last rule Bo \u2192 w in these grammars with Bo \u2192 Phoneme+, i.e., the Bo nonterminals expand\nto an arbitrary sequence of phonemes, and the Bo nonterminals are adapted, so these subtrees are\ncached and reused as appropriate. For example, the unigram-reference AG is as follows:\n\nS \u2192 Ao            o \u2208 O\u2032\nAo \u2192 c            c \u2208 C, o \u2208 c\nAo \u2192 Ao Bo        o \u2208 O\u2032\nAo \u2192 Ao B\u2205        o \u2208 O\u2032\nBo \u2192 Phoneme+     o \u2208 O\u2032\n\nThe unigram-reference AG specifies essentially the same model as the one investigated in Jones et\nal [8], and the results below are consistent with those that Jones et al report. This grammar generates\nskeletal parses such as the following:\n\n(B\u2205 i z) (B\u2205 D & t) (B\u2205 D e) (BPIG p I g)\n\nThe unigram-reference1 AG is similar to the unigram-reference AG, except that it stipulates that at\nmost one word per sentence is associated with a (non-null) object.\nIt is also possible to combine the word-object reference PCFGs with the collocation AG. The\nresulting AGs are straight-forward but more complex, so they are not shown here. 
The collocation-reference\nAG is a combination of the collocation AG for word segmentation and the reference PCFG\nfor modelling the word-object relationship. It permits an arbitrary number of words in a sentence to\nbe referential.\nInterestingly, there are two different reasonable ways of combining the collocation AG with the\nreference1 PCFG. The collocation-reference1 AG requires that at most one word in a sentence is\nreferential, just like the reference1 PCFG (3).\nThe collocation-referenceC1 AG is similar to the collocation-reference1 AG, except that it requires\nthat at most one word in a collocation is referential. This means that the collocation-referenceC1\nAG permits multiple referential words in a sentence (but they must all refer to the same object). This\nAG is linguistically plausible because a collocation often consists of a content word, which may be\nreferential, surrounded by function words, which are generally not referential.\n\n4 Experimental results\n\nWe used the same training corpus as Jones et al [8], which was based on the corpus collected by\nFernald et al [19] annotated with the objects in the non-linguistic context by Frank et al [7]. In these\nexperiments we used the publicly available AG inference software described in [15]. Rather than\nspecifying the concentration parameters of each of the Pitman-Yor Processes (PYPs) associated with the\nadapted nonterminals, that software permits us to place priors on them and sample them. 
Here we\nplaced a uniform prior on all PYP a parameters and a sparse Gamma(100, 0.01) prior on the PYP b\nparameters.\nFor each grammar we ran 8 MCMC chains for 5,000 iterations each over the corpus, and collected\nthe sample parses from every 10th iteration from the last 2,500 iterations generated by each run.\nFor each sentence in each sample we extracted the word segmentation and the word-object\nrelationships the parse implies, so we obtained 2,000 sample analyses for each sentence in the corpus.\nWe computed the modal (i.e., most frequent) analysis of each sentence, and this is what we scored\nbelow [15].\nPerhaps the most basic question is: does non-linguistic context help word segmentation? We\nmeasure accuracy here by token f-score [9]. Jones et al [8] investigated this question by comparing\nanalyses from what we are calling the unigram and unigram-reference models, and failed to find\nany overall effect of the non-linguistic context (although they did show that it improves the\nsegmentation accuracy of referential words). However, as the following table shows, we do see a marked\nimprovement in word segmentation f-score when we combine non-linguistic context with the more\naccurate collocation models.\n\nModel                      word segmentation f-score\nunigram                    0.533\nunigram-reference          0.537\nunigram-reference1         0.547\ncollocation                0.695\ncollocation-reference      0.726\ncollocation-reference1     0.719\ncollocation-referenceC1    0.750\n\nWe can also ask the converse question: does better word segmentation improve sentence referent\nidentification? Here we measure how well the models identify which object, if any, each sentence\nrefers to, and do not directly evaluate word segmentation accuracy. The baseline model here\nassigns each sentence the \u201cnull\u201d \u2205 object, achieving an accuracy of 0.709. 
As the table below\nshows, only the collocation-referenceC1 AG with its more complex constraints on the word-object\nrelationship clearly surpasses this baseline. We can also measure the f-score with which the models\nidentify non-\u2205 sentence referents; now the trivial baseline model achieves 0 f-score.\n\nModel                      sentence referent accuracy    sentence referent f-score\nunigram                    0.709                         0\nunigram-reference          0.702                         0.355\nunigram-reference1         0.503                         0.495\ncollocation                0.709                         0\ncollocation-reference      0.728                         0.280\ncollocation-reference1     0.440                         0.493\ncollocation-referenceC1    0.839                         0.747\n\nWe see a marked improvement in sentence referent accuracy and sentence referent f-score with the\ncollocation-referenceC1 AG.\nFinally, we can ask: how well do the models identify the head nouns of referring noun phrases, such\nas pIg in De pIg? We measure this by calculating the f-score of (word,object) token pairs identified\nby the model, where the object is not \u2205. This is a single number that indicates how good the models\nare at identifying referring words and the objects that they refer to.\n\nModel                      topical word f-score\nunigram                    0\nunigram-reference          0.149\nunigram-reference1         0.147\ncollocation                0\ncollocation-reference      0.220\ncollocation-reference1     0.321\ncollocation-referenceC1    0.636\n\nAgain, we find that the collocation-referenceC1 AG identifies referring words and the objects they\nrefer to more accurately than the other models.\n\n5 Conclusion\n\nThis paper has used Adaptor Grammars (AGs) to formulate a variety of models that jointly segment\nutterances into words and identify the objects in the non-linguistic context that some of these words\nrefer to. The AGs differed in the kinds of generalisations they are capable of learning, and in the\nrelationship between word segmentation and word reference that they assume. 
The most accurate\nresults in word segmentation and in the identification of the word-object relationship were obtained by\nthe collocation-referenceC1 AG that tightly integrates a collocation-based model of word\nsegmentation with constraints that require no more than one referential word per collocation. As argued in\nthe introduction, this is consistent with an \u201cinteractive\u201d approach to language learning.\n\nReferences\n\n[1] Patricia K. Kuhl. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5:831\u2013843, 2004.\n\n[2] Katharine Graf Estes, Julia L. Evans, Martha W. Alibali, and Jenny R. Saffran. Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychological Science, 18(3):254\u2013260, 2007.\n\n[3] James L. McClelland and David E. Rumelhart. An interactive activation model of context effects in letter perception. Psychological Review, 88(5):375\u2013407, 1981.\n\n[4] Jeffrey Elman. Finding structure in time. Cognitive Science, 14:197\u2013211, 1990.\n\n[5] M. Brent and T. Cartwright. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61:93\u2013125, 1996.\n\n[6] M. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71\u2013105, 1999.\n\n[7] Michael C. Frank, Noah Goodman, and Joshua Tenenbaum. Using speakers\u2019 referential intentions to model early cross-situational word learning. Psychological Science, 20:579\u2013585, 2009.\n\n[8] Bevan K. Jones, Mark Johnson, and Michael C. Frank. Learning words and their meanings from unsegmented child-directed speech. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 501\u2013509, Los Angeles, California, June 2010. Association for Computational Linguistics.\n\n[9] Sharon Goldwater, Thomas L. 
Griffiths, and Mark Johnson. A Bayesian framework for word segmentation:\nExploring the effects of context. Cognition, 112(1):21\u201354, 2009.\n\n[10] Y. W. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American\nStatistical Association, 101:1566\u20131581, 2006.\n\n[11] Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. Adaptor Grammars: A framework for specifying\ncompositional nonparametric Bayesian models. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors,\nAdvances in Neural Information Processing Systems 19, pages 641\u2013648. MIT Press, Cambridge, MA,\n2007.\n\n[12] Mark Johnson. Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic\nstructure. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics,\nColumbus, Ohio, 2008. Association for Computational Linguistics.\n\n[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine\nLearning Research, 3:993\u20131022, 2003.\n\n[14] Mark Johnson. PCFGs, topic models, adaptor grammars and learning topical collocations and the structure\nof proper names. In Proceedings of the 48th Annual Meeting of the Association for Computational\nLinguistics, pages 1148\u20131157, Uppsala, Sweden, July 2010. Association for Computational Linguistics.\n\n[15] Mark Johnson and Sharon Goldwater. Improving nonparameteric Bayesian inference: experiments on\nunsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies:\nThe 2009 Annual Conference of the North American Chapter of the Association for Computational\nLinguistics, pages 317\u2013325, Boulder, Colorado, June 2009. Association for Computational Linguistics.\n\n[16] Shay B. Cohen, David M. Blei, and Noah A. Smith. 
Variational inference for adaptor grammars. In\nHuman Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association\nfor Computational Linguistics, pages 564\u2013572, Los Angeles, California, June 2010. Association\nfor Computational Linguistics.\n\n[17] Kenichi Kurihara and Taisuke Sato. Variational Bayesian grammar induction for natural language. In 8th\nInternational Colloquium on Grammatical Inference, 2006.\n\n[18] Mark Johnson, Thomas Griffiths, and Sharon Goldwater. Bayesian inference for PCFGs via Markov chain\nMonte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter\nof the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139\u2013146,\nRochester, New York, April 2007. Association for Computational Linguistics.\n\n[19] Anne Fernald and Hiromi Morikawa. Common themes and cultural variations in Japanese and American\nmothers\u2019 speech to infants. Child Development, 64(3):637\u2013656, 1993.\n", "award": [], "sourceid": 530, "authors": [{"given_name": "Mark", "family_name": "Johnson", "institution": null}, {"given_name": "Katherine", "family_name": "Demuth", "institution": null}, {"given_name": "Bevan", "family_name": "Jones", "institution": null}, {"given_name": "Michael", "family_name": "Black", "institution": null}]}