{"title": "Automatic Acquisition and Efficient Representation of Syntactic Structures", "book": "Advances in Neural Information Processing Systems", "page_first": 107, "page_last": 114, "abstract": null, "full_text": "Automatic Acquisition and Efficient Representation of Syntactic Structures

Zach Solan, Eytan Ruppin, David Horn
Faculty of Exact Sciences
Tel Aviv University
Tel Aviv, Israel 69978
{solan,ruppin,horn}@post.tau.ac.il

Shimon Edelman
Department of Psychology
Cornell University
Ithaca, NY 14853, USA
se37@cornell.edu

Abstract

The distributional principle, according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1], has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as systematicity.

1 Motivation

Models dealing with the acquisition of syntactic knowledge are sharply divided into two classes, depending on whether they subscribe to some variant of the classical generative theory of syntax or operate within the framework of "general-purpose" statistical or distributional learning. An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by the Government and Binding theory. 
An example of the latter is Elman's widely used Simple Recurrent Network (SRN) [3].

We believe that the polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common, more powerful framework. Indeed, on the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation, probabilistic learning, and "connectionist" modeling; yet generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel at just those tasks, yet are brittle and difficult to train.

We present a scheme that acquires "raw" syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1].

2 The ADIOS model

The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. The PA algorithm aims to detect patterns -- repetitive sequences of "significant" strings of primitives occurring in the corpus (Figure 1). In that respect, it is related to prior work on alignment-based learning [4] and regular expression ("local grammar") extraction [5] from corpora. 
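As a concrete, if simplified, illustration of element (1), the corpus-as-graph idea can be sketched as follows. This is our own illustration, not the authors' implementation; all identifiers and the edge-label encoding are ours. Each sentence is stored as a path of edges, each edge labeled by sentence number and within-sentence index.

```python
# Sketch of a corpus loaded into an RDS-style multigraph (our illustration).
from collections import defaultdict

def build_rds(sentences):
    """Return edges[(u, v)] -> list of (sentence_id, within_sentence_index) labels."""
    edges = defaultdict(list)
    for sid, tokens in enumerate(sentences, start=101):
        path = ["BEGIN"] + tokens + ["END"]
        for i, (u, v) in enumerate(zip(path, path[1:]), start=1):
            edges[(u, v)].append((sid, i))
    return edges

corpus = [["the", "cat", "is", "eat", "-ing"],
          ["do", "you", "see", "the", "cat", "?"]]
rds = build_rds(corpus)
# The edge (the -> cat) carries one label per sentence that traverses it:
assert len(rds[("the", "cat")]) == 2
assert rds[("BEGIN", "the")] == [(101, 1)]
```

In this encoding a sentence is recoverable by following the edges that share its sentence number, which is the path notion used by the PA algorithm below.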
We stress, however, that our algorithm requires no pre-judging either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion.

In the initial phase of the PA algorithm, the text is segmented down to the smallest possible morphological constituents (e.g., ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds).^1 This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph. A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number.

[Figure 1: schematic diagram, panels (a) and (b).]

Figure 1: (a) Two sequences m_i, m_j, m_l and m_i, m_k, m_l form a pattern c_{i{j,k}l} = m_i, {m_j, m_k}, m_l, which allows m_j and m_k to be attributed to the same equivalence class, following the principle of complementary distributions [1]. Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. 1). (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. 
Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of equivalence classes associated with c_u and c_v.

In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant Patterns (SPs) -- sequences of constituents -- which are then used to modify the graph (Algorithm 1). For each path p_i, the algorithm constructs a list of candidate constituents, c_{i1}, ..., c_{ik}. Each of these consists of a "prefix" (a sequence of graph edges), an equivalence class of vertices, and a "suffix" (another sequence of edges; cf. Figure 2(b)).

The criterion I' for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c_1, ..., c_k must have high mutual information):

  I'(c_1, c_2, ..., c_k) = e^{-(L/k)^2} P(c_1, c_2, ..., c_k) log [ P(c_1, c_2, ..., c_k) / \prod_{j=1}^{k} P(c_j) ]    (1)

where L is the typical context length and k is the length of the candidate pattern; the probabilities associated with a c_j are estimated from frequencies that are immediately available in the graph (e.g., the out-degree of a node is related to the marginal probability of the corresponding c_j). Equation 1 balances two opposing "forces" in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context.

^1 We remark that the algorithm can work in any language, with any set of tokens, including individual characters -- or phonemes, if applied to speech.

Algorithm 1 PA (pattern acquisition), phase 2
 1: while patterns exist do
 2:   for all path in graph do  {path = sentence; graph = corpus}
 3:     for all source node in path do
 4:       for all sink node in path do  {source and sink can be equivalence classes}
 5:         degree of separation = path index(sink) - path index(source);
 6:         pattern table <- detect patterns(source, sink, degree of separation, equivalence table);
 7:       end for
 8:     end for
 9:   end for
10:   winner <- get most significant pattern(pattern table);
11:   equivalence table <- detect equivalences(graph, winner);
12:   graph <- rewire graph(graph, winner);
13: end while

A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2). Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are untouched. This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power.

During the pass over the corpus, the list of equivalence sets is updated continuously; the identification of new significant patterns is done using the current equivalence sets (Figure 2(d)). Thus, as the algorithm processes more and more text, it "bootstraps" itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. The recursive nature of this process enables the algorithm to form more and more complex patterns, in a hierarchical manner. The relationships among these can be visualized in a tree format, with tree depth corresponding to the level of recursion (e.g., Figure 3(c)). The PA algorithm halts when it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop).

Generalization. 
A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [6], p. 63: "the grammar of a language is simply an inventory of linguistic units." The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). Generalization can also ensue from partial activation of existing patterns by novel inputs. This function is supported by the input module, designed to process a novel sentence by forming its distributed representation in terms of activities of existing patterns (Figure 6).

[Figure 2: RDS graph diagrams, panels (a)-(d). Recoverable content: sample sentences #101 "the cat is eat -ing", #102 "do you see the cat?", #103 "are you sure?"; edges labeled by sentence number and within-sentence index (e.g., 101_1 ... 101_6); PATTERN 230: the cat is {eat, play, stay} -ing; Equivalence Class 230: {stay, eat, play}; PATTERN 231: BEGIN {they, we} {230} here.]

Figure 2: (a) A small portion of the RDS graph for a simple corpus, with sentence #101 (the cat is eat -ing) indicated by solid arcs. (b) This sentence joins a pattern the cat is {eat, play, stay} -ing, in which two others (#109, #121) already participate. (c) The abstracted pattern, and the equivalence class associated with it (edges that belong to sequences not subsumed by this pattern, e.g., #131, are untouched). (d) The identification of new significant patterns is done using the acquired equivalence classes (e.g., #230). In this manner, the system "bootstraps" itself, recursively distilling more and more complex patterns.

Pattern activities are computed by propagating activation from the bottom (the terminals) to the top (the patterns) of the RDS. The initial activities w_j of the terminals c_j are calculated given the novel input s_1, ..., s_k as follows:

  w_j = max_{m=1..k} { I(s_m, c_j) }    (2)

where I(s_m, c_j) is the mutual information between s_m and c_j. For an equivalence class, the value propagated upwards is the strongest non-zero activation of its members; for a pattern, it is the average weight of the children nodes, on the condition that all the children were activated by adjacent inputs. Activity propagation continues until it reaches the top nodes of the pattern lattice. When the algorithm encounters a novel word, all the members of the terminal equivalence class contribute a value of epsilon, which is then propagated upwards as usual. 
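The two numerical ingredients just described can be made concrete with a minimal sketch: the significance score of eq. 1 and the terminal-activation rule of eq. 2. This is our own hedged reading of the equations; the function names, toy probabilities, and mutual-information values below are ours, not the paper's.

```python
import math

def significance(p_joint, p_parts, L):
    """Eq. 1: I'(c_1..c_k) = exp(-(L/k)^2) * P(c_1..c_k) * log(P(c_1..c_k) / prod_j P(c_j)).
    p_joint: estimated probability of the whole candidate pattern;
    p_parts: marginal probabilities of its k constituents; L: typical context length."""
    k = len(p_parts)
    mi_term = p_joint * math.log(p_joint / math.prod(p_parts))
    return math.exp(-(L / k) ** 2) * mi_term

def terminal_activation(mutual_info, inputs, terminal):
    """Eq. 2: w_j = max over input tokens s_m of I(s_m, c_j)."""
    return max(mutual_info.get((s, terminal), 0.0) for s in inputs)

# A pattern whose joint probability exceeds the product of its parts'
# marginals scores positive; larger k shrinks the length penalty exp(-(L/k)^2).
score = significance(0.05, [0.1, 0.1, 0.1], L=3)
assert score > 0
# Toy mutual-information table: "stay" partially activates the terminal "liv".
I = {("stay", "liv"): 0.8}
assert terminal_activation(I, ["Joe", "stay"], "liv") == 0.8
```

Note how the exponential factor implements the syntagmatic consideration (longer patterns are penalized less) while the second factor is the paradigmatic, mutual-information term.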
This enables the model to make an educated guess as to the meaning of the unfamiliar word, by considering the patterns that become active (Figure 6(b)).

3 Results

We now briefly describe the results of several studies designed to evaluate the viability of the ADIOS model, in which it was exposed to corpora of varying size and complexity.

[Figure 3: grammar, sentences, and parse tree. Recoverable content: (a) grammar rules such as propnoun: "Joe" | "Beth" | "Jim" | "Cindy" | "Pam" | "George"; article: "The" | "A"; noun: "cat" | "dog" | "cow" | "bird" | "rabbit" | "horse" (and plurals); verb: "working" | "living" | "playing"; emphasize: "very" | "extremely" | "really". (b) Sample sentences: the horse is living very extremely far away. / the cow is working at least until Thursday. / Jim loved Pam. / George is staying until Wednesday. / George worshipped the horse. / Cindy and George have a great personality. / Pam has a fast boat. (c) Parse of "George is working extremely far away": PATTERN.ID=144, SIGNIFICANCE=0.11, OCCURRENCES=38, SEQUENCE=(120)+(101), MEAN.LENGTH=29.4.]

Figure 3: (a) A part of a simple grammar. (b) Some sentences generated by this grammar. (c) The structure of a sample sentence (pattern #144), presented in the form of a tree that captures the hierarchical relationships among constituents. Three equivalence classes are shown explicitly (highlighted).

Emergence of syntactic structures. 
Figure 3 shows an example of a sentence from a corpus produced by a simple artificial grammar, together with its ADIOS analysis (the use of a simple grammar, constructed with Rmutt, http://www.schneertz.com/rmutt, in these initial experiments allowed us to examine various properties of the model on tightly controlled data). The abstract representation of the sample sentence in Figure 3(c) looks very much like a parse tree, indicating that our method successfully identified the grammatical structure used to generate the data. To illustrate the gradual emergence of our model's ability to represent syntactic structures concisely, Figure 4 (top) shows four trees built for the same sentence after exposing the model to progressively more data from the same corpus. Note that both the number of distinct patterns and the average number of patterns per sentence asymptote for this corpus after exposure to about 500 sentences (Figure 4, bottom).

Novel inputs; systematicity. An important characteristic of a cognitive representation scheme is its systematicity, measured by the ability to deal properly with structurally related items (see [7] for a definition and discussion). We have assessed the systematicity of the ADIOS model by splitting the corpus generated by the grammar of Figure 3 into training and test sets. After training the model on the former, we examined the representations of unseen sentences from the test set. A typical result appears in Figure 5; the general finding was of Level 3 systematicity according to the nomenclature of [7]. 
This example can also be understood in terms of generating novel sentences from patterns, explained in detail below: the novel sentence (Beth is playing on Sunday) can be produced by the same pattern (#173) that accounts for the familiar sentence (the horse is playing on Thursday), which is part of the training corpus.

The ADIOS system's input module allows it to process a novel sentence by forming its distributed representation in terms of activities of existing patterns. Figure 6 shows the activation of two patterns (#141 and #120) by a phrase that contains a word in a novel context (stay), as well as another word never before encountered in any context (5pm).

[Figure 4: four parse trees over the same sentence at increasing corpus sizes, and a bottom panel plotting pattern counts against the number of sentences in the corpus (0-1000).]

Figure 4: Top: the build-up of structured information with progressive exposure to a corpus generated by the simple grammar of Figure 3. (a) Prior to exposure. (b) 100 sentences. (c) 200 sentences. (d) 400 sentences. Bottom: the total number of detected patterns and the average number of patterns in a sentence, plotted vs. corpus size.

[Figure 5: two parse trees, (a) the unseen sentence "Beth is playing on Sunday" and (b) the seen sentence "the horse is playing on Thursday", both covered by pattern #173.]

Figure 5: (a) Structured representation of an "unseen" sentence that had been excluded from the corpus used to learn the patterns; note that the detected structure is identical to that of (b), a "seen" sentence. The identity between the structures detected in (a) and (b) is a manifestation of Level-3 systematicity of the ADIOS model ("Novel Constituent: the test set contains at least one atomic constituent that did not appear anywhere in the training set"; see [7], pp. 3-4).

[Figure 6: two pattern lattices responding to the input "Joe and Beth are staying until 5pm"; pattern #141 at activation level 0.972 and pattern #120 at activation level 0.667, with leaf weights such as W_0 = 1.0, W_8 = 1.0, W_15 = 0.8.]

Figure 6: The input module in action (the two most relevant -- highly active -- patterns responding to the input Joe and Beth are staying until 5pm). Leaf activation is proportional to the mutual information between inputs and various members of the equivalence classes (e.g., on the left, W_15 = 0.8 is the mutual information between stay and liv, which is a member of equivalence class #112). It is then propagated upwards by taking the average at each junction.

Working with real data: the CHILDES corpus. To illustrate the scalability of our method, we describe here briefly the outcome of applying the PA algorithm to a subset of the CHILDES collection [8], which consists of transcribed speech produced by, or directed at, children. 
The corpus we selected contained 9665 sentences (74500 words) produced by parents. The results, one of which is shown in Figure 7, were encouraging: the algorithm found intuitively significant SPs and produced semantically adequate corresponding equivalence sets. Altogether, 1062 patterns and 775 equivalence classes were established. Representing the corpus in terms of these constituents resulted in a significant compression: the average number of constituents per sentence dropped from 6.70 in the raw data to 2.18 after training, and the entropy per letter was reduced from 2.6 to 1.5.

[Figure 7. Recoverable content -- left: pattern #1960, built over equivalence classes containing items such as Becky, Brennen, Eric, Miffy, mommy, that, the; big, blue, biggest, different, easy, little, littlest, next, right, round, square, white, other, yellow, green, orange; one, room, side, way, chicken. Right: CHILDES originals (upper line) paired with ADIOS-generated phrases (lower line):
CHILDES_2764: they don 't want ta go for a ride ? / you don 't want ta look for another ride ?
CHILDES_2642: can we make a little house ? / should we make another little dance ?
CHILDES_2504: should we put the bed s in the house ? / should we take some doggie s on that house ?
CHILDES_1038: where 'd the what go ? / where are the what 's he gon ta do go ?
CHILDES_2304: want Mommy to show you ? / like her to help they ?]

Figure 7: Left: a typical pattern extracted from a subset of the CHILDES corpora collection [8]. Hundreds of such patterns and equivalence classes (underscored in this figure) together constitute a concise representation of the raw data. Some of the phrases that can be described/generated by pattern 1960 are: where's the big room?; where's the yellow one?; where's Becky?; where's that?. Right: some of the phrases generated by ADIOS (lower line in each pair) using sentences from CHILDES (upper line) as examples. The generation module works by traversing the top-level pattern tree, stringing together lower-level patterns and selecting randomly one member from each equivalence class. Extensive testing (currently under way) is needed to determine whether the grammaticality of the newly generated phrases (which is at present less than ideal, as can be seen here) improves with more training data.

4 Concluding remarks

We have described a linguistic pattern acquisition algorithm that aims to achieve a streamlined representation by compactly encoding recursively structured constituent patterns as single constituents, and by placing strings that have an identical backbone and similar context structure into the same equivalence class. 
Although our pattern-based representations may look like collections of finite automata, the information they contain is much richer, because of the recursive invocation of one pattern by another, and because of the context sensitivity implied by relationships among patterns. The sensitivity to context of pattern abstraction (during learning) and use (during generation) contributes greatly both to the conciseness of the ADIOS representation and to the conservative nature of its generative behavior. This context sensitivity -- in particular, the manner in which ADIOS balances syntagmatic and paradigmatic cues provided by the data -- is mainly what distinguishes it from other current work on unsupervised probabilistic learning of syntax, such as [9, 10, 4].

In summary, finding a good set of structured units leads to the emergence of a convergent representation of language, one that eventually changes less and less with progressive exposure to more data. The power of the constituent graph representation stems from the interacting ensembles of patterns and equivalence classes that comprise it. Together, the local patterns create global complexity and impose long-range order on the linguistic structures they encode. Among the challenges implicit in this approach that we leave for future work are (1) interpreting the syntactic structures found by ADIOS in the context of contemporary theories of syntax, and (2) relating those structures to semantics.

Acknowledgments. We thank Regina Barzilay, Morten Christiansen, Dan Klein, Lillian Lee and Bo Pang for useful discussions and suggestions, and the US-Israel Binational Science Foundation, the Dan David Prize Foundation, the Adams Super Center for Brain Studies at TAU, and the Horowitz Center for Complexity Science for financial support.

References

[1] Z. S. Harris. Distributional structure. Word, 10:140-162, 1954.
[2] R. Kazman. Simulating the child's acquisition of the lexicon and syntax: experiences with Babel. Machine Learning, 16:87-120, 1994.
[3] J. L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[4] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Report 05, School of Computing, Leeds University, 2001.
[5] M. Gross. The construction of local grammars. In E. Roche and Y. Schabes, eds., Finite-State Language Processing, pages 329-354. MIT Press, Cambridge, MA, 1997.
[6] R. W. Langacker. Foundations of Cognitive Grammar, volume I: Theoretical Prerequisites. Stanford University Press, Stanford, CA, 1987.
[7] T. J. van Gelder and L. Niklasson. On being systematically connectionist. Mind and Language, 9:288-302, 1994.
[8] B. MacWhinney and C. Snow. The child language exchange system. Journal of Computational Linguistics, 12:271-296, 1985.
[9] D. Klein and C. D. Manning. Natural language grammar induction using a constituent-context model. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[10] A. Clark. Unsupervised Language Acquisition: Theory and Practice. PhD thesis, COGS, University of Sussex, 2001.
", "award": [], "sourceid": 2253, "authors": [{"given_name": "Zach", "family_name": "Solan", "institution": null}, {"given_name": "Eytan", "family_name": "Ruppin", "institution": null}, {"given_name": "David", "family_name": "Horn", "institution": null}, {"given_name": "Shimon", "family_name": "Edelman", "institution": null}]}