{"title": "Modeling the effects of memory on human online sentence processing with particle filters", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 944, "abstract": "Language comprehension in humans is significantly constrained by memory, yet rapid, highly incremental, and capable of utilizing a wide range of contextual information to resolve ambiguity and form expectations about future input. In contrast, most of the leading psycholinguistic models and fielded algorithms for natural language parsing are non-incremental, have run time superlinear in input length, and/or enforce structural locality constraints on probabilistic dependencies between events. We present a new limited-memory model of sentence comprehension which involves an adaptation of the particle filter, a sequential Monte Carlo method, to the problem of incremental parsing. We show that this model can reproduce classic results in online sentence comprehension, and that it naturally provides the first rational account of an outstanding problem in psycholinguistics, in which the preferred alternative in a syntactic ambiguity seems to grow more attractive over time even in the absence of strong disambiguating information.", "full_text": "Modeling the effects of memory on human online\n\nsentence processing with particle \ufb01lters\n\nRoger Levy\n\nDepartment of Linguistics\n\nUniversity of California, San Diego\n\nFlorencia Reali Thomas L. 
Griffiths

Department of Psychology

University of California, Berkeley

rlevy@ling.ucsd.edu

{floreali,tom griffiths}@berkeley.edu

Abstract

Language comprehension in humans is significantly constrained by memory, yet rapid, highly incremental, and capable of utilizing a wide range of contextual information to resolve ambiguity and form expectations about future input. In contrast, most of the leading psycholinguistic models and fielded algorithms for natural language parsing are non-incremental, have run time superlinear in input length, and/or enforce structural locality constraints on probabilistic dependencies between events. We present a new limited-memory model of sentence comprehension which involves an adaptation of the particle filter, a sequential Monte Carlo method, to the problem of incremental parsing. We show that this model can reproduce classic results in online sentence comprehension, and that it naturally provides the first rational account of an outstanding problem in psycholinguistics, in which the preferred alternative in a syntactic ambiguity seems to grow more attractive over time even in the absence of strong disambiguating information.

1 Introduction

Nearly every sentence occurring in natural language can, given appropriate contexts, be interpreted in more than one way. The challenge of comprehending a sentence is identifying the intended interpretation from among these possibilities. More formally, each interpretation of a sentence w can be associated with a structural description T, and to comprehend a sentence is to infer T from w – parsing the sentence to reveal its underlying structure. From a probabilistic perspective, this requires computing the posterior distribution P(T|w) or some property thereof, such as the description T with highest posterior probability. 
This probabilistic perspective has proven extremely valuable in developing both effective methods by which computers can process natural language [1, 2] and models of human language processing [3].

In real life, however, people receive nearly all linguistic input incrementally: sentences are spoken, and written sentences are by and large read, from beginning to end. There is considerable evidence that people also comprehend incrementally, making use of linguistic input moment by moment to resolve structural ambiguity and form expectations about future inputs [4, 5]. The incremental parsing problem can, roughly, be stated as the problem of computing the posterior distribution P(T|w1...i) for a partial input w1...i. To be somewhat more precise, incremental parsing involves constructing a distribution over partial structural descriptions of w1...i which implies the posterior P(T|w1...i). A variety of "rational" models of online sentence processing [6, 7, 8, 9] take exactly this perspective, using the properties of P(T|w1...i) or quantities derived from it to explain why people find some sentences more difficult to comprehend than others.

Despite their success in capturing a variety of psycholinguistic phenomena, existing rational models of online sentence processing leave open a number of questions, both theoretical and empirical. On the theoretical side, these models assume that humans are "ideal comprehenders" capable of computing P(T|w1...i) despite its significant computational cost. This kind of idealization is common in rational models of cognition, but raises questions about how resource constraints might affect language processing. 
For structured probabilistic formalisms in widespread use in computational linguistics, such as probabilistic context-free grammars (PCFGs), incremental processing algorithms exist that allow the exact computation of the posterior (implicitly represented) in polynomial time [10, 11, 12], from which k-best structures [13] or samples from the posterior [14] can be efficiently obtained. However, these algorithms are psychologically implausible for two reasons: (1) their run time (both worst-case and practical) is superlinear in sentence length, whereas human processing time is essentially linear in sentence length; and (2) the probabilistic formalisms utilized in these algorithms impose strict locality conditions on the probabilistic dependence between events at different levels of structure, whereas humans seem to be able to make use of arbitrary features of (extra-)linguistic context in forming incremental posterior expectations [4, 5].

Theoretical questions about the mechanisms underlying online sentence processing are complemented by empirical data that are hard to explain purely in probabilistic terms. For example, one of the most compelling phenomena in psycholinguistics is that of garden-path sentences, such as:

(1) The woman brought the sandwich from the kitchen tripped.

Comprehending such sentences presents a significant challenge, and many readers fail completely on their first attempt. However, the sophisticated dynamic programming algorithms typically used for incremental parsing implicitly represent all possible continuations of a sentence, and are thus able to recover the correct interpretation in a single pass. 
Another phenomenon that is hard to explain simply in terms of the probabilities of interpretations of a sentence is the "digging in" effect, in which the preferred alternative in a syntactic ambiguity seems to grow more attractive over time even in the absence of strong disambiguating information [15].

In this paper, we explore the hypothesis that these phenomena can be explained as the consequence of constraints on the resources available for incremental parsing. Previous work has addressed the issues of feature locality and resource constraints by adopting a pruning approach, in which hard locality constraints on probabilistic dependence are abandoned and only high-probability candidate structures are maintained after each step of incremental parsing [6, 16, 17, 18]. These approaches can be thought of as focusing on holding on to the highest posterior-probability parse as often as possible. Here, we look to the machine learning literature to explore an alternative approach focused on approximating the posterior distribution P(T|w1...i). We use particle filters [19], a sequential Monte Carlo method commonly used for approximate probabilistic inference in an online setting, to explore how the computational resources available influence the comprehension of sentences. This approach builds on the strengths of rational models of online sentence processing, allowing us to examine how performance degrades as the resources of the ideal comprehender decrease.

The plan of the paper is as follows. Section 2 introduces the key ideas behind particle filters, while Section 3 outlines how these ideas can be applied in the context of incremental parsing. Section 4 illustrates the approach for the kind of garden-path sentence given above, and Section 5 presents an experiment with human participants testing the predictions that the resulting model makes about the digging-in effect. 
Section 6 concludes the paper.

2 Particle filters

Particle filters are a sequential Monte Carlo method typically used for probabilistic inference in contexts where the amount of data available increases over time [19]. The canonical setting in which a particle filter would be used involves a sequence of latent variables z1, ..., zn and a sequence of observed variables x1, ..., xn, with the goal of estimating P(zn|x1...n). The particle filter solves this problem recursively, relying on the fact that the chain rule gives

P(zn|x1...n) ∝ P(xn|zn) Σ_{zn−1} P(zn|zn−1) P(zn−1|x1...n−1)   (1)

where we assume xn and zn are independent of all other variables given zn and zn−1 respectively. Assume we know P(zn−1|x1...n−1). Then we can use this distribution to construct an importance sampler for P(zn|x1...n). We generate several values of zn−1 from P(zn−1|x1...n−1). Then, we draw zn from P(zn|zn−1) for each instance of zn−1, to give us a set of values from P(zn|x1...n−1). Finally, we assign each value of zn a weight proportional to P(xn|zn), to give us an approximation to P(zn|x1...n). The particle filter is simply the recursive version of this algorithm, in which a similar approximation is used to construct P(zn−1|x1...n−1) from P(zn−2|x1...n−2) and so forth. The algorithm thus approximates P(zn−1|x1...n−1) with a weighted set of "particles" – discrete values of zi – which are updated using P(zn|zn−1) and P(xn|zn) to provide an approximation to P(zn|x1...n). The particle filter thus has run time linear in the number of observations, and provides a way to explore the influence of memory capacity (reflected in the number of particles) on probabilistic inference (cf. [20, 21]). 
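The recursion in Equation 1 can be sketched as a generic bootstrap particle filter: propose each particle forward through the transition model, weight by the likelihood of the new observation, and resample. The sketch below is illustrative only (the function and parameter names are ours, not from the paper); the per-word surprisal computed from the mean unnormalized weight anticipates the use made of it in Section 3.

```python
import math
import random

def particle_filter(init, transition, likelihood, observations, n_particles=100):
    """Generic bootstrap particle filter (sequential importance resampling).

    init:       () -> z, samples an initial latent state
    transition: z -> z', samples from P(z'|z)
    likelihood: (x, z) -> P(x|z)
    Returns the final particle set and the per-observation surprisals.
    """
    particles = [init() for _ in range(n_particles)]
    surprisals = []
    for x in observations:
        # Propose: draw z_n from P(z_n | z_{n-1}) for each particle.
        particles = [transition(z) for z in particles]
        # Weight: w proportional to P(x_n | z_n).
        weights = [likelihood(x, z) for z in particles]
        total = sum(weights)
        if total == 0:
            raise RuntimeError("all particles failed to integrate the observation")
        # Surprisal of x_n approximated as -log of the mean unnormalized weight.
        surprisals.append(-math.log(total / n_particles))
        # Resample: multinomial draw proportional to the weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return particles, surprisals

# Toy usage: a two-state persistent process observed with 90%-reliable observations.
random.seed(0)
particles, surprisals = particle_filter(
    init=lambda: random.choice([0, 1]),
    transition=lambda z: z,                        # state persists
    likelihood=lambda x, z: 0.9 if x == z else 0.1,
    observations=[0, 0, 0, 0, 0],
    n_particles=200,
)
```

After a few observations of 0, resampling concentrates nearly all particles on state 0, illustrating how the weighted particle set tracks the posterior.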
In this paper, we focus on the conditions under which the particle filter fails, as a source of information about the challenges that limited memory capacity poses for online sentence processing.

3 Incremental parsing with particle filters

In this section we develop an algorithm for top-down, incremental particle-filter parsing. We first lay out the algorithm, then consider options for representations and grammars.

3.1 The basic algorithm

We assume that the structural descriptions of a sentence are context-free trees, as might be produced by a PCFG. Without loss of generality, we also assume that preterminal expansions are always unary rewrites. A tree is generated incrementally in a sequence of derivation operations π1...m, such that no word can be generated unless all the words preceding it in the sentence have already been generated. The words of the sentence can thus be considered observations, and the hidden state is a partial derivation (D, S), where D is an incremental tree structure and S is a stack of items of the form ⟨N, Op⟩, where N is a target node in D and Op is a derivation operation type. Later in this section, we outline three possible derivation orders.

The problem of inferring a distribution over partial derivations from observed words can be approximated using particle filters as outlined in Section 2. Assume a model that specifies a probability distribution P(π|(D, S), w1...i) over the next derivation operation π given the current partial derivation and words already seen. By (D, S) ⇒π1...j (D′, S′) we denote that the sequence of derivation operations π1...j takes the partial derivation (D, S) to a new partial derivation (D′, S′). Now consider a partial derivation (D|i, S|i) in which the most recent derivation operation has generated the ith word in the input. 
Through the ⇒π relation, our model implies a probability distribution over new partial derivations in which the next operation would be the generation of the (i+1)th word; call this distribution P((D|i+1, S|i+1)|(D|i, S|i)). In the nomenclature of particle filters introduced above, partial derivations (D|i, S|i) thus correspond to latent variables zi, words wi to observations xi, and our importance sampler involves drawing from P((D|i, S|i)|(D|i−1, S|i−1)) and reweighting by P(wi|(D|i, S|i)). This differs from the standard particle filter only in that zi is not necessarily independent of x1...i−1 given zi−1.

3.2 Representations and grammars

We now describe three possible derivation orders that can be used with our approach. For each order, a derivation operation πOp of a given type Op specifies a sequence of symbols Y1 ... Yk (possibly the empty sequence ε), and can be applied to a partial derivation: (D, [⟨N, Op⟩]⊕S) ⇒πOp (D′, A⊕S), with ⊕ being list concatenation. That is, a derivation operation involves popping the top item off the stack, choosing a derivation operation of the appropriate type, applying it to add some symbols to D yielding D′, and pushing a list of new items A back onto the stack. Derivation operations differ in the relationship between D and D′, and derivation orders differ in the contents of A.

Order 1: Expansion (Exp) only. D′ consists of D with node N expanded to have daughters Y1 ... Yk; and A = [⟨Y1, Exp⟩, ..., ⟨Yk, Exp⟩].

Order 2: Expansion and Right-Sister (Sis). The sequence of symbols specified by any πOp is of maximum length 1. Expansion operations affect D as above. For a right-sister operation πSis, D′ consists of D with Y1 added as the right sister of N (if πSis specifies ε, then D = D′). A = [⟨Y1, Exp⟩, ⟨Y1, Sis⟩, ..., ⟨Yk, Exp⟩, ⟨Yk, Sis⟩].

Order 3: Expansion, Right-Sister, and Adjunction (Adj). The sequence of symbols specified by any πOp is of maximum length 1. Expansion and right-sister operations are as above. For an adjunction operation πAdj, D′ consists of D with Y1 spliced in at the node N – that is, Y1 replaces N in the tree, and N becomes the lone daughter of Y1 (if πAdj specifies ε, then D = D′). A = [⟨Y1, Exp⟩, ⟨Y1, Sis⟩, ⟨Y1, Adj⟩, ..., ⟨Yk, Exp⟩, ⟨Yk, Sis⟩, ⟨Yk, Adj⟩].

[Figure 1 tree diagrams for panels (a)-(c) omitted; not recoverable from extraction.]

Figure 1: Three possible derivation orders for the sentence "Pat walked yesterday and Sally slept". In each case, the partial derivation (D|i, S|i) is shown for i = 2 – up to just before the generation of the word "walked". The symbols ADVP, CC, and S3 in (a) will be generated later in the derivations of (b) and (c) as right-sister operations; the symbol S1 will be generated in (c) as an adjunction operation. During the incremental parsing of "walked" these partial derivations would be reweighted by PExp(walked|(D|i, S|i)).

In all cases, the initial state of a derivation is a root symbol targeted for expansion: (ROOT, [⟨ROOT, Exp⟩]), and a derivation is complete when the stack is empty. 
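As a concrete illustration of how a derivation operation pops a target ⟨N, Op⟩ off the stack, modifies D, and pushes new targets A, here is a minimal sketch of Order 1 (expansion only). The `Node` class and `expand` function are illustrative names of ours, not part of the paper's implementation.

```python
# Minimal sketch of derivation order 1 (expansion only): trees are objects
# with child lists, and the stack holds (node, op) targets, top at index 0.

class Node:
    def __init__(self, label):
        self.label = label
        self.children = []  # daughters of this node in D

def expand(stack, daughters):
    """Apply an expansion operation: pop the top <N, Exp> target, attach
    daughters Y1..Yk to N, and push new <Yi, Exp> targets left to right,
    so the leftmost daughter is expanded next."""
    node, op = stack.pop(0)
    assert op == "Exp"
    new_nodes = [Node(label) for label in daughters]
    node.children = new_nodes
    # A = [<Y1, Exp>, ..., <Yk, Exp>], prepended to the remaining stack.
    stack[:0] = [(n, "Exp") for n in new_nodes]
    return stack

# The initial state targets the root for expansion; the derivation is
# complete when the stack is empty.
root = Node("ROOT")
stack = [(root, "Exp")]
expand(stack, ["NP", "VP"])   # ROOT -> NP VP
expand(stack, ["Pat"])        # NP -> Pat
```

After these two steps the stack is [⟨Pat, Exp⟩, ⟨VP, Exp⟩]: the derivation proceeds depth-first, left to right, which is what guarantees words are generated in sentence order.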
Figure 1 illustrates the partial derivation state for each order just after the generation of a word in mid-sentence.

For each derivation operation type Op, it is necessary to define an underlying grammar and estimate the parameters of a distribution POp(π|(D, S)) over next derivation operations given the current state of the derivation. For a sentence whose tree structure is known, the sequence of derivation operations for derivation orders 1 and 2 is unambiguous, and thus supervised training can be used for such a model. For derivation order 3, a known tree structure still underspecifies the order of derivation operations, so the underlying sequence of derivation operations could either be canonicalized or treated as a latent variable in training. Finally, we note that a known PCFG could be encoded in a model using any of these derivation orders; for PCFGs, the partial derivation representations used in order 3 may be thought of as marginalizing over the unary chains on the right frontier of the representations in order 2, which in turn may be thought of as marginalizing over the extra childless nonterminals in the incremental representations of order 1. In the context of the particle filter, the representations with more operation types could thus be expected to function as having larger effective sample sizes for a fixed number of particles [22]. For the experiments reported in this paper, we use derivation order 2 with a PCFG trained using unsmoothed relative-frequency estimation on the parsed Brown corpus.

This approach has several attractive features for the modeling of online human sentence comprehension. The number of particles can be considered a rough estimate of the quantity of working-memory resources devoted to the sentence comprehension task; as we will show in Section 5, sentences difficult to parse can become easier when more particles are used. 
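Unsmoothed relative-frequency estimation of the kind used to train the PCFG here is simple to state: each rule's probability is its count divided by the count of its left-hand side. A minimal sketch, on toy data rather than the parsed Brown corpus:

```python
from collections import Counter

def estimate_pcfg(rule_tokens):
    """Unsmoothed relative-frequency PCFG estimation:
    P(LHS -> RHS) = count(LHS -> RHS) / count(LHS).
    `rule_tokens` is a list of (lhs, rhs) pairs read off a treebank."""
    rule_counts = Counter(rule_tokens)
    lhs_counts = Counter(lhs for lhs, _ in rule_tokens)
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Toy treebank: S -> NP VP occurs twice, S -> VP once.
rules = [("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("VP",)),
         ("NP", ("DT", "NN")), ("VP", ("VBD",))]
probs = estimate_pcfg(rules)
```

Being unsmoothed, the estimator assigns zero probability to unseen rules, which is exactly what lets particles fail to integrate a word.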
After each word, the incremental posterior over partial structures T can be read off the particle structures and weights. Finally, the approximate surprisal of each word – a quantity argued to be correlated with many types of processing difficulty in sentence comprehension [8, 9, 23] – is essentially a by-product of the incremental parsing process: it is the negative log of the mean (unnormalized) weight P(wi|(D|i, S|i)).

4 The garden-path sentence

To provide some intuitions about our approach, we illustrate its ability to model online disambiguation in sentence comprehension using the garden-path sentence given in Example 1. In this sentence, a local structural ambiguity is introduced at the word brought due to the fact that this word could be either (i) a past-tense verb, in which case it is the main verb of the sentence and The woman is its complete subject; or (ii) a participial verb, in which case it introduces a reduced relative clause, The woman is its recipient, and the subject of the main clause has not yet been completed. This ambiguity is resolved in favor of (ii) by the word tripped, the main verb of the sentence. It is well documented (e.g., [24]) that locally ambiguous sentences such as Example 1 are read more slowly at the disambiguating region when compared with unambiguous counterparts (cf. 
The woman who was brought the sandwich from the kitchen tripped), and in cases where the local bias strongly favors (i), many readers may fail to recover the correct reading altogether.

[Figure 2 tree diagrams and per-word probabilities omitted; not recoverable from extraction.]

Figure 2: Incremental parsing of a garden-path sentence. Trees indicate the canonical structures for main-verb (above) and reduced-relative (below) interpretations. Numbers above the trees indicate the posterior probabilities of main-verb and reduced-relative interpretations, marginalizing over precise details of parse structure, as estimated by a parser using 1000 particles. Since the grammar is quite noisy, the main-verb interpretation still has some posterior probability after disambiguation at tripped. Numbers in the second-to-last line indicate the proportion of particle filters with 20 particles that produce a viable parse tree including the given word. The final line indicates the variance (×10−3) of particle weights after parsing each word.

Figure 2 illustrates the behavior of the particle filter on the garden-path sentence in Example 1. The word brought shifts the posterior strongly toward the main-verb interpretation. The rest of the reduced relative clause has little effect on the posterior, but the disambiguator tripped shifts the posterior in favor of the correct reduced-relative interpretation. In low-memory situations, as represented by a particle filter with a small number of particles (e.g., 20), the parser is usually able to construct an interpretation for the sentence up through the word kitchen, but fails at the disambiguator, and when it succeeds the variance in particle weights is high.

5 Exploring the "digging in" phenomenon

An important feature distinguishing "rational" models of online sentence comprehension [6, 7, 8, 9] from what are sometimes called "dynamical systems" models [25, 15] is that the latter have an internal feedback mechanism: in the absence of any biasing input, the activation of the leading candidate interpretation tends to grow with the passage of time. A body of evidence exists in the psycholinguistic literature that seems to support an internal feedback mechanism: increasing the duration of a local syntactic ambiguity increases the difficulty of recovery at disambiguation to the disfavored interpretation. It has been found, for example, that 2a and 3a, in which the second NP (the gossip... / the deer...) initially seems to be the object of the preceding verb, are harder to recover from than 2b and 3b [26, 27, 15].

(2) "NP/S" ambiguous sentences

a. 
Long (A-L): Tom heard the gossip about the neighbors wasn't true.

b. Short (A-S): Tom heard the gossip wasn't true.

(3) "NP/Z" ambiguous sentences

a. Long (A-L): While the man hunted the deer that was brown and graceful ran into the woods.

b. Short (A-S): While the man hunted the deer ran into the woods.

From the perspective of exact rational inference – or even for rational pruning models such as [6] – this "digging in" effect is puzzling.1 The result finds an intuitive explanation, however, in our limited-memory particle-filter model. The probability of parse failure at the disambiguating word wi is a function of (among other things) the immediately preceding estimated posterior probability of the disfavored interpretation. If this posterior probability is low, then the resampling of particles performed after processing each word provides another point at which particles representing the disfavored interpretation could be deleted. Consequently, total parse failure at the disambiguator will become more likely the greater the length of the preceding ambiguous region.

We quantify these predictions by assuming that the more often no particle is able to integrate a given word wi in context – that is, the more often P(wi|(D|i, S|i)) = 0 for every particle – the more difficult, on average, people should find wi to read. In the sentences of Examples 2-3, by far the most likely position for the incremental parser to fail is at the disambiguating verb. We can also compare processing of these sentences with syntactically similar but unambiguous controls.

(4) "NP/S" unambiguous controls

a. Long (U-L): Tom heard that the gossip about the neighbors wasn't true.

b. Short (U-S): Tom heard that the gossip wasn't true.

(5) "NP/Z" unambiguous controls

a. Long (U-L): While the man hunted, the deer that was brown and graceful ran into the woods.

b. 
Short (U-S): While the man hunted, the deer ran into the woods.

Figure 3a shows, for each sentence of each type, the proportion of runs in which the parser successfully integrated (assigned non-zero probability to) the disambiguating verb (was in Example 2a and ran in Example 3a), among those runs in which the sentence was successfully parsed up to the preceding word. Consistent with our intuitive explanation, both the presence of local ambiguity and the length of the preceding region make parse failure at the disambiguator more likely.

In the remainder of this section we test this explanation with an offline sentence acceptability study of digging-in effects. The experiment provides a way to make more detailed comparisons between the model's predictions and sentence acceptability. Consistent with the predictions of the model, ratings show differences in the magnitude of digging-in effects associated with different types of structural ambiguities. As the working-memory resources (i.e., number of particles) devoted to comprehension of the sentence increase, the probability of successful comprehension goes up, but local ambiguity and length of the second NP remain associated with greater comprehension difficulty.

5.1 Method

Thirty-two native English speakers from the university subject pool completed a questionnaire corresponding to the complexity-rating task. Forty experimental items were tested with four conditions per item, counterbalanced across questionnaires, plus 84 fillers, with sentence order pseudo-randomized. Twenty experimental items were NP/S sentences and twenty were NP/Z sentences. We used a 2 × 2 design with ambiguity and length of the ambiguous noun phrase as factors. In NP/S sentences, structural ambiguity was manipulated by the presence/absence of the complementizer that, while in NP/Z sentences, structural ambiguity was manipulated by the absence/presence of a comma after the first verb. 
Participants were asked to rate how difficult the sentences were to understand on a scale from 0 to 10, 0 indicating "Very easy" and 10 "Very difficult".

5.2 Results and Discussion

Figure 3b shows the mean complexity rating for each type of sentence. For both NP/S and NP/Z sentences, the ambiguous long-subject (A-L) condition was rated the hardest to understand, and the unambiguous short-subject (U-S) condition was rated the easiest; these results are consistent with model predictions. Within sentence type, the ratings were subjected to an analysis of variance (ANOVA) with two factors: ambiguity and length. In the case of NP/S sentences there was a main effect of ambiguity, F1(1, 31) = 12.8, p < .001, F2(1, 19) = 47.8, p < .0001, and length, F1(1, 31) = 4.9,

1 For these examples, noun phrase length is a weakly misleading cue – objects tend to be longer than subjects – so these "digging in" examples might also be analyzable as cases of exact rational inference [9]. However, the effects of length in some of the relevant experiments are quite strong. 
The explanation we offer here would magnify the effects of weakly misleading cues, and would also extend to cases where cues are neutral or even favor the ultimately correct interpretation.

[Figure 3 plots omitted; not recoverable from extraction. Panel (a), "Model results": proportion of parse successes at the disambiguator as a function of the number of particles (0-400), for conditions U-S, A-S, U-L, and A-L, in NP/S and NP/Z sentences. Panel (b), "Behavioral results": mean difficulty rating (0-6) by condition for NP/S and NP/Z sentences.]

Figure 3: Frequency of irrevocable garden path in particle-filter parser as a function of number of particles, and mean empirical difficulty rating, for NP/S and NP/Z sentences.

p = .039, F2(1, 19) = 32.9, p < .0001, and the interaction between factors was significant, F1(1, 31) = 8.28, p = .007, F2(1, 19) = 5.56, p = .029. In the case of NP/Z sentences there was a main effect of ambiguity, F1(1, 31) = 63.6, p < .0001, F2(1, 19) = 150.9, p < .0001, and length, F1(1, 31) = 127.2, p < .0001, F2(1, 19) = 124.7, p < .0001, and the interaction between factors was significant by subjects only, F1(1, 31) = 4.6, p = .04, F2(1, 19) = 1.6, p = .2. The experiment thus bore out most of the model's predictions, with ambiguity and length combining to make sentence processing more difficult. 
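The resampling account of digging in can be illustrated with a toy two-interpretation simulation (all numbers and names here are illustrative, not the paper's grammar-based parser): each word of the ambiguous region is one more resampling step at which every particle holding the disfavored interpretation can be deleted, so survival to the disambiguator falls as the region lengthens.

```python
import random

def survival_rate(ambiguous_len, n_particles=20, p_disfavored=0.2, n_runs=2000):
    """Toy model: each particle holds interpretation 'A' (favored) or 'B'
    (disfavored). Each word of the ambiguous region weights A-particles by
    (1 - p_disfavored) and B-particles by p_disfavored, then resamples.
    Returns the fraction of runs in which at least one B-particle survives
    to the disambiguator (which only B can integrate)."""
    random.seed(0)  # fixed seed for reproducibility
    survived = 0
    for _ in range(n_runs):
        parts = random.choices("AB", weights=[1 - p_disfavored, p_disfavored],
                               k=n_particles)
        for _ in range(ambiguous_len):
            w = [p_disfavored if p == "B" else 1 - p_disfavored for p in parts]
            parts = random.choices(parts, weights=w, k=n_particles)
        survived += "B" in parts
    return survived / n_runs
```

Running `survival_rate` with increasing `ambiguous_len` shows the survival probability of the disfavored interpretation dropping monotonically, mirroring the length effect in Figure 3a.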
One reason that our model may underestimate the effect of subject length on ease of understanding, at least in the NP/Z case, is the tendency of subject NPs to be short in English, which was not captured in the grammar used by the model.

6 Conclusion and Future Work

In this paper we have presented a new incremental parsing algorithm based on the particle filter and shown that it provides a useful foundation for modeling the effect of memory limitations in human sentence comprehension, including a novel solution to the problem posed by "digging-in" effects [15] for rational models. In closing, we point out two issues – both involving the problem of resampling prominent in particle filter research – in which we believe future research may help deepen our understanding of language processing.

The first issue involves the question of when to resample. In this paper, we have taken the approach of generating values of zn−1 from which to draw P(zn|zn−1, x1...n−1) by sampling with replacement (i.e., resampling) after every word from the multinomial over P(zn−1|x1...n−1) represented by the weighted particles. This approach has the problem that particle diversity can be lost rapidly, as it decreases monotonically with the number of observations. Another option is to resample only when the variance in particle weights exceeds a predefined threshold, sampling without replacement when this variance is low [22]. As Figure 2 shows, a word that resolves a garden path generally creates high weight variance. Our preliminary investigations indicate that associating variance-sensitive resampling with processing difficulty leads to qualitatively similar predictions to the total-parse-failure approach taken in Section 5, but further investigation is required.

The other issue involves how to resample. 
Since particle diversity can never increase, when parts of the space of possible T are missed by chance early on, they can never be recovered. As a consequence, applications of the particle filter in machine learning and statistics tend to supplement the basic algorithm with additional steps such as running Markov chain Monte Carlo on the particles in order to re-introduce diversity (e.g., [28]). Further work would be required, however, to specify an MCMC algorithm over trees given an input prefix. Both of these issues may help achieve a deeper understanding of the details of reanalysis in garden-path recovery [29]. For example, the initial reaction of many readers to the sentence The horse raced past the barn fell is to wonder what a "barn fell" is. With variance-sensitive resampling, this observation could be handled by smoothing the probabilistic grammar; with diversity-introducing MCMC, it might be handled by tree-changing operations chosen during reanalysis.

Acknowledgments

RL would like to thank Klinton Bicknell and Gabriel Doyle for useful comments and suggestions. FR and TLG were supported by grants BCS-0631518 and BCS-070434 from the National Science Foundation.

References

[1] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[2] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition, 2008.

[3] D. Jurafsky. Probabilistic modeling in psycholinguistics: Linguistic comprehension and production. In Rens Bod, Jennifer Hay, and Stefanie Jannedy, editors, Probabilistic Linguistics, pages 39–95. MIT Press, 2003.

[4] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. Eberhard, and J. C. Sedivy. Integration of visual and linguistic information in spoken language comprehension.
Science, 268:1632–1634, 1995.

[5] G. T. Altmann and Y. Kamide. Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition, 73(3):247–264, 1999.

[6] D. Jurafsky. A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20(2):137–194, 1996.

[7] N. Chater, M. Crocker, and M. Pickering. The rational analysis of inquiry: The case for parsing. In M. Oaksford and N. Chater, editors, Rational models of cognition. Oxford, 1998.

[8] J. Hale. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of NAACL, volume 2, pages 159–166, 2001.

[9] R. Levy. Expectation-based syntactic comprehension. Cognition, 106:1126–1177, 2008.

[10] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970.

[11] A. Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201, 1995.

[12] M.-J. Nederhof. The computational complexity of the correct-prefix property for TAGs. Computational Linguistics, 25(3):345–360, 1999.

[13] L. Huang and D. Chiang. Better k-best parsing. In Proceedings of the International Workshop on Parsing Technologies, 2005.

[14] M. Johnson, T. L. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, 2007.

[15] W. Tabor and S. Hutchins. Evidence for self-organized sentence processing: Digging in effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2):431–450, 2004.

[16] B. Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276, 2001.

[17] M. Collins and B. Roark.
Incremental parsing with the perceptron algorithm. In Proceedings of the ACL, 2004.

[18] J. Henderson. Lookahead in deterministic left-corner parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, 2004.

[19] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, 2001.

[20] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro. A more rational model of categorization. In Proceedings of the 28th Annual Conference of the Cognitive Science Society, Mahwah, NJ, 2006. Erlbaum.

[21] N. Daw and A. Courville. The pigeon as particle filter. In Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.

[22] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Advances in Neural Information Processing Systems, 2000.

[23] N. Smith and R. Levy. Optimal processing times in reading: a formal model and empirical investigation. In Proceedings of the 30th Annual Meeting of the Cognitive Science Society, 2008.

[24] M. C. MacDonald. Probabilistic constraints and syntactic ambiguity resolution. Language and Cognitive Processes, 9(2):157–201, 1994.

[25] M. J. Spivey and M. K. Tanenhaus. Syntactic ambiguity resolution in discourse: Modeling the effects of referential content and lexical frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(6):1521–1543, 1998.

[26] L. Frazier and K. Rayner. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14:178–210, 1982.

[27] F. Ferreira and J. M. Henderson. Recovery from misanalyses of garden-path sentences. Journal of Memory and Language, 31:725–745, 1991.

[28] N. Chopin. A sequential particle filter method for static models.
Biometrika, 89:539–552, 2002.

[29] P. Sturt, M. J. Pickering, and M. W. Crocker. Structural change and reanalysis difficulty in language comprehension. Journal of Memory and Language, 40:143–150, 1999.