{"title": "A Holistic Approach to Compositional Semantics: a connectionist model and robot experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": "", "full_text": "A Holistic Approach\n\nto Compositional Semantics:\n\na connectionist model and robot experiments\n\nYuuya Sugita\nBSI, RIKEN\n\nHirosawa 2-1, Wako-shi\nSaitama 3510198 JAPAN\n\nsugita@bdc.brain.riken.go.jp\n\nJun Tani\n\nBSI, RIKEN\n\nHirosawa 2-1, Wako-shi\nSaitama 3510198 JAPAN\ntani@bdc.brain.riken.go.jp\n\nAbstract\n\nWe present a novel connectionist model for acquiring the semantics of a\nsimple language through the behavioral experiences of a real robot. We\nfocus on the \u201ccompositionality\u201d of semantics, a fundamental character-\nistic of human language, which is the ability to understand the meaning\nof a sentence as a combination of the meanings of words. We also pay\nmuch attention to the \u201cembodiment\u201d of a robot, which means that the\nrobot should acquire semantics which matches its body, or sensory-motor\nsystem. The essential claim is that an embodied compositional semantic\nrepresentation can be self-organized from generalized correspondences\nbetween sentences and behavioral patterns. This claim is examined and\ncon\ufb01rmed through simple experiments in which a robot generates corre-\nsponding behaviors from unlearned sentences by analogy with the corre-\nspondences between learned sentences and behaviors.\n\n1 Introduction\nImplementing language acquisition systems is one of the most di\ufb03cult problems, since\nnot only the complexity of the syntactical structure, but also the diversity in the domain\nof meaning make this problem complicated and intractable. 
In particular, how linguistic\nmeaning can be represented in the system is crucial, and this problem has been investigated\nfor many years.\n\nIn this paper, we introduce a connectionist model to acquire the semantics of language with\nrespect to the behavioral patterns of a real robot. An essential question is how embod-\nied compositional semantics can be acquired in the proposed connectionist model without\nproviding any representations of the meaning of a word or behavior routines a priori. By\n\u201ccompositionality\u201d, we refer to the fundamental human ability to understand a sentence\nfrom (1) the meanings of its constituents, and (2) the way in which they are put together.\nIt is possible for a language acquisition system that acquires compositional semantics to\nderive the meaning of an unknown sentence from the meanings of known sentences. Con-\nsider the unknown sentence: \u201cJohn likes birds.\u201d It could be understood by learning these\nthree sentences: \u201cJohn likes cats.\u201d; \u201cMary likes birds.\u201d; and \u201cMary likes cats.\u201d That is to\nsay, generalization of meaning can be achieved through compositional semantics.\n\nFrom the point of view of compositionality, the symbolic representation of word meaning\nhas much a\ufb03nity with processing the linguistic meaning of sentences [4]. Following this\nobservation, various learning models have been proposed to acquire the embodied seman-\n\n\ftics of language. For example, some models learn semantics in the form of correspondences\nbetween sentences and non-linguistic objects, i.e., visual images [10] or the sensory-motor\npatterns of a robot [7, 13].\n\nIn these works, the syntactic aspect of language was acquired through a pre-acquired lex-\nicon. This means that the meanings of words (i.e., lexicon) is acquired independently of\nthe usages of words in sentences (i.e., syntax). 
Although this separated learning approach\nseems to be plausible from the requirements of compositionality, it causes inevitable dif-\n\ufb01culties in representing the meaning of a sentence. A priori separation of lexicon and\nsyntax requires a pre-de\ufb01ned manner of combining word meanings into the meaning of a\nsentence. In Iwahashi\u2019s model, the class of a word is assumed to be given prior to learn-\ning its meaning because di\ufb00erent acquisition algorithms are required for nouns and verbs\n(c.f., [12]). Moreover, the meaning of a sentence is obtained by \ufb01lling a pre-de\ufb01ned tem-\nplate with meanings of words. Roy\u2019s model does not require a priori knowledge of word\nclasses, but requires the strong assumption, that the meaning of a word can be assigned\nto some pre-de\ufb01ned attributes of non-linguistic objects. This assumption is not realistic\nin more complex cases, such as when the meaning of a word needs to be extracted from\nnon-linguistic spatio-temporal patterns, as in case of learning verbs.\n\nIn this paper, we discuss an essential mechanism for self-organizing embodied composi-\ntional semantic representations, in which separate treatments of words and syntax are not\nrequired. Our model implements compositional semantics by utilizing the generalization\ncapability of an RNN, where the meaning of each word cannot exist independently, but\nemerges from the relations with others (c.f., reverse compositionality, [3]). In this situa-\ntion, a sort of generalization can be expected, such that the meanings of novel sentences\ncan be inferred by analogy with learned ones.\n\nThe experiments were conducted using a real mobile robot with an arm and with various\nsensors, including a vision system. A \ufb01nite set of two-word sentences consisting of a verb\nfollowed by a noun was considered. 
Our analysis will clarify what sorts of internal neural\nstructures should be self-organized for achieving compositional semantics grounded to a\nrobot\u2019s behavioral experiences. Although our experimental design is limited, the current\nstudy will suggest an essential mechanism for acquiring grounded compositional seman-\ntics, with the minimal combinatorial structure of this \ufb01nite language [2].\n\n2 Task Design\nThe aim of our experimental task is to discuss an essential mechanism for self-organizing\ncompositional semantics based on the behavior of a robot. In the training phase, our robot\nlearns the relationships between sentences and the corresponding behavioral sensory-motor\nsequences of a robot in a supervised manner. It is then tested to generate behavioral se-\nquences from a given sentence. We regard compositional semantics as being acquired if\nappropriate behavioral sequences can be generated from unlearned sentences by analogy\nwith learned data.\n\nOur mobile robot has three actuators, with two wheels and a joint on the arm; a colored\nvision sensor; and two torque sensors, on the wheel and the arm (Figure 1a). The robot\noperates in an environment where three colored objects (red, blue, and green) are placed\non the \ufb02oor (Figure 1b). The positions of these objects can be varied so long as the robot\nsees the red object on the left side of its \ufb01eld of view, the green object in the middle, and\nthe blue object on the right at the start of every trial of behavioral sequences. The robot\nthus learns nine categories of behavioral patterns, consisting of pointing at, pushing, and\nhitting each of the three objects, in a supervised manner. 
These categories are denoted as\nPOINT-R, POINT-B, POINT-G, PUSH-R, PUSH-B, PUSH-G, HIT-R, HIT-B, and HIT-G\n(Figure 1c-e).\n\nThe robot also learns sentences which consist of one of 3 verbs (point, push, hit) followed\nby one of 6 nouns (red, left, blue, center, green, right). The meanings of\nthese 18 possible sentences are given in terms of fixed correspondences with the 9 behavioral\ncategories (Figure 2). For example, \u201cpoint red\u201d and \u201cpoint left\u201d correspond to\nPOINT-R, \u201cpoint blue\u201d and \u201cpoint center\u201d to POINT-B, and so on. In these correspondences,\n\u201cleft,\u201d \u201ccenter,\u201d and \u201cright\u201d have exactly the same meaning as \u201cred,\u201d\n\u201cblue,\u201d and \u201cgreen\u201d respectively. These synonyms are introduced to observe how the\nbehavioral similarity affects the acquired linguistic semantic structure.\n\nFigure 1: The mobile robot (a) starts from a fixed position in the environment and (b) ends\neach behavior by (c) pointing at, (d) pushing, or (e) hitting an object.\n[Figure 1 labels: Starting Position; Blue, Green, Red; POINT-G, PUSH-G, HIT-G.]\n\nFigure 2: The correspondence between sentences and behavioral categories. Each\nbehavioral category has two corresponding sentences.\nPOINT-R: \"point red\", \"point left\"\nPOINT-B: \"point blue\", \"point center\"\nPOINT-G: \"point green\", \"point right\"\nPUSH-R: \"push red\", \"push left\"\nPUSH-B: \"push blue\", \"push center\"\nPUSH-G: \"push green\", \"push right\"\nHIT-R: \"hit red\", \"hit left\"\nHIT-B: \"hit blue\", \"hit center\"\nHIT-G: \"hit green\", \"hit right\"\n\n3 Proposed Model\nOur model employs two RNNs with parametric bias nodes (RNNPBs) [15] in order to\nimplement a linguistic module and a behavioral module (Figure 3). The RNNPB, like the\nconventional Jordan-type RNN [8], is a connectionist model for learning time sequences. 
The\nlinguistic module learns the above sentences represented as time sequences of words [1],\nwhile the behavioral module learns the behavioral sensory-motor sequences of the robot.\nTo acquire the correspondences between the sentences and behavioral sequences, these two\nmodules are connected to each other by using the parametric bias binding method. Before\ndiscussing this binding method in detail, we introduce the overall architecture of RNNPB.\n\n[Figure 3 diagram labels: Linguistic Module (word input nodes, parametric bias nodes, context nodes, word prediction output nodes); Behavioral Module (sensory-motor input nodes, parametric bias nodes, context nodes, sensory-motor prediction output nodes); Interaction via parametric binding method.]\n\nFigure 3: Our model is composed of two RNNs with parametric bias nodes (RNNPBs),\none for a linguistic module and the other for a behavioral module. Both modules interact\nwith each other during the learning process via the parametric bias method introduced in\nthe text.\n\n3.1 RNNPB\nThe RNNPB has the same neural architecture as the Jordan-type RNN except for the PB\nnodes in the input layer (cf. each module of Figure 3). Unlike the other input nodes, these\nPB nodes take a specific constant vector throughout each time sequence, and are employed\nto implement a mapping between fixed-length vectors and time sequences.\n\nLike the conventional Jordan-type RNN, the RNNPB learns time sequences in a supervised\nmanner. The difference is that in the RNNPB, the vectors that encode the time sequences\nare self-organized in PB nodes during the learning process. The common structural properties\nof all the training time sequences are acquired as connection weight values by using the\nback-propagation through time (BPTT) algorithm, as used also in the conventional RNN\n[8, 11]. 
Meanwhile, the specific properties of each individual time sequence are simultaneously\nencoded as PB vectors (cf. [9]). As a result, the RNNPB self-organizes a mapping\nbetween the PB vectors and the time sequences.\n\nThe learning algorithm for the PB vectors is a variant of the BPTT algorithm. For each of n\ntraining time sequences of real-numbered vectors x_0, \u00b7\u00b7\u00b7, x_{n-1}, the back-propagated errors\nwith respect to the PB nodes are accumulated for all time steps to update the PB vectors.\nFormally, the update rule for the PB vector p_{x_i} encoding the i-th training time sequence x_i\nis given as follows:\n\n\\delta^2 p_{x_i} = \\frac{1}{l_i} \\sum_{t=0}^{l_i-1} \\mathrm{error}_{p_{x_i}}(t)    (1)\n\\delta p_{x_i} = \\epsilon \\cdot \\delta^2 p_{x_i} + \\eta \\cdot \\delta p_{x_i}^{\\mathrm{old}}    (2)\np_{x_i} = p_{x_i}^{\\mathrm{old}} + \\delta p_{x_i}    (3)\n\nIn equation (1), the update of the PB vector, \\delta^2 p_{x_i}, is obtained from the average back-propagated\nerror with respect to a PB node, \\mathrm{error}_{p_{x_i}}(t), through all time steps from t = 0 to l_i - 1, where\nl_i is the length of x_i. In equation (2), this update is low-pass filtered to inhibit frequent rapid\nchanges in the PB vectors.\n\nAfter successfully learning the time sequences, the RNNPB can generate a time sequence\nx_i from its corresponding PB vector p_{x_i}. The actual generation process of a time sequence\nx_i is implemented by iteratively utilizing the RNNPB with the corresponding PB vector\np_{x_i}, a fixed initial context vector, and input vectors for each time step. Depending on the\nrequired functionality, both the external information (e.g., sensory information) and the\ninternal prediction (e.g., motor commands) are employed as input vectors.\n\nHere, we introduce an abstracted operational notation for the RNNPB to facilitate a later\nexplanation of our proposed method of binding language and behavior. 
By using an operator\nRNNPB, the generation of x_i from p_{x_i} is described as follows:\n\n\\mathrm{RNNPB}(p_{x_i}) \\rightarrow x_i, \\quad i = 0, \\cdots, n-1.    (4)\n\nFurthermore, the RNNPB can be used not only for sequence generation processes but also\nfor recognition processes. For a given sequence x_i, the corresponding PB vector p_{x_i} can\nbe obtained by using the update rules for the PB vectors (equations (1) to (3)), without\nupdating the connection weight values. This inverse operation for generation is regarded\nas recognition, and is hence denoted as follows:\n\n\\mathrm{RNNPB}^{-1}(x_i) \\rightarrow p_{x_i}, \\quad i = 0, \\cdots, n-1.    (5)\n\nAnother important characteristic of the RNNPB is that the relational structure\namong the training time sequences can be acquired in the PB space through the learning\nprocess. This generalization capability of RNNPB can be employed to generate and recognize\nunseen time sequences without any additional learning. For instance, by learning\nseveral cyclic time sequences of different frequencies, novel time sequences of intermediate\nfrequencies can be generated [6].\n\n3.2 Binding\nIn the proposed model, corresponding sentences and behavioral sequences are constrained\nto have the same PB vectors in both modules. Under this condition, corresponding behavioral\nsequences can be generated naturally from sentences. When a sentence s_i and its\ncorresponding behavioral sequence b_i have the same PB vector, we can obtain b_i from s_i\nas follows:\n\n\\mathrm{RNNPB}_B(\\mathrm{RNNPB}_L^{-1}(s_i)) \\rightarrow b_i    (6)\n\nwhere RNNPB_L and RNNPB_B are abstracted operators for the linguistic module and the\nbehavioral module, respectively.\nThe PB vector p_{s_i} is obtained by recognizing the sentence s_i. Because of the constraint\nthat corresponding sentences and behavioral sequences must have the same PB vectors,\np_{b_i} is equal to p_{s_i}. 
Therefore, we can obtain the corresponding behavioral sequence b_i by\nutilizing the behavioral module with p_{b_i}.\nThe binding constraint is implemented by introducing an interaction term into part of the\nupdate rule for the PB vectors (equation (3)):\n\np_{s_i} = p_{s_i}^{\\mathrm{old}} + \\delta p_{s_i} + \\gamma_L \\cdot (p_{b_i}^{\\mathrm{old}} - p_{s_i}^{\\mathrm{old}})    (7)\np_{b_i} = p_{b_i}^{\\mathrm{old}} + \\delta p_{b_i} + \\gamma_B \\cdot (p_{s_i}^{\\mathrm{old}} - p_{b_i}^{\\mathrm{old}})    (8)\n\nwhere \\gamma_L and \\gamma_B are positive coefficients that determine the strength of the binding. Equations\n(7) and (8) are the constrained update rules for the linguistic module and the behavioral\nmodule, respectively. Under these rules, the PB vectors of a corresponding sentence s_i and\nbehavioral sequence b_i attract each other. In practice, the corresponding PB vectors p_{s_i} and\np_{b_i} need not be completely equalized to learn a correspondence. Small (epsilon) differences between the\nPB vectors can be neglected because of the continuity of PB spaces.\n\n3.3 Generalization of Correspondences\nAs noted above, our model enables a robot to understand a sentence by means of a generated\nbehavior as if the meaning of the sentence were composed of the meanings of the\nconstituents. That is to say, the robot can generate appropriate behavioral sequences from\nall sentences without learning all correspondences. To achieve this, an unlearned sentence\nand its corresponding behavioral sequences must have the same PB vector. Nevertheless,\nthe PB binding method only equalizes the PB vectors for given corresponding sentences\nand behavioral sequences (cf. equations (7) and (8)).\n\nImplicit binding, or in other words, inter-module generalization of correspondences, is\nachieved by dynamic coordination between the PB binding method and the intra-module\ngeneralization of each module. The local effect of the PB binding method spreads over\nthe whole PB space, because each individual PB vector depends on the others in order to\nself-organize PB structures reflecting the relationships among training data. 
Thus, the PB\nstructures of both modules densely interact via the PB binding methods. Finally, both PB\nstructures converge into a common PB structure, and therefore, all corresponding sentences\nand behavioral sequences then share the same PB vectors automatically.\n\n4 Experiments\nIn the learning phase, the robot learned 14 of 18 correspondences between sentences and\nbehavioral patterns (c.f., Figure 2). It was then tested to generate behavioral sequences\nfrom each of the remaining 4 sentences (\u201cpoint green\u201d, \u201cpoint right\u201d, \u201cpush red\u201d,\nand \u201cpush left\u201d).\n\nTo enable a robot to learn correspondences robustly, \ufb01ve corresponding sentences and be-\nhavioral sequences were associated by using the PB binding method for each of the 14\ntraining correspondences. Thus, the linguistic module learned 70 sentences with PB bind-\ning. Meanwhile, the behavioral module learned the behavioral sequences of the 9 cate-\ngories, including 2 categories which had no corresponding sentences in the training set.\nThe behavioral module learned 10 di\ufb00erent sensory-motor sequences for each behavioral\ncategory. It therefore learned 70 behavioral sequences corresponding to the training sen-\ntences with PB binding and the remaining 20 sequences independently. In addition, the\nbehavioral module learned the same 90 behavioral sequences without binding.\n\nA sentence is represented as a time sequence of words, which starts with a \ufb01xed starting\nsymbol. Each word is locally represented, such that each input node of the module corre-\n\n\fsponds to a speci\ufb01c word. A single input node takes a value of 1.0 while the others take\n0.0 [1]. The linguistic module has 10 input nodes for each of 9 words and a starting sym-\nbol. The module also has 6 parametric bias nodes, 4 context nodes, 50 hidden nodes, and\n10 prediction output nodes. 
Thus, no a priori knowledge about the meanings of words is\npre-programmed.\n\nA training behavioral sequence was created by sampling three sensory-motor vectors per\nsecond during a trial of the robot\u2019s human-guided behavior. For robust learning of behavior,\neach training behavioral sequence was generated under a slightly di\ufb00erent environment in\nwhich object positions were varied. The variation was at most 20 percent of the distance\nbetween the starting position of the robot and the original position of each object in every\ndirection (c.f., Figure 1b). Typical behavioral sequences are about 5 to 25 seconds long,\nand therefore have about 15 to 75 sensory-motor vectors. A sensory-motor vector is a real-\nnumbered 26-dimensional vector consisting of 3 motor values (for 2 wheels and the arm), 2\nvalues from torque sensors (of the wheels and the arm), and 21 values encoding the visual\nimage. The visual \ufb01eld is divided vertically into 7 regions, and each region is represented\nby (1) the fraction of the region covered by the object, (2) the dominant hue of the object\nin the region, and (3) the bottom border of the object in the region, which is proportional\nto the distance of the object from the camera. The behavioral module had 26 input nodes\nfor sensory-motor input, 6 parametric bias nodes, 6 context nodes, 70 hidden nodes, and 6\noutput nodes for motor commands and partial prediction of the sensory image at the next\ntime step.\n\n5 Results and Analysis\nIn this section, we analyze the results of the experiment presented in the previous section.\nThe analysis reveals that the inter-module generalization realized by the PB binding method\ncould \ufb01ll an essential role in self-organizing the compositional semantics of the simple\nlanguage through the behavioral experiences of the robot. As mentioned in the previous\nsection, the training data for this experiment did not include all the correspondences. 
As\na result, although the behavioral module was trained with the behavioral sequences of all\nbehavioral categories, those in two of the categories, whose corresponding sentences were\nnot in the linguistic training set, could not be bound.\n\nThe most important result was that these dangling behavioral sequences could be bound\nwith appropriate sentences. The robot could properly recognize four unseen sentences, and\ngenerate the corresponding behaviors. This means that both modules share the common\nPB structure successfully.\n\nComparing the PB spaces of both modules shows that they indeed shared a common struc-\nture as a result of binding. The linguistic PB vectors are computed by recognizing all\nthe possible 18 sentences including 4 unseen ones (Figure 4a), and the behavioral PB\nvectors are computed at the learning phase for all the corresponding 90 behavioral se-\nquences in the training data (Figure 4b). The acquired correspondences between sen-\ntences and behavioral sequences can be examined according to equation (6). In particu-\nlar, the implicit binding of the four unlearned correspondences (\u201cpoint green\u201d\u2194POINT-\nG, \u201cpoint right\u201d\u2194POINT-G, \u201cpush red\u201d\u2194PUSH-R, and \u201cpush left\u201d\u2194PUSH-R)\ndemonstrates acquisition of the underlying semantics, or the generalized correspondences.\n\nThe acquired common structure has two striking characteristics: (1) the combinatorial\nstructure originated from the linguistic module, and (2) the metric based on the behav-\nioral similarity originated from the behavioral module. The interaction between modules\nenabled both PB spaces to simultaneously acquire both of these two structural properties.\n\nWe can \ufb01nd three congruent sub-structures for each verb, and six congruent sub-structures\nfor each noun in the linguistic PB space. 
This congruency represents the underlying syntax\nstructure of training sentences. For example, it is possible to estimate the PB vector\nof \u201cpoint green\u201d from the relationship among the PB vectors of \u201cpoint blue\u201d, \u201chit\nblue\u201d and \u201chit green.\u201d This predictable geometric regularity could be acquired by independent\nlearning of the linguistic module. However, it could not be acquired by independent\nlearning of the behavioral module because these behavioral sequences cannot be decomposed\ninto plausible primitives, unlike the sentences which can be broken down into words.\n\n[Figure 4 plots: horizontal axis, the first principal component; vertical axis, the second principal component; both axes range from 0.2 to 0.8. Panel (a) Linguistic module plots the 18 sentences; panel (b) Behavioral module plots the 9 behavioral categories.]\n\nFigure 4: Plots of the bound linguistic module (a) and the bound behavioral module (b).\nBoth plots are projections of the PB spaces onto the same surface determined by the PCA\nmethod. Here, the accumulated contribution rate is about 73%. Unlearned sentences and\ntheir corresponding behavioral categories are underlined.\n\nWe can also see a metric reflecting the similarity of behavioral sequences not only in the\nbehavioral module but also in the linguistic module. The PB vectors of sentences that\ncorrespond to the same behavioral category take similar values. 
For example, the two\nsentences corresponding to POINT-R (\u201cpoint red\u201d and \u201cpoint left\u201d) are encoded in\nsimilar PB vectors. Such a metric nature could not be observed in the independent learning\nof the linguistic module, in which all nouns were plotted symmetrically in the PB space by\nmeans of the syntactical constraints.\n\nThe above observation thus con\ufb01rms that the embodied compositional semantics was self-\norganized through the uni\ufb01cation of both modules, which was implemented by the PB\nbinding method. We also made experiments with di\ufb00erent test sentences, and con\ufb01rmed\nthat similar results could be obtained.\n\n6 Discussion and Summary\nOur simple experiments showed that the minimal grounded compositional semantics of our\nlanguage can be acquired by generalizing the correspondences between sentences and the\nbehavioral sensory-motor sequences of a robot. Our experiments could not examine strong\nsystematicity [4], but could address the combinatorial characteristic nature of sentences.\nThat is to say, the robot could understand relatively simple sentences in a systematic way,\nand could understand novel sentences. Therefore, our results can elucidate some important\nissues about the compositional semantic representation.\n\nWe claim that the acquisition of word meaning and syntax can not be separated from the\nstandpoint of the symbol grounding problem [5]. The meanings of words depend on each\nother to compose the meanings of sentences [16]. Consider the meaning of the word \u201cred.\u201d\nThe meaning of \u201cred\u201d must be something which combines with the meaning of \u201cpoint\u201d,\n\u201cpush\u201d or \u201chit\u201d to form the grounded meanings of sentences. Therefore, a priori de\ufb01nition\nof the meaning of \u201cred\u201d substantially a\ufb00ects the organization of the other parts of the\nsystem, and often results in further pre-programming. 
This means that it is inevitably\ndifficult to explicitly extract the meaning of a word from the meaning of a sentence.\n\nOur model avoids this difficulty by implementing the grounded meaning of a word implicitly\nin terms of the relationships among the meanings of sentences based on behavioral\nexperiences. Our model does not require any pre-programming of syntactic information,\nsuch as symbolic representation of word meaning, a predefined combinatorial structure in\nthe semantic domain, or behavior routines. Instead, the essential structures accounting for\ncompositionality are fully self-organized in the iterative dynamics of the RNN, through the\nstructural interactions between language and behavior using the PB binding method. Thus,\nthe robot can understand \u201cred\u201d through its behavioral interactions in the designed tasks in\na bottom-up way [14]. A similar argument holds true for verbs. For example, the robot\nunderstands \u201cpoint\u201d through pointing at red, blue, and green objects.\n\nIn summary, the current study has shown the importance of generalization of the correspondences\nbetween sentences and behavioral patterns in the acquisition of an embodied\nlanguage. In future studies, we plan to apply our model to larger language sets. In the current\nexperiment, the training set consists of a large fraction of the legal input space, when\ncompared with related works. Such a large training set is needed because our model has\nno a priori knowledge of syntax and composition rules. However, we think that our model\nwould require a relatively smaller fraction of sentences to learn a larger language set, for a given\ndegree of syntactic complexity.\n\nReferences\n[1] J. L. Elman. Finding structure in time. Cognitive Science, 14:179\u2013211, 1990.\n[2] G. Evans. Semantic Theory and Tacit Knowledge. In S. Holzman and C. Leich, editors, Wittgenstein:\nTo Follow a Rule. London: Routledge and Kegan Paul, 1981.\n[3] J. Fodor. 
Why Compositionality Won\u2019t Go Away: Reflections on Horwich\u2019s \u2018Deflationary\u2019\nTheory. Technical Report 46, Rutgers University, 1999.\n[4] R. F. Hadley. Systematicity revisited: reply to Christiansen and Chater and Niklasson and van\nGelder. Mind and Language, 9:431\u2013444, 1994.\n[5] S. Harnad. The symbol grounding problem. Physica D, 42:335\u2013346, 1990.\n[6] M. Ito and J. Tani. Generalization and Diversity in Dynamic Pattern Learning and Generation\nby Distributed Representation Architecture. Technical Report 3, Lab. for BDC, Brain Science\nInstitute, RIKEN, 2003.\n[7] N. Iwahashi. Language acquisition by robots \u2013 Towards New Paradigm of Language Processing \u2013.\nJournal of Japanese Society for Artificial Intelligence, 18(1):49\u201358, 2003.\n[8] M. I. Jordan and D. E. Rumelhart. Forward models: supervised learning with a distal teacher.\nCognitive Science, 16:307\u2013354, 1992.\n[9] R. Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts,\nLexicon, and Memory. MIT Press, 1993.\n[10] D. K. Roy. Learning visually grounded words and syntax for a scene description task. Computer\nSpeech and Language, 16, 2002.\n[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error\npropagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing.\nCambridge, MA: MIT Press, 1986.\n[12] J. M. Siskind. Grounding the Lexical Semantics of Verbs in Visual Perception using Force\nDynamics and Event Logic. Journal of Artificial Intelligence Research, 15:31\u201390, 2001.\n[13] L. Steels. The Emergence of Grammar in Communicating Autonomous Robotic Agents. In\nW. Horn, editor, Proceedings of European Conference of Artificial Intelligence, pages 764\u2013769.\nIOS Press, 2000.\n[14] J. Tani. Model-Based Learning for Mobile Robot Navigation from the Dynamical Systems\nPerspective. IEEE Trans. 
on SMC (B), 26(3):421\u2013436, 1996.\n\n[15] J. Tani. Learning to generate articulated behavior through the bottom-up and the top-down\n\ninteraction process. Neural Networks, 16:11\u201323, 2003.\n\n[16] T. Winograd. Understanding natural language. Cognitive Psychology, 3(1):1\u2013191, 1972.\n\n\f", "award": [], "sourceid": 2383, "authors": [{"given_name": "Yuuya", "family_name": "Sugita", "institution": null}, {"given_name": "Jun", "family_name": "Tani", "institution": null}]}