{"title": "A Novel Two-Step Method for Cross Language Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1259, "page_last": 1267, "abstract": "Cross language text classi\ufb01cation is an important learning task in natural language processing. A critical challenge of cross language learning lies in that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Speci\ufb01cally, we \ufb01rst formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed approach is evaluated by conducting a set of experiments with cross language sentiment classi\ufb01cation tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning approach outperforms a number of comparison cross language representation learning methods, especially when the number of parallel bilingual documents is small.", "full_text": "A Novel Two-Step Method for Cross Language\n\nRepresentation Learning\n\nMin Xiao and Yuhong Guo\n\nDepartment of Computer and Information Sciences\nTemple University, Philadelphia, PA 19122, USA\n\n{minxiao, yuhong}@temple.edu\n\nAbstract\n\nCross language text classi\ufb01cation is an important learning task in natural language\nprocessing. A critical challenge of cross language learning arises from the fact that\nwords of different languages are in disjoint feature spaces. 
In this paper, we pro-\npose a two-step representation learning method to bridge the feature spaces of dif-\nferent languages by exploiting a set of parallel bilingual documents. Speci\ufb01cally,\nwe \ufb01rst formulate a matrix completion problem to produce a complete parallel\ndocument-term matrix for all documents in two languages, and then induce a low\ndimensional cross-lingual document representation by applying latent semantic\nindexing on the obtained matrix. We use a projected gradient descent algorithm\nto solve the formulated matrix completion problem with convergence guarantees.\nThe proposed method is evaluated by conducting a set of experiments with cross\nlanguage sentiment classi\ufb01cation tasks on Amazon product reviews. The experi-\nmental results demonstrate that the proposed learning method outperforms a num-\nber of other cross language representation learning methods, especially when the\nnumber of parallel bilingual documents is small.\n\n1\n\nIntroduction\n\nCross language text classi\ufb01cation is an important natural language processing task that exploits a\nlarge amount of labeled documents in an auxiliary source language to train a classi\ufb01cation model for\nclassifying documents in a target language where labeled data is scarce. An effective cross language\nlearning system can greatly reduce the manual annotation effort in the target language for learning\ngood classi\ufb01cation models. Previous work in the literature has demonstrated successful performance\nof cross language learning systems on various cross language text classi\ufb01cation problems, including\nmultilingual document categorization [2], cross language \ufb01ne-grained genre classi\ufb01cation [14], and\ncross-lingual sentiment classi\ufb01cation [18, 16].\n\nThe challenge of cross language text classi\ufb01cation lies in the language barrier. 
That is, documents in different languages are expressed with different word vocabularies and thus have disjoint feature spaces. A variety of methods have been proposed in the literature to address cross language text classification by bridging the cross language gap, including transforming the training or test data from one language domain into the other by using machine translation tools or bilingual lexicons [18, 6, 23], and constructing cross-lingual representations by using readily available auxiliary resources such as bilingual word pairs [16], comparable corpora [10, 20, 15], and other multilingual resources [3, 14].

In this paper, we propose a two-step learning method to induce cross-lingual feature representations for cross language text classification by exploiting a set of unlabeled parallel bilingual documents. First we construct a concatenated bilingual document-term matrix where each document is represented in the concatenated vocabulary of the two languages. In such a matrix, a pair of parallel documents is represented as a row vector filled with observed word features from both the source language domain and the target language domain, while a non-parallel document in a single language is represented as a row vector filled with observed word features only from its own language and has missing values for the word features from the other language. We then learn the unobserved feature entries of this sparse matrix by formulating a matrix completion problem and solving it using a projected gradient descent optimization algorithm. By doing so, we expect to automatically capture important and robust low-rank information based on the word co-occurrence patterns expressed both within each language and across languages. 
Next we perform latent semantic indexing\nover the recovered document-term matrix and induce a low-dimensional dense cross-lingual repre-\nsentation of the documents, on which standard monolingual classi\ufb01ers can be applied. To evaluate\nthe effectiveness of the proposed learning method, we conduct a set of experiments with cross lan-\nguage sentiment classi\ufb01cation tasks on multilingual Amazon product reviews. The empirical results\nshow that the proposed method signi\ufb01cantly outperforms a number of cross language learning meth-\nods. Moreover, the proposed method produces good performance even with a very small number of\nunlabeled parallel bilingual documents.\n\n2 Related Work\n\nMany works in the literature address cross language text classi\ufb01cation by \ufb01rst translating documents\nfrom one language domain into the other one via machine translation tools or bilingual lexicons\nand then applying standard monolingual classi\ufb01cation algorithms [18, 23], domain adaptation tech-\nniques [17, 9, 21], or multi-view learning methods [22, 2, 1, 13, 12]. For example, [17] proposed\nan expectation-maximization based self-training method, which \ufb01rst initializes a monolingual clas-\nsi\ufb01er in the target language with the translated labeled documents from the source language and\nthen retrains the model by adding unlabeled documents from the target language with automatically\npredicted labels.\n[21] proposed an instance and feature bi-weighting method by \ufb01rst translating\ndocuments from one language domain to the other one and then simultaneously re-weighting in-\nstances and features to address the distribution difference across domains. [22] proposed to use\nthe co-training method for cross language sentiment classi\ufb01cation on parallel corpora.\n[2] pro-\nposed a multi-view majority voting method to categorize documents in multiple views produced\nfrom machine translation tools. 
[1] proposed a multi-view co-classi\ufb01cation method for multilingual\ndocument categorization, which minimizes both the training loss for each view and the prediction\ndisagreement between different language views. Our proposed approach in this paper shares similar-\nity with these approaches in exploiting parallel data produced by machine translation tools. But our\napproach only requires a small set of unlabeled parallel documents, while these approaches require\nat least translating all the training documents in one language domain.\n\nAnother important group of cross language text classi\ufb01cation methods in the literature con-\nstruct cross-lingual representations by exploiting bilingual word pairs [16, 7], parallel corpora\n[10, 20, 15, 19, 8], and other resources [3, 14].\n[16] proposed a cross-language structural cor-\nrespondence learning method to induce language-independent features by using pivot word pairs\nproduced by word translation oracles.\n[10] proposed a cross-language latent semantic indexing\n(CL-LSI) method to induce cross-lingual representations by performing LSI over a dual-language\ndocument-term matrix, where each dual-language document contains its original words and the\ncorresponding translation text. [20] proposed a cross-lingual kernel canonical correlation analysis\n(CL-KCCA) method. It \ufb01rst learns two projections (one for each language) by conducting kernel\ncanonical correlation analysis over a paired bilingual corpus and then uses them to project doc-\numents from language-speci\ufb01c feature spaces to the shared multilingual semantic feature space.\n[15] employed cross-lingual oriented principal component analysis (CL-OPCA) over concatenated\nparallel documents to learn a multilingual projection by simultaneously minimizing the projected\ndistance between parallel documents and maximizing the projected covariance of documents across\nlanguages. 
Some other work uses multilingual topic models such as the coupled probabilistic latent\nsemantic analysis and the bilingual latent Dirichlet allocation to extract latent cross-lingual topics\nas interlingual representations [19]. [14] proposed to use language-speci\ufb01c part-of-speech (POS)\ntaggers to tag each word and then map those language-speci\ufb01c POS tags to twelve universal POS\ntags as interlingual features for cross language \ufb01ne-grained genre classi\ufb01cation. Similar to the mul-\ntilingual semantic representation learning approaches such as CL-LSI, CL-KCCA and CL-OPCA,\nour two-step learning method exploits parallel documents. But different from these methods which\napply operations such as LSI, KCCA, and OPCA directly on the original concatenated document-\n\n2\n\n\fterm matrix, our method \ufb01rst \ufb01lls the missing entries of the document-term matrix using matrix\ncompletion, and then performs LSI over the recovered low-rank matrix.\n\n3 Approach\n\nIn this section, we present the proposed two-step learning method for learning cross-lingual docu-\nment representations. We assume a subset of unlabeled parallel documents from the two languages\nare given, which can be used to capture the co-occurrence of terms across languages and build con-\nnections between the vocabulary sets of the two languages. We \ufb01rst construct a uni\ufb01ed document-\nterm matrix for all documents from the auxiliary source language domain and the target language\ndomain, whose columns correspond to the word features from the uni\ufb01ed vocabulary set of the two\nlanguages. In this matrix, each pair of parallel documents is represented as a fully observed row\nvector, and each non-parallel document is represented as a partially observed row vector where only\nentries corresponding to words in its own language vocabulary are observed. 
Instead of learning a low-dimensional cross-lingual document representation from this matrix directly, we perform a two-step learning procedure: first we learn a low-rank document-term matrix by automatically filling in the missing entries via matrix completion; next we produce cross-lingual representations by applying the latent semantic indexing method over the learned matrix.

Let M^0 ∈ R^{t×d} be the unified document-term matrix, which is partially filled with observed nonnegative feature values, where t is the number of documents and d is the size of the unified vocabulary. We use Ω to denote the index set of the observed features in M^0, such that (i, j) ∈ Ω if and only if M^0_ij is observed; and we use Ω̂ to denote the index set of the missing features in M^0, such that (i, j) ∈ Ω̂ if and only if M^0_ij is unobserved. For the i-th document in the data set from one language, if the document does not have a parallel translation in the other language, then all the features in row M^0_i: corresponding to the words in the vocabulary of the other language are viewed as missing features.

3.1 Matrix Completion

Note that the document-term matrix M^0 has a large fraction of missing features, and the only bridge between the vocabulary sets of the two languages is the small set of parallel bilingual documents. Learning from this partially observed matrix directly by treating missing features as zeros would certainly lose a lot of information. On the other hand, a fully observed document-term matrix is naturally low-rank and sparse, as the vocabulary set is typically very large and each document only contains a small fraction of the words in the vocabulary. Thus we propose to automatically fill in the missing entries of M^0 based on the feature co-occurrence information expressed in the observed data, by conducting matrix completion to recover a low-rank and sparse matrix. 
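As a concrete illustration, the unified matrix setup described above can be sketched in NumPy. This is a minimal, hypothetical helper (not from the paper): it assumes bag-of-word count or TF-IDF vectors have already been extracted for each document, and it returns M^0 together with a boolean mask marking the observed index set Ω (missing entries are left as zeros for later completion).

```python
import numpy as np

def build_unified_matrix(src_docs, tgt_docs, parallel_pairs, d_src, d_tgt):
    """Assemble the unified document-term matrix M0 and its observed mask.

    src_docs / tgt_docs: monolingual term-weight vectors (length d_src / d_tgt).
    parallel_pairs: (src_vec, tgt_vec) tuples for parallel bilingual documents.
    Returns M0 (missing entries zero) and a boolean mask Y with
    Y[i, j] = True iff (i, j) is in the observed set Omega.
    """
    t = len(src_docs) + len(tgt_docs) + len(parallel_pairs)
    d = d_src + d_tgt
    M0 = np.zeros((t, d))
    Y = np.zeros((t, d), dtype=bool)
    row = 0
    for v in src_docs:                    # source-only doc: target half missing
        M0[row, :d_src] = v
        Y[row, :d_src] = True
        row += 1
    for v in tgt_docs:                    # target-only doc: source half missing
        M0[row, d_src:] = v
        Y[row, d_src:] = True
        row += 1
    for vs, vt in parallel_pairs:         # parallel pair: fully observed row
        M0[row, :d_src] = vs
        M0[row, d_src:] = vt
        Y[row, :] = True
        row += 1
    return M0, Y
```

The mask Y is exactly the indicator of Ω used by the completion step below; its complement marks Ω̂.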
Specifically, we formulate matrix completion as the following optimization problem:

min_M  rank(M) + μ‖M‖_1   subject to  M_ij = M^0_ij, ∀(i, j) ∈ Ω;  M_ij ≥ 0, ∀(i, j) ∈ Ω̂,   (1)

where ‖·‖_1 denotes the ℓ1 norm and is used to enforce sparsity. The rank function, however, is non-convex and difficult to optimize. We can relax it to its convex envelope, the convex trace norm ‖M‖_*. Moreover, instead of using the equality constraints in (1), we propose to minimize a regularization loss function, c(M_ij, M^0_ij), to cope with observation noise for all the observed feature entries. Meanwhile, we also add regularization terms over the missing features, c(M_ij, 0), ∀(i, j) ∈ Ω̂, to avoid overfitting. In particular, we use the least squared loss function c(x, y) = (1/2)(x − y)². Hence we obtain the following relaxed convex optimization problem for matrix completion:

min_M  γ‖M‖_* + μ‖M‖_1 + Σ_{(i,j)∈Ω} c(M_ij, M^0_ij) + ρ Σ_{(i,j)∈Ω̂} c(M_ij, 0)   subject to  M ≥ 0.   (2)

With the nonnegativity constraints M ≥ 0, the non-smooth ℓ1 norm regularizer in the objective function of (2) is equivalent to the smooth linear function ‖M‖_1 = Σ_{ij} M_ij. Nevertheless, with the non-smooth trace norm ‖M‖_*, the optimization problem (2) remains convex but non-smooth. Moreover, the matrix M in cross-language learning tasks is typically very large, and thus a scalable optimization algorithm needs to be developed to conduct efficient optimization. 
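For reference, the relaxed objective in (2) can be evaluated directly, for example to monitor convergence. The sketch below is an illustrative helper rather than part of the paper's method; `Y` is a boolean mask of the observed set Ω, and its complement marks Ω̂.

```python
import numpy as np

def objective(M, M0, Y, gamma, mu, rho):
    """Value of the relaxed matrix-completion objective in problem (2).

    Y is a boolean mask of the observed entries (Omega); ~Y marks the
    missing entries (Omega-hat). Illustrative helper, not from the paper.
    """
    trace_norm = np.linalg.svd(M, compute_uv=False).sum()  # ||M||_*
    l1_norm = np.abs(M).sum()                              # ||M||_1
    obs_loss = 0.5 * ((M - M0)[Y] ** 2).sum()              # sum over Omega of c(M_ij, M0_ij)
    miss_reg = 0.5 * (M[~Y] ** 2).sum()                    # sum over Omega-hat of c(M_ij, 0)
    return gamma * trace_norm + mu * l1_norm + obs_loss + rho * miss_reg
```

Because the ℓ1 term and both loss sums are elementwise, only the trace norm requires an SVD, which dominates the cost for large M.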
In the next section, we present a scalable projected gradient descent algorithm to solve this minimization problem.

Algorithm 1 Projected Gradient Descent Algorithm
Input: M^0, γ, μ, ρ ≤ 1, 0 < τ < min(2, 2/ρ).
Initialize M as the nonnegative projection of the rank-1 approximation of M^0.
while not converged do
  1. gradient descent: M = M − τ∇g(M).
  2. shrink: M = S_{τγ}(M).
  3. project onto the feasible set: M = max(M, 0).
end while

3.2 Latent Semantic Indexing

After solving (2) for an optimal low-rank solution M*, we can use each row of the sparse matrix M* as a vector representation of each document in the concatenated vocabulary space of the two languages. However, exploiting such a matrix representation directly for cross language text classification lacks sufficient capacity for handling feature noise and sparseness, as each document is represented using a very small set of words from the vocabulary. We thus propose to apply a latent semantic indexing (LSI) method on M* to produce a low-dimensional semantic representation of the data. LSI uses singular value decomposition to discover the important associative relationships of word features [10] and to create a reduced-dimension feature space. Specifically, we first perform singular value decomposition over M*, M* = U S V^⊤, and then obtain a low-dimensional representation matrix Z via the projection Z = M* V_k, where V_k contains the top k right singular vectors of M*. Cross-language text classification can then be conducted over Z using monolingual classifiers.

4 Optimization Algorithm

4.1 Projected Gradient Descent Algorithm

A number of algorithms have been developed to solve matrix completion problems in the literature [4, 11]. We use a projected gradient descent algorithm to solve the non-smooth convex optimization problem in (2). 
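A minimal NumPy sketch of Algorithm 1, with the LSI projection of Section 3.2 appended, might look as follows. The gradient and shrinkage steps it uses are spelled out below; a fixed iteration budget stands in for a proper convergence test, and all function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def shrink(M, nu):
    """Shrinkage operator S_nu: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

def rank1_approx(A):
    """Best rank-1 approximation of A, from its top singular triplet."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

def complete_matrix(M0, Y, gamma=0.1, mu=1e-6, rho=1e-4, tau=1.0, n_iter=100):
    """Projected gradient descent in the spirit of Algorithm 1.

    M0: partially observed matrix (missing entries set to zero);
    Y: boolean mask of observed entries. Parameter defaults follow the
    values used in the experimental section of the paper.
    """
    Yf = Y.astype(float)
    M = np.maximum(rank1_approx(M0 * Yf), 0.0)            # nonnegative rank-1 init
    for _ in range(n_iter):
        grad = mu + (M - M0) * Yf + rho * M * (1.0 - Yf)  # gradient of g
        M = M - tau * grad                                # 1. gradient descent
        M = shrink(M, tau * gamma)                        # 2. shrink
        M = np.maximum(M, 0.0)                            # 3. project onto M >= 0
    return M

def lsi_embed(M_star, k):
    """LSI step of Section 3.2: Z = M* V_k with the top-k right singular vectors."""
    U, s, Vt = np.linalg.svd(M_star, full_matrices=False)
    return M_star @ Vt[:k].T
```

On the completed matrix, the rows of `lsi_embed(M_star, k)` give the k-dimensional cross-lingual document representations on which a standard monolingual classifier (e.g. a linear SVM) can then be trained.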
This algorithm treats the objective function f(M) in (2) as the composition of a non-smooth term and a convex smooth term, f(M) = γ‖M‖_* + g(M), where

g(M) = μ‖M‖_1 + Σ_{(i,j)∈Ω} c(M_ij, M^0_ij) + ρ Σ_{(i,j)∈Ω̂} c(M_ij, 0).   (3)

It first initializes M as the nonnegative projection of the rank-1 approximation of M^0, and then iteratively updates M using a projected gradient descent procedure. In each iteration, we perform three steps to update M. First, we take a gradient descent step M = M − τ∇g(M) with stepsize τ and gradient

∇g(M) = μE + (M − M^0) ∘ Y + ρ M ∘ Ŷ,   (4)

where E is the t × d matrix of all 1s; Y and Ŷ are t × d indicator matrices such that Y_ij = 1 if and only if (i, j) ∈ Ω and Ŷ = E − Y; and ∘ denotes the Hadamard product. Next we perform a shrinkage operation M = S_ν(M) over the resulting matrix to reduce its rank. The shrinkage operator is based on the singular value decomposition:

S_ν(M) = U Σ^(ν) V^⊤,  where M = U Σ V^⊤ and Σ^(ν) = max(Σ − ν, 0),   (5)

with ν = τγ. Finally, we project the resulting matrix onto the nonnegative feasible set by M = max(M, 0). This update procedure provably converges to an optimal solution. The overall algorithm is given in Algorithm 1.

4.2 Convergence Analysis

Let h(·) = I(·) − τ∇g(·) be the gradient descent operator used in the gradient descent step, let P_C(·) = max(·, 0) be the projection operator, and let S_ν(·) be the shrinkage operator. Below we prove the convergence of the projected gradient descent algorithm.

Lemma 1. Let E be the t × d matrix of all 1s, and Q = E − τ(Y + ρŶ). For τ ∈ (0, min(2, 2/ρ)), the operator h(·) is non-expansive, i.e., for any M, M′ ∈ R^{t×d}, ‖h(M) − h(M′)‖_F ≤ ‖M − M′‖_F. Moreover, ‖h(M) − h(M′)‖_F = ‖M − M′‖_F if and only if h(M) − h(M′) = M − M′.

Proof. Note that for τ ∈ (0, min(2, 2/ρ)), we have −1 < Q_ij < 1 for all (i, j). Then, following the gradient definition in (4), we have

‖h(M) − h(M′)‖_F = ‖(M − M′) ∘ Q‖_F = (Σ_{ij} (M_ij − M′_ij)² Q²_ij)^{1/2} ≤ ‖M − M′‖_F.

The inequality becomes an equality if and only if h(M) − h(M′) = M − M′.

Lemma 2. [11, Lemma 1] The shrinkage operator S_ν(·) is non-expansive, i.e., for any M, M′ ∈ R^{t×d}, ‖S_ν(M) − S_ν(M′)‖_F ≤ ‖M − M′‖_F. Moreover, ‖S_ν(M) − S_ν(M′)‖_F = ‖M − M′‖_F if and only if S_ν(M) − S_ν(M′) = M − M′.

Lemma 3. The projection operator P_C(·) is non-expansive, i.e., ‖P_C(M) − P_C(M′)‖_F ≤ ‖M − M′‖_F. Moreover, ‖P_C(M) − P_C(M′)‖_F = ‖M − M′‖_F if and only if P_C(M) − P_C(M′) = M − M′.

Proof. For any given entry index (i, j), there are four cases:

• Case 1: M_ij ≥ 0, M′_ij ≥ 0. We have (P_C(M_ij) − P_C(M′_ij))² = (M_ij − M′_ij)².
• Case 2: M_ij ≥ 0, M′_ij < 0. We have (P_C(M_ij) − P_C(M′_ij))² = M²_ij < (M_ij − M′_ij)².
• Case 3: M_ij < 0, M′_ij ≥ 0. We have (P_C(M_ij) − P_C(M′_ij))² = M′²_ij < (M_ij − M′_ij)².
• Case 4: M_ij < 0, M′_ij < 0. We have (P_C(M_ij) − P_C(M′_ij))² = 0 ≤ (M_ij − M′_ij)².

Therefore,

‖P_C(M) − P_C(M′)‖_F = (Σ_{ij} (P_C(M_ij) − P_C(M′_ij))²)^{1/2} ≤ (Σ_{ij} (M_ij − M′_ij)²)^{1/2} = ‖M − M′‖_F,

and ‖P_C(M) − P_C(M′)‖_F = ‖M − M′‖_F if and only if P_C(M) − P_C(M′) = M − M′.

Theorem 1. The sequence {M^k} generated by the projected gradient descent iterations in Algorithm 1 with 0 < τ < min(2, 2/ρ) converges to M*, which is an optimal solution of (2).

Proof. Since h(·), S_ν(·) and P_C(·) are all non-expansive, the composite operator P_C(S_ν(h(·))) is non-expansive as well. The theorem can then be proved following [11, Theorem 4].

5 Experiments

In this section, we evaluate the proposed two-step learning method by conducting extensive cross language sentiment classification experiments on multilingual Amazon product reviews.

5.1 Experimental Setting

Dataset We used the multilingual Amazon product reviews dataset [16], which contains three categories (Books (B), DVD (D), Music (M)) of product reviews in four different languages (English (E), French (F), German (G), Japanese (J)). For each category of product reviews, there are 2000 positive and 2000 negative English reviews, and 1000 positive and 1000 negative reviews for each of the other three languages. In addition, there are another 2000 unlabeled parallel reviews between English and each of the other three languages. Each review is preprocessed into a unigram bag-of-word feature vector with TF-IDF values. We focused on cross-lingual learning between English and
We focused on cross-lingual learning between English and\nthe other three languages and constructed 18 cross language sentiment classi\ufb01cation tasks (EFB,\nFEB, EFD, FED, EFM, FEM, EGB, GEB, EGD, GED, EGM, GEM, EJB, JEB, EJD, JED, EJM,\nJEM), each for one combination of selected source language, target language and category. For\nexample, the task EFB uses English Books reviews as the source language data and uses French\nBooks reviews as the target language data.\n\n5\n\n\fTable 1: Average classi\ufb01cation accuracies (%) and standard deviations (%) over 10 runs for the 18\ncross language sentiment classi\ufb01cation tasks.\n\nTBOW\n\nTASK\n67.31\u00b10.96\nEFB\nFEB\n66.82\u00b10.43\n67.80\u00b10.94\nEFD\n66.15\u00b10.65\nFED\nEFM 67.84\u00b10.43\nFEM 66.08\u00b10.52\n67.23\u00b10.68\nEGB\nGEB\n67.16\u00b10.55\n66.79\u00b10.80\nEGD\n66.27\u00b10.69\nGED\nEGM 67.65\u00b10.45\nGEM 66.74\u00b10.55\n63.15\u00b10.69\nEJB\n66.85\u00b10.68\nJEB\nEJD\n65.47\u00b10.50\n66.42\u00b10.55\nJED\n67.62\u00b10.75\nEJM\nJEM\n66.51\u00b10.51\n\nCL-LSI\n\n79.56\u00b10.21\n76.66\u00b10.34\n77.82\u00b10.66\n76.61\u00b10.25\n75.39\u00b10.40\n76.33\u00b10.27\n77.59\u00b10.21\n77.64\u00b10.19\n79.22\u00b10.22\n77.78\u00b10.26\n73.81\u00b10.49\n77.28\u00b10.51\n72.68\u00b10.35\n74.63\u00b10.42\n72.55\u00b10.28\n75.18\u00b10.27\n73.44\u00b10.50\n72.38\u00b10.50\n\nCL-KCCA\n77.56\u00b10.14\n73.45\u00b10.13\n78.19\u00b10.09\n74.93\u00b10.07\n78.24\u00b10.12\n73.38\u00b10.12\n79.14\u00b10.12\n74.15\u00b10.09\n76.73\u00b10.10\n74.26\u00b10.08\n79.18\u00b10.05\n72.31\u00b10.08\n69.46\u00b10.11\n67.99\u00b10.18\n74.79\u00b10.11\n72.44\u00b10.16\n73.54\u00b10.11\n70.00\u00b10.18\n\nCL-OPCA\n76.55\u00b10.31\n74.43\u00b10.53\n70.54\u00b10.41\n72.49\u00b10.47\n73.69\u00b10.49\n73.46\u00b10.50\n74.72\u00b10.54\n74.78\u00b10.39\n74.59\u00b10.66\n74.83\u00b10.45\n74.45\u00b10.59\n74.15\u00b10.42\n71.41\u00b10.48\n73.41\u00b10.41\n71.84\u00b10.41\n75.42\u00b10.52\n74.96\u00b10.86\n72.64\u00b10.66\n\
nTSL\n\n81.92\u00b10.20\n79.51\u00b10.21\n81.97\u00b10.33\n78.09\u00b10.32\n79.30\u00b10.30\n78.53\u00b10.46\n79.22\u00b10.31\n78.65\u00b10.23\n81.34\u00b10.24\n79.34\u00b10.23\n79.39\u00b10.39\n79.02\u00b10.34\n72.57\u00b10.52\n77.17\u00b10.36\n76.60\u00b10.49\n79.01\u00b10.50\n76.21\u00b10.40\n77.15\u00b10.58\n\nApproaches We compared the proposed two-step learning (TSL) method with the following four\nmethods: TBOW, CL-LSI, CL-OPCA and CL-KCCA. The Target Bag-Of-Word (TBOW) baseline\nmethod trains a supervised monolingual classi\ufb01er in the original bag-of-word feature space with the\nlabeled training data from the target language domain. The Cross-Lingual Latent Semantic Indexing\n(CL-LSI) method [10] and the Cross-Lingual Oriented Principal Component Analysis (CL-OPCA)\nmethod [15] \ufb01rst learn cross-lingual representations with all data from both language domains by\nperforming LSI or OPCA and then train a monolingual classi\ufb01er with labeled data from both lan-\nguage domains in the induced low-dimensional feature space. The Cross-Lingual Kernel Canonical\nComponent Analysis (CL-KCCA) method [20] \ufb01rst induces two language projections by using un-\nlabeled parallel data and then trains a monolingual classi\ufb01er on labeled data from both language\ndomains in the projected low-dimensional space. For all experiments, we used linear support vector\nmachine (SVM) as the monolingual classi\ufb01cation model. For implementation, we used the libsvm\npackage [5] with default parameter setting.\n\n5.2 Classi\ufb01cation Accuracy\n\nFor each of the 18 cross language sentiment classi\ufb01cation tasks, we used all documents from the two\nlanguages and the additional 2000 unlabeled parallel documents for representation learning. 
Then we used all documents in the auxiliary source language and randomly chose 100 documents from the target language as labeled data for classification model training, and used the remaining data in the target language as test data. For the proposed method, TSL, we set μ = 10^-6 and τ = 1, chose the γ value from {0.01, 0.1, 1, 10}, chose the ρ value from {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1}, and chose the dimension k from {20, 50, 100, 200, 500}. We used the first task, EFB, to perform model parameter selection by running the algorithm 3 times based on random selections of 100 labeled target training documents. This gave us the following parameter setting: γ = 0.1, ρ = 10^-4, k = 50. We used the same procedure to select the dimensionality of the learned semantic representations for the other three approaches, CL-LSI, CL-OPCA and CL-KCCA, which produced k = 50 for CL-LSI and CL-OPCA, and k = 100 for CL-KCCA. We then used the selected model parameters for all 18 tasks and ran each experiment 10 times based on random selections of 100 labeled target documents. The average classification accuracies and standard deviations are reported in Table 1.

We can see that the proposed two-step learning method, TSL, outperforms all four comparison methods in general. 
The target baseline TBOW performs poorly on all the 18 tasks, which implies that 100 labeled target training documents are far from enough to obtain a robust sentiment classifier in the target language domain. All the other three cross-lingual representation learning methods, CL-LSI, CL-KCCA and CL-OPCA, consistently outperform this baseline method across all the 18 tasks, which demonstrates that the labeled training data from the source language domain is useful for classifying the target language data under a unified data representation. Nevertheless, the improvements achieved by these three methods over the baseline are much smaller than those of the proposed TSL method. Across all the 18 tasks, TSL increases the average test accuracy over the baseline TBOW method by at least 8.59% (on the EJM task) and by up to 14.61% (on the EFB task). Moreover, TSL also outperforms both CL-KCCA and CL-OPCA across all the 18 tasks, outperforms CL-LSI on 17 out of the 18 tasks, and achieves comparable performance with CL-LSI on the remaining task (EJB). All these results demonstrate the efficacy and robustness of the proposed two-step representation learning method for cross language text classification.

Figure 1: Average test classification accuracies (%) and standard deviations (%) over 10 runs with different numbers of unlabeled parallel documents for adapting a classification system from English to French, German and Japanese. (Nine panels: EFB, EFD, EFM, EGB, EGD, EGM, EJB, EJD, EJM; each plots the accuracy of CL-LSI, CL-KCCA, CL-OPCA and TSL against the number of unlabeled parallel documents.)

5.3 Impact of the Size of Unlabeled Parallel Data

All four cross-lingual adaptation learning methods, CL-LSI, CL-KCCA, CL-OPCA and TSL, exploit unlabeled parallel reviews for learning cross-lingual representations. Next we investigated the performance of these methods with respect to different numbers of unlabeled parallel reviews. We tested a set of different numbers, np ∈ {200, 500, 1000, 2000}. For each number np in the set, we randomly chose np parallel documents from all the 2000 unlabeled parallel reviews and conducted experiments using the same setting as in the previous experiments. Each experiment was repeated 10 times based on random selections of labeled target training data. The average test classification accuracies and standard deviations are plotted in Figure 1 and Figure 2. 
Figure 1 presents the results for the 9 cross-lingual classification tasks that adapt classification systems from English to French, German and Japanese, while Figure 2 presents the results for the other 9 cross-lingual classification tasks that adapt classification systems from French, German and Japanese to English.

[Figure 2 appears here: nine panels (FEB, FED, FEM, GEB, GED, GEM, JEB, JED, JEM), each plotting average test accuracy against the number of unlabeled parallel documents (200 to 2000) for CL-LSI, CL-KCCA, CL-OPCA and TSL.]

Figure 2: Average test classification accuracies and standard deviations over 10 runs with different numbers of unlabeled parallel documents for adapting a classification system from French, German and Japanese to English.

From these results, we can see that the performance of all four methods generally improves as the amount of unlabeled parallel data increases. The proposed method, TSL, nevertheless outperforms the other three cross-lingual adaptation learning methods across the range of np values on 16 out of the 18 cross language sentiment classification tasks. On the remaining two tasks, EFM and EGM, it performs comparably to the CL-KCCA method while significantly outperforming the other two methods. Moreover, for the 9 tasks that adapt from English to the other three languages, the TSL method achieves strong performance with only 200 unlabeled parallel documents, whereas the performance of the other three methods degrades significantly as the number of unlabeled parallel documents decreases. These results demonstrate the robustness and efficacy of the proposed method compared to the other methods.

6 Conclusion

In this paper, we developed a novel two-step method to learn cross-lingual semantic data representations for cross language text classification by exploiting unlabeled parallel bilingual documents. We first formulated a matrix completion problem to infer the unobserved feature values of the concatenated document-term matrix over the unified vocabulary of the source and target languages. We then performed latent semantic indexing on the completed low-rank document-term matrix to produce a low-dimensional cross-lingual representation of the documents. Monolingual classifiers were then used to conduct cross language text classification based on the learned document representation.
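As a concrete illustration of these two steps, the sketch below completes a small concatenated document-term matrix by projected gradient descent (a gradient step fitting the observed entries, followed by projection onto low-rank matrices via truncated SVD), and then applies latent semantic indexing to the completed matrix. It is a minimal stand-in for the formulation in the paper: the toy data, the fixed rank, and the step size are illustrative assumptions, not the actual solver.

```python
import numpy as np

def complete_matrix(M, mask, rank, lr=1.0, iters=200):
    """Projected gradient descent for matrix completion: take a gradient
    step that fits the observed entries (mask == True), then project onto
    the set of rank-`rank` matrices via truncated SVD."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        G = mask * (X - M)                 # gradient of 0.5*||mask*(X - M)||^2
        X = X - lr * G
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # low-rank projection
    return X

def lsi(X, k):
    """Latent semantic indexing: k-dimensional document representation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# toy parallel document-term matrix: rows are documents over the unified
# (source + target) vocabulary; unobserved cross-language entries are masked
rng = np.random.default_rng(0)
true = rng.random((40, 2)) @ rng.random((2, 30))   # low-rank ground truth
mask = rng.random(true.shape) < 0.6                # observed entries
completed = complete_matrix(true, mask, rank=2)
docs_k = lsi(completed, k=2)                       # cross-lingual features
```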
To investigate the effectiveness of the proposed learning method, we conducted extensive experiments with cross language sentiment classification tasks on Amazon product reviews. Our experimental results demonstrated that the proposed two-step learning method significantly outperforms the four comparison methods. Moreover, the proposed approach needs far fewer parallel documents to produce a good cross language text classification system.

References

[1] M. Amini and C. Goutte. A co-classification approach to learning from multilingual corpora. Machine Learning, 79:105–121, 2010.

[2] M. Amini, N. Usunier, and C. Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In NIPS, 2009.

[3] B. A.R., A. Joshi, and P. Bhattacharyya. Cross-lingual sentiment analysis for Indian languages using linked wordnets. In Proc. of COLING, 2012.

[4] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[5] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[6] W. Dai, Y. Chen, G. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across different feature spaces. In NIPS, 2008.

[7] A. Gliozzo. Exploiting comparable corpora and bilingual dictionaries for cross-language text categorization. In Proc. of ICCL-ACL, 2006.

[8] J. Jagarlamudi, R. Udupa, H. Daumé III, and A. Bhole. Improving bilingual projections via sparse covariance matrices. In Proc. of EMNLP, 2011.

[9] X. Ling, G. Xue, W. Dai, Y. Jiang, Q. Yang, and Y. Yu. Can Chinese web pages be classified with English data source? In Proc. of WWW, 2008.

[10] M. Littman, S. Dumais, and T. Landauer.
Automatic cross-language information retrieval using latent semantic indexing. In Cross-Language Information Retrieval, chapter 5, pages 51–62. Kluwer Academic Publishers, 1998.

[11] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2), 2011.

[12] X. Meng, F. Wei, X. Liu, M. Zhou, G. Xu, and H. Wang. Cross-lingual mixture model for sentiment classification. In Proc. of ACL, 2012.

[13] J. Pan, G. Xue, Y. Yu, and Y. Wang. Cross-lingual sentiment classification via bi-view non-negative matrix tri-factorization. In Proc. of PAKDD, 2011.

[14] P. Petrenz and B. Webber. Label propagation for fine-grained cross-lingual genre classification. In Proc. of the NIPS xLiTe workshop, 2012.

[15] J. Platt, K. Toutanova, and W. Yih. Translingual document representations from discriminative projections. In Proc. of EMNLP, 2010.

[16] P. Prettenhofer and B. Stein. Cross-language text classification using structural correspondence learning. In Proc. of ACL, 2010.

[17] L. Rigutini and M. Maggini. An EM based training algorithm for cross-language text categorization. In Proc. of the Web Intelligence Conference, 2005.

[18] J. Shanahan, G. Grefenstette, Y. Qu, and D. Evans. Mining multilingual opinions through classification and translation. In AAAI Spring Symp. on Explor. Attit. and Affect in Text, 2004.

[19] W. Smet, J. Tang, and M. Moens. Knowledge transfer across multilingual corpora via latent topics. In Proc. of PAKDD, 2011.

[20] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In NIPS, 2002.

[21] C. Wan, R. Pan, and J. Li. Bi-weighting domain adaptation for cross-language text classification. In Proc. of IJCAI, 2011.

[22] X. Wan.
Co-training for cross-lingual sentiment classification. In Proc. of ACL-IJCNLP, 2009.

[23] K. Wu, X. Wang, and B. Lu. Cross language text categorization using a bilingual lexicon. In Proc. of IJCNLP, 2008.