{"title": "Cross-Domain Matching for Bag-of-Words Data via Kernel Embeddings of Latent Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1405, "page_last": 1413, "abstract": "We propose a kernel-based method for finding matching between instances across different domains, such as multilingual documents and images with annotations. Each instance is assumed to be represented as a multiset of features, e.g., a bag-of-words representation for documents. The major difficulty in finding cross-domain relationships is that the similarity between instances in different domains cannot be directly measured. To overcome this difficulty, the proposed method embeds all the features of different domains in a shared latent space, and regards each instance as a distribution of its own features in the shared latent space. To represent the distributions efficiently and nonparametrically, we employ the framework of the kernel embeddings of distributions. The embedding is estimated so as to minimize the difference between distributions of paired instances while keeping unpaired instances apart. 
In our experiments, we show that the proposed method can achieve high performance on finding correspondence between multi-lingual Wikipedia articles, between documents and tags, and between images and tags.", "full_text": "Cross-Domain Matching for Bag-of-Words Data\nvia Kernel Embeddings of Latent Distributions\n\nNara Institute of Science and Technology\n\nNTT Communication Science Laboratories\n\nYuya Yoshikawa(cid:3)\n\nNara, 630-0192, Japan\n\nTomoharu Iwata\n\nKyoto, 619-0237, Japan\n\nyoshikawa.yuya.yl9@is.naist.jp\n\niwata.tomoharu@lab.ntt.co.jp\n\nHiroshi Sawada\n\nNTT Service Evolution Laboratories\n\nKanagawa, 239-0847, Japan\n\nTakeshi Yamada\n\nNTT Communication Science Laboratories\n\nKyoto, 619-0237, Japan\n\nsawada.hiroshi@lab.ntt.co.jp\n\nyamada.tak@lab.ntt.co.jp\n\nAbstract\n\nWe propose a kernel-based method for \ufb01nding matching between instances across\ndifferent domains, such as multilingual documents and images with annotations.\nEach instance is assumed to be represented as a multiset of features, e.g., a bag-of-\nwords representation for documents. The major dif\ufb01culty in \ufb01nding cross-domain\nrelationships is that the similarity between instances in different domains cannot\nbe directly measured. To overcome this dif\ufb01culty, the proposed method embeds\nall the features of different domains in a shared latent space, and regards each\ninstance as a distribution of its own features in the shared latent space. To repre-\nsent the distributions ef\ufb01ciently and nonparametrically, we employ the framework\nof the kernel embeddings of distributions. The embedding is estimated so as to\nminimize the difference between distributions of paired instances while keeping\nunpaired instances apart. 
In our experiments, we show that the proposed method can achieve high performance on finding correspondence between multi-lingual Wikipedia articles, between documents and tags, and between images and tags.

1 Introduction

The discovery of matched instances in different domains is an important task, which appears in natural language processing, information retrieval and data mining tasks such as finding the alignment of cross-lingual sentences [1], attaching tags to images [2] or text documents [3], and matching user identifications in different databases [4].
When given an instance in a source domain, our goal is to find the instance in a target domain that is the most closely related to the given instance. In this paper, we focus on a supervised setting, where correspondence information between some instances in different domains is given. To find matching in a single domain, e.g., to find documents relevant to an input document, a similarity (or distance) measure between instances can be used. On the other hand, when trying to find matching between instances in different domains, we cannot directly measure the distances, since the instances consist of different types of features. For example, when matching documents in different languages, since the documents have different vocabularies we cannot directly measure the similarities between documents across different languages without dictionaries.

* The author moved to Software Technology and Artificial Intelligence Research Laboratory (STAIR Lab) at Chiba Institute of Technology, Japan.

Figure 1: An example of the proposed method used on a multilingual document matching task. Correspondences between instances in source (English) and target (Japanese) domains are observed. 
The proposed method assumes that each feature (vocabulary term) has a latent vector in a shared latent space, and each instance is represented as a distribution of the latent vectors of the features associated with the instance. Then, the distribution is mapped as an element in a reproducing kernel Hilbert space (RKHS) based on the kernel embeddings of distributions. The latent vectors are estimated so that paired instances are embedded closer together in the RKHS.

One solution is to map instances in both the source and target domains into a shared latent space. One such method is canonical correlation analysis (CCA) [5], which maps instances into a latent space by linear projection so as to maximize the correlation between paired instances in the latent space. However, in practice, CCA cannot capture non-linear relationships due to its linearity. To find non-linear correspondence, kernel CCA [6] can be used. It has been reported that kernel CCA performs well on document/sentence alignment between different languages [7, 8], on searching for images from text queries [9], and on matching 2D-3D face images [10]. Note that the performance of kernel CCA depends on how appropriately we define the kernel function for measuring the similarity between instances within a domain. Many kernels, such as linear, polynomial and Gaussian kernels, cannot consider the occurrence of different but semantically similar words in two instances, because these kernels use the inner product between the feature vectors representing the instances. For example, the words 'PC' and 'Computer' are different but have the same meaning. Nevertheless, the kernel value between an instance consisting only of 'PC' and an instance consisting only of 'Computer' is equal to zero with linear and polynomial kernels. 
Even if a Gaussian kernel is used, the kernel value is determined only by the vector lengths of the instances.
In this paper, we propose a kernel-based cross-domain matching method that can overcome this problem of kernel CCA. Figure 1 shows an example of the proposed method. The proposed method assumes that each feature in the source and target domains is associated with a latent vector in a shared latent space. Since all the features are mapped into the latent space, the proposed method can measure the similarity between features in different domains. Then, each instance is represented as a distribution of the latent vectors of the features that are contained in the instance. To represent the distributions efficiently and nonparametrically, we employ the framework of the kernel embeddings of distributions, which measures the difference between distributions in a reproducing kernel Hilbert space (RKHS) without the need to define parametric distributions. The latent vectors are estimated by minimizing the differences between the distributions of paired instances while keeping unpaired instances apart. The proposed method can discover unseen matching in test data by using the distributions of the estimated latent vectors. We explain matching between two domains below; however, the proposed method can be straightforwardly extended to matching between three or more domains by regarding one of the domains as a pivot domain.
In our experiments, we demonstrate the effectiveness of our proposed method in tasks that involve finding the correspondence between multi-lingual Wikipedia articles, between documents and tags, and between images and tags, by comparison with existing linear and non-linear matching methods.

2 Related Work

As described above, canonical correlation analysis (CCA) and kernel CCA have been successfully used for finding various types of cross-domain matching. When we want to match cross-domain instances represented by bag-of-words, such as documents, bilingual topic models [1, 11] can also be used. One difference between the proposed method and these methods is that, since the proposed method represents each instance as a set of latent vectors of its own features, it can learn a more complex representation of an instance than these existing methods, which represent each instance as a single latent vector. Another difference is that the proposed method employs a discriminative approach, while kernel CCA and bilingual topic models employ generative ones.
To model cross-domain data, deep learning and neural network approaches have recently been proposed [12, 13]. Unlike such approaches, the proposed method performs non-linear matching without deciding the number of layers of the networks, which largely affects their performance.
A key technique of the proposed method is the kernel embeddings of distributions [14], which can represent a distribution as an element in an RKHS while preserving the moment information of the distribution, such as the mean, covariance and higher-order moments, without density estimation. The kernel embeddings of distributions have been successfully used for a statistical test of the independence of two sample sets [15], discriminative learning on distribution data [16], anomaly detection for group data [17], density estimation [18] and a three-variable interaction test [19]. Most previous studies of the kernel embeddings of distributions consider cases where the distributions are unobserved but samples generated from the distributions are observed. Additionally, each of the samples is represented as a dense vector. 
The kernel embedding technique cannot, however, be directly applied to observed multisets of features such as bag-of-words for documents, since each feature is represented as a one-hot vector whose dimensions are all zero except for the single dimension indicating that feature. In this study, we benefit from the kernel embeddings of distributions by representing each feature as a dense vector in a shared latent space.
The proposed method is inspired by the use of the kernel embeddings of distributions in bag-of-words data classification [20] and regression [21]. Their methods can be applied to single-domain data, and the latent vectors of features are used to measure the similarity between the features in a domain. Unlike these methods, the proposed method is used for the cross-domain matching of two different types of domain data, and the latent vectors are used for measuring the similarity between the features in different domains.

3 Kernel Embeddings of Distributions

In this section, we introduce the framework of the kernel embeddings of distributions. The kernel embeddings of distributions are used to embed any probability distribution P on space X into a reproducing kernel Hilbert space (RKHS) H_k specified by kernel k, and the distribution is represented as element m(P) in the RKHS. More precisely, when given distribution P, the kernel embedding of the distribution m(P) is defined as follows:

m(P) := E_{x \sim P}[k(\cdot, x)] = \int k(\cdot, x) \, dP \in H_k,   (1)

where kernel k is referred to as the embedding kernel. It is known that the kernel embedding m(P) preserves the properties of probability distribution P, such as the mean, covariance and higher-order moments, when characteristic kernels (e.g., the Gaussian RBF kernel) are used [22].
When a set of samples X = {x_l}_{l=1}^{n} is drawn from the distribution, by interpreting sample set X as the empirical distribution \hat{P} = (1/n) \sum_{l=1}^{n} \delta_{x_l}(\cdot), where \delta_x(\cdot) is the Dirac delta function at point x \in X, the empirical kernel embedding m(X) is given by

m(X) = (1/n) \sum_{l=1}^{n} k(\cdot, x_l),   (2)

which approximates m(P) with an error rate of ||m(X) - m(P)||_{H_k} = O_p(n^{-1/2}) [14]. Unlike kernel density estimation, the error rate of the kernel embeddings is independent of the dimensionality of the given distribution.

3.1 Measuring Difference between Distributions

By using the kernel embedding representation Eq. (2), we can measure the difference between two distributions. Given two sets of samples X = {x_l}_{l=1}^{n} and Y = {y_{l'}}_{l'=1}^{n'}, where x_l and y_{l'} belong to the same space, we can obtain their kernel embedding representations m(X) and m(Y). Then, the difference between m(X) and m(Y) is given by

D(X, Y) = ||m(X) - m(Y)||^2_{H_k}.   (3)

Intuitively, it reflects the difference in the moment information of the distributions. The difference is equivalent to the square of the maximum mean discrepancy (MMD), which is used for a statistical test of the independence of two distributions [15]. The difference can be calculated by expanding Eq. (3) as follows:

||m(X) - m(Y)||^2_{H_k} = \langle m(X), m(X) \rangle_{H_k} + \langle m(Y), m(Y) \rangle_{H_k} - 2 \langle m(X), m(Y) \rangle_{H_k},   (4)

where \langle \cdot, \cdot \rangle_{H_k} is the inner product in the RKHS. 
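To make Eqs. (2)-(4) concrete: the embeddings m(X) and m(Y) never need to be formed explicitly, because the squared difference reduces to three averages of kernel evaluations. The following sketch (not from the paper; a Gaussian RBF embedding kernel and synthetic data are assumed) computes the empirical squared MMD:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gram matrix K[l, m] = exp(-(gamma/2) * ||A[l] - B[m]||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Squared MMD ||m(X) - m(Y)||^2 via Eq. (4): three Gram-matrix means."""
    return (gaussian_kernel(X, X, gamma).mean()
            + gaussian_kernel(Y, Y, gamma).mean()
            - 2.0 * gaussian_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples from P
Y = rng.normal(0.0, 1.0, size=(400, 2))   # samples from the same distribution
Z = rng.normal(3.0, 1.0, size=(400, 2))   # samples from a shifted distribution

print(mmd2(X, Y))  # close to zero
print(mmd2(X, Z))  # clearly positive
```

Since the expansion in Eq. (4) only touches Gram matrices, the cost is one kernel evaluation per sample pair, and the estimate for two sample sets from the same distribution shrinks at the O_p(n^{-1/2}) rate quoted above.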
In particular, \langle m(X), m(Y) \rangle_{H_k} is given by

\langle m(X), m(Y) \rangle_{H_k} = \left\langle (1/n) \sum_{l=1}^{n} k(\cdot, x_l), \; (1/n') \sum_{l'=1}^{n'} k(\cdot, y_{l'}) \right\rangle_{H_k} = \frac{1}{n n'} \sum_{l=1}^{n} \sum_{l'=1}^{n'} k(x_l, y_{l'}).   (5)

\langle m(X), m(X) \rangle_{H_k} and \langle m(Y), m(Y) \rangle_{H_k} can also be calculated by Eq. (5).

4 Proposed Method

Suppose that we are given a training set consisting of N instance pairs O = {(d^s_i, d^t_i)}_{i=1}^{N}, where d^s_i is the ith instance in a source domain and d^t_i is the ith instance in a target domain. These instances d^s_i and d^t_i are represented as multisets of features included in source feature set F^s and target feature set F^t, respectively. This means that these instances are represented as bag-of-words (BoW). The goal of our task is to determine the unseen relationship between instances across source and target domains in test data. The number of instances in the source domain may differ from that in the target domain.

4.1 Kernel Embeddings of Distributions in a Shared Latent Space

As described in Section 1, the difficulty in finding cross-domain instance matching is that the similarity between instances across source and target domains cannot be directly measured. We have also stated that, although we can find a latent space in which to measure the similarity by using kernel CCA, standard kernel functions, e.g., a Gaussian kernel, cannot reflect the co-occurrence of different but related features in a kernel calculation between instances. To overcome these problems, we propose a new data representation for finding cross-domain instance matching. The proposed method assumes that each feature in the source feature set, f \in F^s, has a q-dimensional latent vector x_f \in R^q in a shared latent space. 
Likewise, each feature in the target feature set, g \in F^t, has a q-dimensional latent vector y_g \in R^q in the shared space. Since all the features in the source and target domains are mapped into a common shared space, the proposed method can capture the relationship between features both in each domain and across different domains. We define the sets of latent vectors in the source and target domains as X = {x_f}_{f \in F^s} and Y = {y_g}_{g \in F^t}, respectively.
The proposed method assumes that each instance is represented by a distribution (or multiset) of the latent vectors of the features that are contained in the instance. The ith instance in the source domain d^s_i is represented by a set of latent vectors X_i = {x_f}_{f \in d^s_i}, and the jth instance in the target domain d^t_j is represented by a set of latent vectors Y_j = {y_g}_{g \in d^t_j}. Note that X_i and Y_j lie in the same latent space.
In Section 3, we introduced the kernel embedding representation of a distribution and described how to measure the difference between two distributions when samples generated from the distributions are observed. In the proposed method, we employ the kernel embeddings of distributions to represent the distributions of the latent vectors for the instances. The kernel embedding representations for the ith source and the jth target domain instances are given by

m(X_i) = \frac{1}{|d^s_i|} \sum_{f \in d^s_i} k(\cdot, x_f), \qquad m(Y_j) = \frac{1}{|d^t_j|} \sum_{g \in d^t_j} k(\cdot, y_g).   (6)

Then, the difference between the distributions of the latent vectors is measured by using Eq. (3); that is, the difference between the ith source and the jth target domain instances is given by

D(X_i, Y_j) = ||m(X_i) - m(Y_j)||^2_{H_k}.   (7)

4.2 Model

The proposed method assumes that paired instances have similar distributions of latent vectors and unpaired instances have different distributions. 
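As an illustration of Eqs. (6)-(7) (a sketch, not the authors' code): given latent vectors for the features of two bag-of-words instances, D(X_i, Y_j) is again three Gram-matrix averages, and a source instance can be matched by ranking target instances by this difference. The feature names and the 2-dimensional latent vectors below (`latent_s`, `latent_t`, standing in for learned X and Y) are made up for illustration:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gram matrix of a Gaussian embedding kernel between latent vectors."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

def instance_distance(Xi, Yj, gamma=1.0):
    """D(Xi, Yj) = ||m(Xi) - m(Yj)||^2 of Eq. (7), computed from the latent
    vectors of the features contained in the two instances (Eq. (6))."""
    return (rbf(Xi, Xi, gamma).mean() + rbf(Yj, Yj, gamma).mean()
            - 2.0 * rbf(Xi, Yj, gamma).mean())

# Hypothetical 2-dimensional latent vectors (q = 2) for source/target features.
latent_s = {"pc": [0.9, 0.1], "computer": [0.8, 0.2], "taco": [-0.9, -0.3]}
latent_t = {"keisanki": [0.85, 0.15], "ryori": [-0.8, -0.2]}

def embed(instance, table):
    """Multiset of features -> array of latent vectors (one row per feature)."""
    return np.array([table[f] for f in instance], dtype=float)

source = ["pc", "computer"]          # a source-domain instance
targets = [["keisanki"], ["ryori"]]  # candidate target-domain instances

Xi = embed(source, latent_s)
dists = [instance_distance(Xi, embed(t, latent_t)) for t in targets]
best = int(np.argmin(dists))         # index of the matched target instance
print(best)  # -> 0: the "keisanki" instance is closer in the shared space
```

Ranking targets by the smallest D is exactly how matching is performed at test time (Section 4.4); the training step below determines the latent vectors so that such rankings put the true pairs first.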
In accordance with this assumption, we define the likelihood of the relationship between the ith source domain instance and the jth target domain instance as follows:

p(d^t_j | d^s_i, X, Y, \theta) = \frac{\exp(-D(X_i, Y_j))}{\sum_{j'=1}^{N} \exp(-D(X_i, Y_{j'}))},   (8)

where \theta is a set of hyper-parameters for the embedding kernel used in Eq. (6). Eq. (8) is in fact the conditional probability with which the jth target domain instance is chosen given the ith source domain instance. This formulation is more efficient than considering a bidirectional matching. Intuitively, when distribution X_i is more similar to Y_j than to the other distributions {Y_{j'} | j' \neq j}_{j'=1}^{N}, the probability has a higher value.
We define the posterior distribution of latent vectors X and Y. By placing Gaussian priors with precision parameter \rho > 0 on X and Y, that is, p(X | \rho) \propto \prod_{x \in X} \exp(-(\rho/2) ||x||^2_2) and p(Y | \rho) \propto \prod_{y \in Y} \exp(-(\rho/2) ||y||^2_2), the posterior distribution is given by

p(X, Y | O, \Theta) = \frac{1}{Z} \, p(X | \rho) \, p(Y | \rho) \prod_{i=1}^{N} p(d^t_i | d^s_i, X, Y, \theta),   (9)

where O = {(d^s_i, d^t_i)}_{i=1}^{N} is a training set of N instance pairs, \Theta = {\theta, \rho} is a set of hyper-parameters, and Z = \int\int p(X, Y, O, \Theta) \, dX \, dY is a marginal probability, which is constant with respect to X and Y.

4.3 Learning

We estimate latent vectors X and Y by maximizing the posterior probability of the latent vectors given by Eq. (9). Instead of Eq. (9), we consider the following negative logarithm of the posterior probability,

L(X, Y) = \sum_{i=1}^{N} \left\{ D(X_i, Y_i) + \log \sum_{j=1}^{N} \exp(-D(X_i, Y_j)) \right\} + \frac{\rho}{2} \left( \sum_{x \in X} ||x||^2_2 + \sum_{y \in Y} ||y||^2_2 \right),   (10)

and minimize it with respect to the latent vectors. Here, maximizing Eq. (9) is equivalent to minimizing Eq. (10). To minimize Eq. (10) with respect to X and Y, we perform a gradient-based optimization. The gradient of Eq. (10) with respect to each x_f \in X is given by

\frac{\partial L(X, Y)}{\partial x_f} = \sum_{i: f \in d^s_i} \left\{ \frac{\partial D(X_i, Y_i)}{\partial x_f} - \frac{1}{c_i} \sum_{j=1}^{N} e_{ij} \frac{\partial D(X_i, Y_j)}{\partial x_f} \right\} + \rho x_f,   (11)

where

e_{ij} = \exp(-D(X_i, Y_j)), \qquad c_i = \sum_{j=1}^{N} \exp(-D(X_i, Y_j)),   (12)

and the gradient of the difference between distributions X_i and Y_j with respect to x_f is given by

\frac{\partial D(X_i, Y_j)}{\partial x_f} = \frac{1}{|d^s_i|^2} \sum_{l \in d^s_i} \sum_{l' \in d^s_i} \frac{\partial k(x_l, x_{l'})}{\partial x_f} - \frac{2}{|d^s_i| |d^t_j|} \sum_{l \in d^s_i} \sum_{g \in d^t_j} \frac{\partial k(x_l, y_g)}{\partial x_f}.   (13)

When the distribution X_i does not include the latent vector x_f, the gradient is a zero vector. \partial k(x_l, x_{l'}) / \partial x_f is the gradient of the embedding kernel, which depends on the choice of kernel; when the embedding kernel is a Gaussian kernel, the gradient is calculated as with Eq. (15) in [21]. Similarly, the gradient of Eq. (10) with respect to each y_g \in Y is given by

\frac{\partial L(X, Y)}{\partial y_g} = \sum_{i: g \in d^t_i} \frac{\partial D(X_i, Y_i)}{\partial y_g} - \sum_{i=1}^{N} \frac{1}{c_i} \sum_{j: g \in d^t_j} e_{ij} \frac{\partial D(X_i, Y_j)}{\partial y_g} + \rho y_g,   (14)

where the gradient of the difference between distributions X_i and Y_j with respect to y_g can be calculated as with Eq. (13).
Learning is performed by alternately updating X using Eq. (11) and updating Y using Eq. (14) until the improvement in the negative log posterior Eq. (10) converges.

4.4 Matching

After the estimation of the latent vectors X and Y, the proposed method can reveal the matching between test instances. The matching is found by first measuring the difference between a given source domain instance and target domain instances using Eq. 
(7), and then searching for the instance pair with the smallest difference.

5 Experiments

In this section, we report our experimental results for three different types of cross-domain datasets: multi-lingual Wikipedia, document-tag and image-tag datasets.
Setup of proposed method. Throughout these experiments, we used a Gaussian kernel with parameter \gamma \geq 0, k(x_f, y_g) = \exp(-(\gamma/2) ||x_f - y_g||^2_2), as the embedding kernel. The hyper-parameters of the proposed method are the dimensionality of the shared latent space q, the regularizer parameter for the latent vectors \rho, and the Gaussian embedding kernel parameter \gamma. After training the proposed method with various hyper-parameters q \in {8, 10, 12}, \rho \in {0, 10^{-2}, 10^{-1}} and \gamma \in {10^{-1}, 10^0, ..., 10^3}, we chose the optimal hyper-parameters by using validation data. When training the proposed method, we initialized the latent vectors X and Y by applying principal component analysis (PCA) to a matrix concatenating the two feature-frequency matrices of the source and target domains. Then, we employed the L-BFGS method [23] with the gradients given by Eqs. (11) and (14) to learn the latent vectors.
Comparison methods. We compared the proposed method with the k-nearest neighbor method (KNN), canonical correlation analysis (CCA), kernel CCA (KCCA), bilingual latent Dirichlet allocation (BLDA), and kernel CCA with the kernel embeddings of distributions (KED-KCCA). For a test instance in the source domain, our KNN searches for the nearest neighbor source instances in the training data, and outputs the target instance in the test data that is located closest to the target instances paired with the retrieved source instances. CCA and KCCA first learn the projection of instances into a shared latent space using training data, and then find matching between instances by projecting the test instances into the shared latent space. KCCA used a Gaussian kernel for measuring the similarity between instances; the optimal Gaussian kernel parameter and regularizer parameter were chosen by using validation data. With BLDA, we first learned the same model as [1, 11] and found matching between instances in the test data by obtaining the topic distributions of these instances from the learned model. KED-KCCA uses the kernel embeddings of distributions described in Section 3 to obtain the kernel values between the instances. The vector representations of features were obtained by applying singular value decomposition (SVD) to the instance-feature frequency matrices. Here, we set the dimensionality of the vector representations to 100. Then, KED-KCCA learns kernel CCA with the kernel values, as with the above KCCA. With CCA, KCCA, BLDA and KED-KCCA, we chose the optimal latent dimensionality (or number of topics) within {10, 20, ..., 100} by using validation data.
Evaluation method. Throughout the experiments, we quantitatively evaluated the matching performance by using the precision with which the true target instance is included in a set of R candidate instances, S(R), found by each method. 
More formally, the precision is given by

Precision@R = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \delta(t_i \in S_i(R)),   (15)

where N_{te} is the number of test instances in the target domain, t_i is the ith true target instance, S_i(R) is the set of R candidate instances for the ith source instance, and \delta(\cdot) is the binary function that returns 1 if the argument is true, and 0 otherwise.

5.1 Matching between Bilingual Documents

With a multi-lingual Wikipedia document dataset, we examine whether the proposed method can find the correct matching between documents written in different languages. The dataset includes 34,024 Wikipedia documents for each of six languages: German (de), English (en), Finnish (fi), French (fr), Italian (it) and Japanese (ja), and documents with the same content are aligned across the languages. From the dataset, we create all 6C2 = 15 bilingual document pairs. We regard the first component of the pair as a source domain and the other as a target domain. For each of the bilingual document pairs, we randomly create 10 evaluation sets that consist of 1,000 document pairs as training data, 100 document pairs as validation data and 100 document pairs as test data. Here, each document is represented as a bag-of-words without stopwords and low-frequency words.

Figure 2: Precision of matching prediction and its standard deviation on multi-lingual Wikipedia datasets.

Table 1: Top five English documents matched by the proposed method and KCCA given five Japanese documents in the Wikipedia dataset. Titles in bold typeface indicate correct matching.
(a) Japanese input title: SD card
Proposed: Intel, SD card, Libavcodec, MPlayer, Freeware
KCCA: BBC World News, SD card, Morocco, Phoenix, 24 Hours of Le Mans
(b) Japanese input title: Anthrax
Proposed: Psittacosis, Anthrax, Dehydration, Isopoda, Cataract
KCCA: Dehydration, Psittacosis, Cataract, Hypergeometric distribution, Long Island Iced Tea
(c) Japanese input title: Doppler effect
Proposed: LU decomposition, Redshift, Doppler effect, Phenylalanine, Dehydration
KCCA: Long Island Iced Tea, Opportunity cost, Cataract, Hypergeometric distribution, Intel
(d) Japanese input title: Mexican cuisine
Proposed: Mexican cuisine, Long Island Iced Tea, Phoenix, Baldr, China Radio International
KCCA: Taoism, Chariot, Anthrax, Digital Millennium Copyright Act, Alexis de Tocqueville
(e) Japanese input title: Freeware
Proposed: BBC World News, Opportunity cost, Freeware, NFS, Intel
KCCA: Digital Millennium Copyright Act, China Radio International, Hypergeometric distribution, Taoism, Chariot

Figure 2 shows the matching precision for each of the bilingual pairs of the Wikipedia dataset. With all the bilingual pairs, the proposed method achieves significantly higher precision than the other methods over a wide range of R. Table 1 shows examples of predicted matching with the Japanese-English Wikipedia dataset. Compared with KCCA, which is the second best method, the proposed method can find both the correct document and many related documents. For example, in Table 1(a), the correct document title is "SD card". The proposed method outputs the SD card document and documents related to computer technology such as "Intel" and "MPlayer". 
This is because the proposed method can capture the relationship between words and reflect the difference between documents across different domains by learning the latent vectors of the words.

5.2 Matching between Documents and Tags, and between Images and Tags

We performed experiments on matching documents and tag lists, and on matching images and tag lists, with the datasets used in [3]. When matching documents and tag lists, we use datasets obtained from two social bookmarking sites, delicious1 and hatena2, and a patent dataset. The delicious and hatena datasets include pairs consisting of a web page and a tag list labeled by users, and the patent dataset includes pairs consisting of a patent description and a tag list representing the category of the patent. Each web page and each patent description are represented as a bag-of-words, as in the experiments using the Wikipedia dataset, and the tag list is represented as a set of tags. For the matching of images and tag lists, we use the flickr dataset, which consists of pairs of images and tag lists. Each image is represented as a bag-of-visual-words, which is obtained by first extracting features using SIFT, and then applying K-means clustering with 200 components to the SIFT features. For all the datasets, the numbers of training, test and validation pairs are 1,000, 100 and 100, respectively.

1https://delicious.com/
2http://b.hatena.ne.jp/

Figure 3: Precision of matching prediction and its standard deviation on delicious, hatena, patent and flickr datasets.

Figure 4: Two examples of input tag lists and the top five images matched by the proposed method on the flickr dataset.

Figure 3 shows the precision of the matching prediction of the proposed and comparison methods for the delicious, hatena, patent and flickr datasets. The precision of the comparison methods on these datasets was much the same as the precision of random prediction. 
Nevertheless,\nthe proposed method achieved very high precision particularly for the delicious, hatena and\npatent datasets. Figure 4 shows examples of input tag lists and the top \ufb01ve images matched by\nthe proposed method with the flickr dataset. In the examples, the proposed method found the\ncorrect images and similar related images from given tag lists.\n\n6 Conclusion\n\nWe have proposed a novel kernel-based method for addressing cross-domain instance matching tasks\nwith bag-of-words data. The proposed method represents each feature in all the domains as a latent\nvector in a shared latent space to capture the relationship between features. Each instance is rep-\nresented by a distribution of the latent vectors of features associated with the instance, which can\nbe regarded as samples from the unknown distribution corresponding to the instance. To calculate\ndifference between the distributions ef\ufb01ciently and nonparametrically, we employ the framework of\nkernel embeddings of distributions, and we learn the latent vectors so as to minimize the difference\nbetween the distributions of paired instances in a reproducing kernel Hilbert space. Experiments\non various types of cross-domain datasets con\ufb01rmed that the proposed method signi\ufb01cantly outper-\nforms the existing methods for cross-domain matching.\nAcknowledgments. This work was supported by JSPS Grant-in-Aid for JSPS Fellows (259867).\n\n8\n\n\fReferences\n[1] T Zhang, K Liu, and J Zhao. Cross Lingual Entity Linking with Bilingual Topic Model. In Proceedings\n\nof the Twenty-Third International Joint Conference on Arti\ufb01cial Intelligence, 2013.\n\n[2] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A Multi-View Embedding Space\nInternational Journal of Computer Vision,\n\nfor Modeling Internet Images, Tags, and Their Semantics.\n106(2):210\u2013233, oct 2013.\n\n[3] Tomoharu Iwata, T. Yamada, and N. Ueda. 
Modeling Social Annotation Data with Content Relevance using a Topic Model. In Advances in Neural Information Processing Systems, 2009.

[4] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer Learning for Collaborative Filtering via a Rating-Matrix Generative Model. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.

[5] H. Hotelling. Relations Between Two Sets of Variates. Biometrika, 28:321-377, 1936.

[6] S. Akaho. A Kernel Method for Canonical Correlation Analysis. In Proceedings of the International Meeting of the Psychometric Society, 2001.

[7] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis. In Advances in Neural Information Processing Systems, 2003.

[8] Yaoyong Li and John Shawe-Taylor. Using KCCA for Japanese-English Cross-Language Information Retrieval and Document Classification. Journal of Intelligent Information Systems, 27(2):117-133, 2006.

[9] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos. A New Approach to Cross-Modal Multimedia Retrieval. In Proceedings of the International Conference on Multimedia, 2010.

[10] Patrik Kamencay, Robert Hudec, Miroslav Benco, and Martina Zachariasova. 2D-3D Face Recognition Method Based on a Modified CCA-PCA Algorithm. International Journal of Advanced Robotic Systems, 2014.

[11] Tomoharu Iwata, Shinji Watanabe, and Hiroshi Sawada. Fashion Coordinates Recommender System Using Photographs from Fashion Magazines. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

[12] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal Deep Learning.
In Proceedings of the 28th International Conference on Machine Learning, pages 689-696, 2011.

[13] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep Canonical Correlation Analysis. In Proceedings of the 30th International Conference on Machine Learning, pages 1247-1255, 2013.

[14] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory, 2007.

[15] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems, 2008.

[16] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from Distributions via Support Measure Machines. In Advances in Neural Information Processing Systems, 2012.

[17] Krikamol Muandet and Bernhard Schölkopf. One-Class Support Measure Machines for Group Anomaly Detection. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.

[18] M. Dudik, S. J. Phillips, and R. E. Schapire. Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling. Journal of Machine Learning Research, 8:1217-1260, 2007.

[19] Dino Sejdinovic, Arthur Gretton, and Wicher Bergsma. A Kernel Test for Three-Variable Interactions. In Advances in Neural Information Processing Systems, 2013.

[20] Yuya Yoshikawa, Tomoharu Iwata, and Hiroshi Sawada. Latent Support Measure Machines for Bag-of-Words Data Classification. In Advances in Neural Information Processing Systems, 2014.

[21] Yuya Yoshikawa, Tomoharu Iwata, and Hiroshi Sawada. Non-linear Regression for Bag-of-Words Data via Gaussian Process Latent Variable Set Model. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

[22] Bharath K.
Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11:1517-1561, 2010.

[23] Dong C. Liu and Jorge Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(1-3):503-528, 1989.