{"title": "Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1497, "page_last": 1504, "abstract": "", "full_text": "Inferring a Semantic Representation of Text
via Cross-Language Correlation Analysis

Alexei Vinokourov, John Shawe-Taylor
Dept. Computer Science, Royal Holloway, University of London
Egham, Surrey, UK, TW20 0EX
alexei@cs.rhul.ac.uk, john@cs.rhul.ac.uk

Nello Cristianini
Dept. Statistics, University of California, Davis, USA
nello@support-vector.net

Abstract

The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus.
Using the semantic representation obtained in this way, we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such a representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.

1 Introduction

Most text retrieval or categorization methods depend on exact matches between words. Such methods will, however, fail to recognize relevant documents that do not share words with a user's query. One reason for this is that the standard representation models (e.g. boolean, standard vector, probabilistic) treat words as if they are independent, although it is clear that they are not. A central problem in this field is to automatically model term-term semantic interrelationships, in a way that improves retrieval, and possibly to do so in an unsupervised way or with a minimal amount of supervision. For example, latent semantic indexing (LSI) has been used to extract information about the co-occurrence of terms in the same documents, an indicator of semantic relations; this is achieved by singular value decomposition (SVD) of the term-document matrix. The LSI method has also been adapted to deal with the important problem of cross-language retrieval, where a query in one language is used to retrieve documents in a different language. Using a paired corpus (a set of pairs of documents, each pair being formed by two versions of the same text in two different languages), after merging each pair into a single 'document', we can interpret frequent co-occurrence of two terms in the same document as an indication of cross-linguistic correlation [5].
In this framework, a common vector space, including words from both languages, is created, and the training set is then analysed in this space using SVD. This method, termed CL-LSI, will be briefly discussed in Section 2. More generally, many other statistical and linear-algebraic methods have been used to obtain an improved semantic representation of text data over LSI [6]. In this study we address the problem of learning a semantic representation of text from a paired bilingual corpus, a problem that is important both for mono-lingual and cross-lingual applications. This problem can be regarded either as an unsupervised problem with paired documents, or as a supervised monolingual problem with very complex labels (i.e. the label of an English document could be its French counterpart). Either way, the data can be readily obtained without an explicit labeling effort, and furthermore there is no loss of information from compressing the meaning of a document into a discrete label. We employ kernel Canonical Correlation Analysis (KCCA) [1] to learn a representation of text that captures aspects of its meaning. Given a paired bilingual corpus, this method defines two embedding spaces for the documents of the corpus, one for each language, and an obvious one-to-one correspondence between points in the two spaces. KCCA then finds projections in the two embedding spaces for which the resulting projected values are highly correlated. In other words, it looks for particular combinations of words that appear to have the same co-occurrence patterns in the two languages.
Our hypothesis is that finding such correlations across a paired cross-lingual corpus will locate the underlying semantics, since we assume that the two languages are 'conditionally independent', or that the only thing they have in common is their meaning. The directions would carry information about the concepts that stood behind the process of generation of the text and, although expressed differently in the two languages, are nevertheless semantically equivalent. To illustrate such a representation we have printed the most probable (most typical) words in each language for some of the first few kernel canonical correlation components found for the bilingual 36th Canadian Parliament corpus (Hansards); for each component the first list contains English words and the second the corresponding French words:

PENSIONS PLAN
English: pension, plan, cpp, canadians, benefits, retirement, fund, tax, investment, income, finance, young, years, rate, superannuation, disability, taxes, mounted, future, premiums, seniors, country, rates, jobs, pay
French: régime, pensions, rpc, prestations, canadiens, retraite, cotisations, fonds, discours, impôt, revenu, jeunes, ans, pension, argent, régimes, investissement, milliards, prestation, plan, finances, pays, avenir, invalidité, résolution

AGRICULTURE
English: wheat, board, farmers, newfoundland, grain, party, amendment, producers, canadian, speaker, referendum, minister, directors, quebec, speech, school, system, marketing, provinces, constitution, throne, money, section
French: blé, commission, agriculteurs, producteurs, canadienne, grain, parti, conseil, commercialisation, neuve, ministre, administration, modification, québec, terre, réformistes, partis, grains, nationale, élus, bloc, référendum, majorité, nations, chambre

CANADIAN LANDS
English: park, land, aboriginal, yukon, marine, government, valley, water, boards, territories, board, north, parks, resource, agreements, northwest, resources, development, treaty, nations, work, territory, atlantic, programs, rights
French: parc, autochtones, terres, pêches, vallée, ressources, yukon, nord, gouvernement, offices, marin, eaux, territoires, parcs, nations, territoriales, revendications, ministre, pêcheurs, ouest, entente, territoire, office, atlantique, ententes

FISHING INDUSTRY
English: fisheries, atlantic, operatives, fishermen, newfoundland, fishery, problem, operative, fishing, industry, fish, years, problems, wheat, coast, oceans, west, salmon, tags, minister, communities, program, commission, motion, stocks
French: pêches, atlantique, pêcheurs, pêche, problème, coopératives, ans, industrie, poisson, terre-neuve, ouest, stocks, ministre, santé, saumon, affaiblies, faculté, secteur, programme, région, scientifiques, travailler, conduite

Such directions are then used to calculate the coordinates of the documents in a 'language independent' way. Of course, particular statistical care is needed to exclude 'spurious' correlations. We show that the correlations we find are not the effect of chance: the correlation existing between certain sets of words in English and French documents cannot be explained as a random correlation. Hence we need to explain it by means of relations between the generative processes of the two versions of the documents, which we assume to be conditionally independent given the topic or content. Under such assumptions, such correlations detect similarities in content between the two documents, and can be exploited to derive a semantic representation of the text. This representation is then used for retrieval tasks, providing better performance than existing techniques.
We first apply the method to cross-lingual information retrieval, comparing performance with a related approach based on latent semantic indexing (LSI) described below [5]. Secondly, we treat the second language as a complex label for the first-language document and view the projection obtained by CL-KCCA as a semantic map for use in a multilingual classification task, with very encouraging results. From the computational point of view, we detect such correlations by solving an eigenproblem, thus avoiding problems like local minima, and we do so by using kernels.

The KCCA machinery will be given in Section 3; Section 4 will show how to apply KCCA to cross-lingual retrieval and will describe the monolingual applications. Finally, results will be presented in Section 5.

2 Previous work

The use of LSI for cross-language retrieval was proposed by [5]. LSI uses a method from linear algebra, singular value decomposition, to discover the important associative relationships. An initial sample of documents is translated by human or, perhaps, by machine, to create a set of dual-language training documents {(x_i, y_i)}, i = 1, ..., N. After preprocessing, a common vector space, including words from both languages, is created and the training set is analysed in this space using SVD: D = UΣV′, where the i-th column of D corresponds to document i, with its first set of coordinates giving the first-language features and the second set the second-language features. To translate a new document (query) q to a language-independent representation, one takes its expanded vector representation q̂ (filled up with zero components for the other language) and projects (folds in) q̂ onto the space spanned by the first k eigenvectors U_k:

q̂ → U_k′ q̂.    (1)

The similarity between two documents is measured as the inner product between their projections. The documents that are most similar to the query are considered to be relevant.

3 Kernel Canonical Correlation Analysis

In this study our aim is to find an appropriate language-independent representation. Suppose, as for cross-lingual LSI (CL-LSI), we are given aligned texts in, for simplicity, two languages, i.e. every text x_i in one language is a translation of the text y_i in the other language, or vice versa. Our hypothesis is that, having the corpus {x_i} and the corpus {y_i} mapped to high-dimensional feature spaces via φ1: x → φ1(x) and φ2: y → φ2(y) (with K1 and K2 being respectively the kernels of the two mappings, i.e. the matrices of inner products between the images of all the data points [2]), we can learn (semantic) directions w1 = Σ_i α_i φ1(x_i) and w2 = Σ_i β_i φ2(y_i) in those spaces such that the projections ⟨w1, φ1(x_i)⟩ and ⟨w2, φ2(y_i)⟩ of the input data images from the two languages are maximally correlated.
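A minimal sketch of the CL-LSI fold-in described above, assuming dense NumPy term-document matrices (the function names and toy data are ours, not from the paper's implementation):

```python
import numpy as np

def cl_lsi_fit(E, F, k):
    """Fit CL-LSI on a paired corpus.

    E: (English terms x docs) term-document matrix.
    F: (French terms x docs) term-document matrix; column j of F
       is the translation of column j of E.
    Returns the top-k left singular vectors of the stacked matrix,
    a basis for the shared 'semantic' space.
    """
    D = np.vstack([E, F])                       # merge each pair into one 'document'
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k]

def fold_in(U_k, q, lang, n_en, n_fr):
    """Project a monolingual query by zero-padding the other language's block."""
    full = np.zeros(n_en + n_fr)
    if lang == "en":
        full[:n_en] = q
    else:
        full[n_en:] = q
    return U_k.T @ full                         # q_hat -> U_k' q_hat

# toy usage: score French documents against an English query
rng = np.random.default_rng(0)
E, F = rng.random((50, 20)), rng.random((40, 20))
U_k = cl_lsi_fit(E, F, k=5)
q_en = fold_in(U_k, E[:, 3], "en", 50, 40)
scores = [q_en @ fold_in(U_k, F[:, j], "fr", 50, 40) for j in range(20)]
```

Cross-language retrieval then reduces to ranking the fold-in projections of documents in the other language by their inner product with the projected query.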
We have thus intuitively arrived at the notion of a kernel canonical correlation ρ(K1, K2), which is defined as

ρ = max_{w1, w2} Σ_i ⟨w1, φ1(x_i)⟩⟨w2, φ2(y_i)⟩ / ( (Σ_i ⟨w1, φ1(x_i)⟩²)^{1/2} (Σ_i ⟨w2, φ2(y_i)⟩²)^{1/2} ).    (2)

We search for w1 and w2 in the space spanned by the φ-images of the data points (a reproducing kernel Hilbert space, RKHS [2]): w1 = Σ_i α_i φ1(x_i) and w2 = Σ_i β_i φ2(y_i), where α is the vector with components α_i and β is the vector with components β_i. This rewrites the numerator of (2) as

Σ_i ⟨w1, φ1(x_i)⟩⟨w2, φ2(y_i)⟩ = α′K1K2β,    (3)

and problem (2) can then be reformulated as

ρ = max_{α, β} α′K1K2β / (‖K1α‖ ‖K2β‖).    (4)

Once we have moved to a kernel-defined feature space, the extra flexibility introduced means that there is a danger of overfitting. By this we mean that we can find spurious correlations by using large weight vectors to project the data so that the two projections are completely aligned. For example, if the data are linearly independent in both feature spaces, we can find linear transformations that map the input data to an orthogonal basis in each feature space. It is then possible to find perfect correlations between the two representations. Using kernel functions will frequently result in linear independence of the training set, for example when using Gaussian kernels. It is clear therefore that we will need to introduce a control on the flexibility of the projection mappings w1 and w2. To do that in the spirit of Partial Least Squares (PLS) we would add a multiple of the squared 2-norm,

‖w1‖² = α′K1α,    (5)

in the denominator. Convexly combining the PLS regularization term (5) and the KCCA term ‖K1α‖²,

(1 − κ)‖K1α‖² + κ α′K1α,    (6)

we substitute its square root into the denominator of (4) in place of ‖K1α‖, and do the same for ‖K2β‖:

ρ = max_{α, β} α′K1K2β / ( ((1 − κ)‖K1α‖² + κ α′K1α)^{1/2} ((1 − κ)‖K2β‖² + κ β′K2β)^{1/2} ).    (7)

Differentiating the expression under the max with respect to α and equating the derivative to zero, taking into account that α can be normalised so that (1 − κ)‖K1α‖² + κ α′K1α = 1, we obtain

K1K2β = ρ ((1 − κ)K1² + κK1) α.    (8)

Similar operations for β yield an analogous equation, which together with (8) can be written in matrix form:

( 0     K1K2 ) (α)  =  ρ ( (1 − κ)K1² + κK1          0          ) (α).    (9)
( K2K1   0   ) (β)       (        0          (1 − κ)K2² + κK2   ) (β)
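For a modest number of paired documents, the regularised eigenproblem (9) is small enough to be solved directly with dense linear algebra. A minimal sketch, under the formulation above (the paper instead uses an incomplete Cholesky decomposition; the function name, the small ridge term and the toy usage are ours):

```python
import numpy as np
from scipy.linalg import eigh

def kcca(K1, K2, kappa=0.1):
    """Solve the regularised kernel CCA generalised eigenproblem (9).

    K1, K2: (N x N) kernel (Gram) matrices of the two views.
    kappa:  regularisation parameter controlling capacity.
    Returns eigenvalues rho and dual vectors (alpha, beta),
    sorted by rho in descending order.
    """
    N = K1.shape[0]
    Z = np.zeros((N, N))
    # left-hand side: off-diagonal correlation blocks
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    # right-hand side: regularised within-view blocks
    B = np.block([[(1 - kappa) * K1 @ K1 + kappa * K1, Z],
                  [Z, (1 - kappa) * K2 @ K2 + kappa * K2]])
    # a tiny ridge keeps B positive definite for the dense solver
    rho, W = eigh(A, B + 1e-6 * np.eye(2 * N))
    order = np.argsort(-rho)
    rho, W = rho[order], W[:, order]
    return rho, W[:N], W[N:]
```

The leading eigenvalues give the most correlated pairs of directions; the spectrum is symmetric about zero, since flipping the sign of α (or β) flips the sign of ρ.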
and<DB\n\u0013F'\nis the average per point correlation between projections<DB\nwhere\n;=<\n;=<\n2?>\n<E<\u00138A9\n\u001e(10)\n2?>\n<E<\u00138A9\n\n<\u000f8:9\n?B./.\n\rKJ\n./.\n\n, and equating the derivative to zero we obtain\n\n\u001f :\n.0.\n?;.0.\n\u001b\u0001\n\n.0.\n<\u00138A9\n\n, taking into account that\n\n\u000f\u001b\n\n.0.\n\n\u001f\b?\n\n(8)\n\n(9)\n\n?B./.\n\n(5)\n\n(6)\n\n(7)\n\n#54\n\n./.\n\n, and\n\n.0.\n\n.0.\n\n:\n\n:\n:\n\n\u0001\n\u0011\n'\n\u0001\n'\n\n\u0001\n\u001d\n\u0003\n\u0011\n'\n\u0001\n\u001e\n\u001d\n\u0003\n\u0011\n'\n\u0001\n\u0003\n\u0013\n'\n\u0016\n!\n\u0013\n\u0011\n\n#\n\u0001\n#\n'\n\u0016\n'\n+\n\u0003\n\u0011\n'\n\u0001\n\u0003\n\u0013\n'\n\u0016\n\u0003\n\n%\n%\nA\n\u0011\nA\n\u0013\n)\n%\n#\n)\n'\n\n\u0001\n,\n\u0010\n-\n%\n%\nA\n\u0011\nA\n\u0013\n)\nA\n\u0011\n%\nA\n\u0013\n)\n2\nB\n\u0011\n\u001f\n\n+\n#\n%\n#\n\u0001\n#\n?\n'\n+\n%\n#\n4\n\u0001\n#\n4\n\n2\n%\n%\nA\n\u0011\n%\nA\n\u0011\n%\n2\nA\n\u0011\n%\n2\nB\n\u0011\n\u001f\n\n2\n?\n%\n%\nA\n\u001f\n\u0011\n%\n<\n2\n%\n%\nA\n\u0011\n%\n\n%\n%\nA\n\u0011\n2\n?\nA\n\u0011\n<\n?\n%\nA\n\u0011\n%\n)\n,\n\u0010\n-\n%\n%\nA\n\u0011\nA\n\u0013\n)\n@\n<\n2\nA\n\u0011\n%\n\u001f\n<\n2\nB\n\u0011\n2\nA\n\u0013\n)\n\u001f\n<\n2\nB\n\u0013\n\nD\nD\nI\n,\n%\n%\nA\n\u0011\n%\nA\n\u0011\n%\nA\n\u0011\nA\n\u0013\n2\nA\n\u0011\n%\n\u001f\n<\n2\nB\n\u0011\n9\n%\n%\nA\n\u0011\nA\n\u0013\n)\n2\n?\nA\n\u0011\n<\n?\nA\n\u0011\n%\n2\nA\n\u0011\n%\n\u001f\n<\n2\nB\n\u0011\n\u001f\n\n\n\u000f\nR\n\u0011\n'\n%\n%\nA\n\u0011\nA\n\u0013\n)\nQ\n\n\u001c\nT\nA\n\u0011\nA\n\u0013\nA\n\u0013\nA\n\u0011\nT\n\u001e\n'\n\u001c\n2\n?\nA\n\u0011\n<\n?\nA\n\u0011\nT\nT\n2\n?\nA\n\u0013\n<\n?\nA\n\u0013\n\fTable 1: Statistics for \u2019House debates\u2019 of the 36 \u0002\u0001 Canadian Parliament proceedings corpus.\n\nSENTENCE\nPAIRS\n\nTRAINING\nTESTING 1\n\n948K\n62K\n\nENGLISH\nWORDS\n14,614K\n995K\n\nFRENCH\nWORDS\n15,657K\n1067K\n\n\u0001\n\nallows us, after simple transformations, to rewrite 
it as a standard eigenvalue problem\n\n. Equation (9) is known as a generalised eigenvalue problem.The\nis to perform incom-\n\nwhereR\n%\u0003\u0002\nstandard approach to the solution of (9) in the case of a symmetric\u000f\nR which\nplete Cholesky decomposition of the matrix \u000f\n: \u000f\n. We will discuss how to choose2\n\u0004\t\b\nIt is easy to see that if% or) changes sign in (9), also changes sign. Thus, the spectrum\nof the problem (9) has paired positive and negative values between9\n\n4 Applications of KCCA\n\n8 and8 .\n\nin Section 5.\n\n\u0007\u0004\n\n\u0005\u0004\n\nand de\ufb01ne\n\n\n\u0006\n\n\u0004\t\b\n\nCross-linguistic retrieval with KCCA. The kernel CCA procedure identi\ufb01es a set of pro-\njections from both languages into a common semantic space. This provides a natural frame-\nwork for performing cross-language information retrieval. We \ufb01rst select a number\nof\n\n-correlation components:\n\nmatrix whose columns are the \ufb01rst solutions\nof (9) for the given language sorted by eigenvalue in descending order. Here we assumed\nis the training corpus in the given language:\n\nit onto the\n\nsemantic dimensions,8\r\f\u000e\u000b\u000f\f\ncoming query* we expand*\ncanonical:\nvector for that language, where \u0011\n?9?\nthat <G;=<\u0016\u0015\n;=<\u0017\u0010\n? or \u0013\n\u0018\u0019\u0018\u0019\u0018\n\ninto the vector representation for its language\n\n, with largest correlation values . To process an in-\n* and project\n* using the appropriate\nis a1\n* where \u0013\n\n\u0012\u0011\n? .\n\nUsing the semantic space in text categorisation. The semantic vectors in the given lan-\nguage\ncan be exported and used in some other application, for example, Support\nVector Machine classi\ufb01cation. 
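Given the dual solutions, cross-language retrieval reduces to projecting the query and the candidate documents into the shared semantic space and comparing them there. A minimal sketch, assuming A_en and A_fr hold the first k solutions of (9) for each language (all names are illustrative):

```python
import numpy as np

def semantic_coords(A_k, X, q):
    """Map a term vector q to the k-dimensional semantic space: A_k' X' q.

    A_k: (N x k) dual solutions for q's language.
    X:   (terms x N) training term-document matrix for that language.
    """
    return A_k.T @ (X.T @ q)

def mate_retrieval(A_en, X_en, A_fr, X_fr, query_en, docs_fr):
    """Rank French documents (columns of docs_fr) against an English query
    by cosine similarity in the shared semantic space."""
    q = semantic_coords(A_en, X_en, query_en)
    scores = []
    for d in docs_fr.T:
        v = semantic_coords(A_fr, X_fr, d)
        scores.append(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12))
    return np.argsort(scores)[::-1]      # best-matching documents first
```

In the mate-retrieval experiments below, retrieval is counted as correct when the top-ranked document is the query's translation.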
We first find the common features of the training data used to extract the semantics and of the data used to train the SVM classifier, cut the features that are not common, and compute the new kernel, which is simply the inner product of the projected data:

K̂(q1, q2) = q1′ X A_k A_k′ X′ q2.    (11)

The term-term relationship matrix X A_k A_k′ X′ can be computed only once and stored for further use in the SVM learning process and classification.

5 Experiments

Experimental setup. Following [5] we conducted a series of experiments with the Hansard collection [3] to measure the ability of CL-LSI and CL-KCCA to find, for any document from a test collection in one language, its mate in the other language. The whole collection consists of 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the 36th Canadian Parliament proceedings. In our experiments we used only the 'house debates' part, for which statistics are given in Table 1. As a testing collection we used only 'testing 1'. The raw text was split into sentences with Adwait Ratnaparkhi's MXTERMINATOR and the sentences were aligned with I. Dan Melamed's GSA tool (for details on the collection and also for the source see [3]).

The text chunks were split into 'paragraphs' based on '***' delimiters and these 'paragraphs' were treated as separate documents. After removing stop-words and rare words (i.e. words appearing less than three times) in both the French and English parts, we obtained term-by-document 'English' and 'French' matrices (we also removed a few documents that appeared to be problematic when split into paragraphs). As these matrices were still too large to perform SVD and KCCA on them, we split the whole collection into 14 chunks of about 910 documents each and conducted the experiments separately on each chunk, measuring the performance of the methods each time on a 917-document test collection. The results were then averaged. We have also trained the CL-KCCA method on randomly reassociated French-English document pairs and observed a test accuracy far lower than the results on the non-random original data. It is worth noting that CL-KCCA behaves differently from CL-LSI over the full scale of the spectrum. While CL-LSI only increases its performance as more eigenvectors are taken from the lower part of the spectrum (which is, somewhat unexpectedly, quite different from its behaviour in the monolinguistic setting), CL-KCCA's performance, on the contrary, tends to deteriorate as the dimensionality of the semantic subspace approaches the dimensionality of the input data space.

The partial singular value decomposition of the matrices was done using Matlab's 'svds' function, and the full SVD was performed using the 'kernel trick' discussed in the previous section and the 'svd' function, which took about 2 minutes to compute on a Linux Pentium III 1GHz system for a selection of 1000 documents. The Matlab implementation of KCCA using the same 'svd' function, which solves the generalised eigenvalue problem through incomplete Cholesky decomposition, took about 8 minutes to compute on the same data.

Table 2: Average accuracy of top-rank (first retrieved) English-to-French retrieval, % (left), and average precision of English-to-French retrieval over a set of fixed recalls, % (right).

          100  200  300  400  full      100  200  300  400  full
cl-lsi     73   78   80   82   84        91   93   95   97   98
cl-kcca    91   91   91   91   98        99   99   99   99   99

Mate retrieval. The results are presented in Table 2. Only one document, the mate document in French, was considered relevant to each of the test English documents, which were treated as queries; the relative number of correctly retrieved documents was computed (Table 2), along with the average precision over a set of fixed recalls. Very similar results (omitted here) were obtained when French documents were treated as queries and English documents as the test collection.
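The projected kernel of equation (11) amounts to one precomputed term-term matrix. A minimal sketch, with X the training term-document matrix on the common features and A_k the selected KCCA solutions (names are ours):

```python
import numpy as np

def semantic_kernel_matrix(A_k, X):
    """Term-term relationship matrix M = X A_k A_k' X' of equation (11).

    Computed once and stored; the SVM then evaluates
    K(q1, q2) = q1' M q2 on term vectors.
    """
    P = X @ A_k          # (terms x k) projected term representations
    return P @ P.T       # (terms x terms), symmetric positive semidefinite

def semantic_kernel(M, Q1, Q2):
    """Gram matrix between two document sets (columns are term vectors)."""
    return Q1.T @ M @ Q2
```

Because M factors as P P′, the kernel is exactly the inner product of the documents after projection onto the semantic subspace.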
As one can see from Table 2, CL-KCCA seems to capture most of the semantics in the first few components, achieving most of its accuracy with as little as 100 components, where CL-LSI needs all components for a similar figure.

Selecting the regularization parameter. The regularization parameter κ in (6) not only makes problem (9) numerically well-posed, but also provides control over the capacity of the function space where the solution is sought. The larger the value of κ, the less sensitive the method is to the input data, and therefore the more stable (less prone to finding spurious relations) the solution becomes. We should thus observe an increase in the 'reliability' of the solution. We measure the ability of the method to catch useful signal by comparing the solutions on the original input and on 'random' data. The 'random' data is constructed by random re-associations of the data pairs: the dataset (E, F̃) denotes an English-French parallel corpus obtained from the original English-French aligned collection (E, F) by reshuffling the French (equivalently, English) documents. Suppose λ(E, F) denotes the (positive part of the) spectrum of the KCCA solution on the paired dataset (E, F). If the method is overfitting the data it will be able to find perfect correlations even on the reshuffled corpus, and hence ‖λ(E, F̃)‖ ≈ ‖1‖, where 1 is the all-one vector. We therefore use this as a measure to assess the degree of overfitting. The three graphs in Figure 1 compare the norms of the spectra on the paired and on the randomly re-associated data as functions of the regularization parameter κ. For small values of κ the spectrum in all the tests is close to the all-one spectrum. This indicates overfitting, since the method is able to find correlations even in randomly associated pairs. As κ increases, the spectrum of the randomly associated data moves far from all-one, while that of the paired documents remains correlated. This observation can be exploited for choosing the optimal value of κ: from the middle and right graphs in Figure 1 it can be read off as the point where the two curves separate, and a value chosen in this way was used for the experiments reported in this study.

Figure 1: Norms of the positive KCCA spectrum for the paired corpus (E, F) and for the randomly re-associated corpus (E, F̃), as functions of the regularization parameter κ. (Graphs were obtained for the regularization scheme discussed in [1].)

Pseudo query test. To perform a more realistic test we generated short queries of the kind most likely to occur in search engines, consisting of the 5 most probable words from each test document.
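The randomisation test described above is straightforward to reproduce: compare the norm of the positive correlation spectrum on the true pairing against the norm obtained after re-associating the two sides at random. A minimal sketch built on the generalised eigenproblem (9) as written above (the function names and the small ridge term are ours):

```python
import numpy as np
from scipy.linalg import eigh

def correlation_spectrum(K1, K2, kappa):
    """Positive part of the regularised KCCA spectrum, cf. equation (9)."""
    N = K1.shape[0]
    Z = np.zeros((N, N))
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    B = np.block([[(1 - kappa) * K1 @ K1 + kappa * K1, Z],
                  [Z, (1 - kappa) * K2 @ K2 + kappa * K2]])
    rho = eigh(A, B + 1e-6 * np.eye(2 * N), eigvals_only=True)
    return rho[rho > 0]

def overfitting_gap(K1, K2, kappa, rng):
    """Spectrum-norm gap between the true pairing and a random re-association.

    A small gap means even shuffled pairs look perfectly correlated
    (overfitting); a large gap means kappa suppresses spurious
    correlations while the real signal survives.
    """
    perm = rng.permutation(K2.shape[0])
    true_norm = np.linalg.norm(correlation_spectrum(K1, K2, kappa))
    rand_norm = np.linalg.norm(correlation_spectrum(K1, K2[np.ix_(perm, perm)], kappa))
    return true_norm - rand_norm
```

Sweeping kappa and plotting the two norms reproduces the qualitative behaviour of Figure 1: the curves coincide near zero and separate as kappa grows.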
The relevant documents were the test documents themselves in monolinguistic retrieval (English query - English document) and their mates in the cross-linguistic (English query - French document) test. Table 3 shows the relative number of correctly retrieved top-ranked English documents for English queries (left) and the relative number of correctly retrieved documents in the top ten ranked (right). Table 4 provides analogous results for cross-linguistic retrieval.

Table 3: English-to-English top-ranked retrieval accuracy, % (left), and English-to-English top-ten retrieval accuracy, % (right).

          100  200  300  400  full      100  200  300  400  full
cl-lsi     53   60   64   66   70        82   86   88   89   91
cl-kcca    60   63   70   71   73        90   93   94   95   95

Table 4: English-to-French top-ranked retrieval accuracy, % (left), and English-to-French top-ten retrieval accuracy, % (right).

          100  200  300  400  full      100  200  300  400  full
cl-lsi     30   38   42   45   49        67   75   79   81   84
cl-kcca    68   75   78   79   81        94   96   97   98   98

Text categorisation using semantics learned on a completely different corpus.
The semantics (300 vectors) extracted from the Canadian Parliament corpus (Hansard) was used in Support Vector Machine (SVM) text classification [2] of the Reuters-21578 corpus (Table 5). In this experimental setting, the intersection of the vector spaces of the Hansards (5159 English words from the first 1000-French-English-document training chunk) and of the Reuters ModApte split (9962 words from the 9602 training and 3299 test documents) had 1473 words. The extracted 300 KCCA vectors from the English and French parts (row 'CL-KCCA' of Table 5) and 300 eigenvectors from the same data (row 'CL-LSI') were used in the SVM^light implementation [4] with the kernel (11) to classify the Reuters-21578 data. The experiments were averaged over 10 runs, with a 5% fraction of the training data randomly chosen each time, as the difference between bag-of-words and semantic methods is more contrasting on smaller samples. Both CL-KCCA and CL-LSI perform remarkably well when one considers that they are based on just 1473 words. In all cases CL-KCCA outperforms the bag-of-words kernel.

Table 5: F1 value, %, averaged over 10 subsequent runs of the SVM classifier with the original Reuters-21578 data ('bag-of-words') and preprocessed using the semantics (300 vectors) extracted from the Canadian Parliament corpus by various methods.

CLASS          EARN   ACQ   GRAIN   CRUDE
BAG-OF-WORDS    81     57     33      13
CL-KCCA         90     75     43      38
CL-LSI          77     52     64      40

6 Conclusions

We have presented a novel procedure for extracting semantic information in an unsupervised way from a bilingual corpus, and we have used it in text retrieval applications. Our main findings are that the correlations existing between certain sets of words in English and French documents cannot be explained as random correlations; hence we need to explain them by means of relations between the generative processes of the two versions of the documents. The correlations detect similarities in content between the two documents, and can be exploited to derive a semantic representation of the text. The representation is then used for retrieval tasks, providing better performance than existing techniques.

References

[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.

[2] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[3] Ulrich Germann. Aligned Hansards of the 36th Parliament of Canada. http://www.isi.edu/natural-language/download/hansard/, 2001.
Release 2001-1a.

[4] Thorsten Joachims. SVM^light - Support Vector Machine. http://svmlight.joachims.org, 2002.

[5] M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross Language Information Retrieval. Kluwer, 1998.

[6] Alexei Vinokourov and Mark Girolami. A probabilistic framework for the hierarchic organisation and classification of document collections. Journal of Intelligent Information Systems, 18(2/3):153-172, 2002. Special Issue on Automated Text Categorization.
", "award": [], "sourceid": 2324, "authors": [{"given_name": "Alexei", "family_name": "Vinokourov", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}