{"title": "HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 1689, "page_last": 1696, "abstract": "We present a novel paradigm for statistical machine translation (SMT), based on joint modeling of word alignment and the topical aspects underlying bilingual document pairs via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this new paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of matching words between languages, during likelihood-based training of topic-dependent translational lexicons, as well as topic representations in each language. The resulting trained HM-BiTAM can not only display topic patterns like other methods such as LDA, but now for bilingual corpora; it also offers a principled way of inferring optimal translation in a context-dependent way. Our method integrates the conventional IBM Models based on HMM --- a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model, and we report an extensive empirical analysis (in many way complementary to the description-oriented of our method in three aspects: word alignment, bilingual topic representation, and translation.", "full_text": "HM-BiTAM: Bilingual Topic Exploration, Word\n\nAlignment, and Translation\n\nBing Zhao\n\nIBM T. J. Watson Research\n\nzhaob@us.ibm.com\n\nEric P. 
Xing\n\nCarnegie Mellon University\nepxing@cs.cmu.edu\n\nAbstract\n\nWe present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM, a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.\n\n1 Introduction\nMost contemporary SMT systems view parallel data as independent sentence-pairs whether or not they are from the same document-pair. Consequently, translation models are learned only at sentence-pair level, and document contexts, essential factors for translating documents, are generally overlooked. Indeed, translating documents differs considerably from translating a group of unrelated sentences. A sentence, when taken out of the context of its document, is generally more ambiguous and less informative for translation. 
One should avoid destroying a coherent document by simply translating it into a group of sentences which are indifferent to each other and detached from the context.\n\nDevelopments in statistics, genetics, and machine learning have shown that latent semantic aspects of complex data can often be captured by a model known as the statistical admixture (or mixed membership model [4]). Statistically, an object is said to be derived from an admixture if it consists of a bag of elements, each sampled independently or coupled in a certain way, from a mixture model. In the context of SMT, each parallel document-pair is treated as one such object. Depending on the chosen modeling granularity, all sentence-pairs or word-pairs in a document-pair correspond to the basic elements constituting the object, and the mixture from which the elements are sampled can correspond to a collection of translation lexicons and monolingual word frequencies based on different topics (e.g., economics, politics, sports, etc.). Variants of admixture models have appeared in population genetics [6] and text modeling [1, 4].\nRecently, a Bilingual Topic-AdMixture (BiTAM) model was proposed to capture the topical aspects of SMT [10]; word-pairs from a parallel document-pair follow the same weighted mixtures of translation lexicons, inferred for the given document-context. The BiTAMs generalize over IBM Model-1; they are efficient to learn and scalable for large training data. However, they do not capture locality constraints of word alignment, i.e., words \u201cclose-in-source\u201d are usually aligned to words \u201cclose-in-target\u201d, under document-specific topical assignment. 
To incorporate such constraints, we integrate the strengths of both HMM and BiTAM, and propose a Hidden Markov Bilingual Topic-AdMixture model, or HM-BiTAM, for word alignment to leverage both locality constraints and topical context underlying parallel document-pairs.\n\nIn the HM-BiTAM framework, one can estimate topic-specific word-to-word translation lexicons (lexical mappings), as well as the monolingual topic-specific word-frequencies for both languages, based on parallel document-pairs. The resulting model offers a principled way of inferring optimal translation from a given source language in a context-dependent fashion. We report an extensive empirical analysis of HM-BiTAM, in comparison with related methods. We show our model\u2019s effectiveness on the word-alignment task; we also demonstrate two application aspects which were untouched in [10]: the utility of HM-BiTAM for bilingual topic exploration, and its application for improving translation quality.\n\n2 Revisit HMM for SMT\n\nAn SMT system can be formulated as a noisy-channel model [2]:\n\n$$e^{*} = \\arg\\max_{e} P(e|f) = \\arg\\max_{e} P(f|e)\\,P(e), \\tag{1}$$\n\nwhere a translation corresponds to searching for the target sentence $e^{*}$ which best explains the source sentence $f$. The key component is $P(f|e)$, the translation model; $P(e)$ is the monolingual language model. In this paper, we generalize $P(f|e)$ with topic-admixture models.\nAn HMM implements the \u201cproximity-bias\u201d assumption that words \u201cclose-in-source\u201d are aligned to words \u201cclose-in-target\u201d, which is effective for improving word alignment accuracies, especially for linguistically close language-pairs [8]. Following [8], to model word-to-word translation, we introduce the mapping $j \\to a_j$, which assigns a French word $f_j$ in position $j$ to an English word $e_i$ in position $i = a_j$, denoted as $e_{a_j}$. 
Each (ordered) French word $f_j$ is an observation, and it is generated by an HMM state defined as $[e_{a_j}, a_j]$, where the alignment indicator $a_j$ for position $j$ is considered to have a dependency on the previous alignment $a_{j-1}$. Thus a first-order HMM for an alignment between $e \\equiv e_{1:I}$ and $f \\equiv f_{1:J}$ is defined as:\n\n$$p(f_{1:J}|e_{1:I}) = \\sum_{a_{1:J}} \\prod_{j=1}^{J} p(f_j|e_{a_j})\\,p(a_j|a_{j-1}), \\tag{2}$$\n\nwhere $p(a_j|a_{j-1})$ is the state transition probability; $J$ and $I$ are sentence lengths of the French and English sentences, respectively. The transition model enforces the proximity-bias. An additional pseudo word \u201cNULL\u201d is used at the beginning of English sentences for HMM to start with. The HMM implemented in GIZA++ [5] is used as our baseline, which includes refinements such as special treatment of a jump to a NULL word. A graphical model representation for such an HMM is illustrated in Figure 1 (a).\n\nFigure 1: The graphical model representations of (a) HMM, and (b) HM-BiTAM, for parallel corpora. Circles represent random variables, hexagons denote parameters, and observed variables are shaded.\n\n3 Hidden Markov Bilingual Topic-AdMixture\nWe assume that in training corpora of bilingual documents, the document-pair boundaries are known, and indeed they serve as the key information for defining document-specific topic weights underlying aligned sentence-pairs or word-pairs. To simplify the outline, the topics here are sampled at sentence-pair level; topics sampled at word-pair level can be easily derived following the outlined algorithms, in the same spirit of [10]. 
Given a document-pair (F, E) containing N parallel sentence-pairs $(e_n, f_n)$, HM-BiTAM implements the following generative scheme.\n\n3.1 Generative Scheme of HM-BiTAM\nGiven a conjugate prior Dirichlet($\\alpha$), the topic-weight vector (hereafter, TWV) $\\theta_m$ for each document-pair $(F_m, E_m)$ is sampled independently. Let the non-underscripted $\\theta$ denote the TWV of a typical document-pair (F, E); let a collection of topic-specific translation lexicons be $B \\equiv \\{B_k\\}$, where $B_{i,j,k} = P(f = f_j | e = e_i, z = k)$ is the conditional probability of translating $e$ into $f$ under a given topic indexed by $z$; and let the topic-specific monolingual model be $\\beta \\equiv \\{\\beta_k\\}$, which can be the usual LDA-style monolingual unigrams. The sentence-pairs $\\{f_n, e_n\\}$ are drawn independently from a mixture of topics. Specifically (as illustrated also in Fig. 1 (b)):\n\n1. $\\theta \\sim$ Dirichlet($\\alpha$)\n2. For each sentence-pair $(f_n, e_n)$,\n  (a) $z_n \\sim$ Multinomial($\\theta$): sample the topic,\n  (b) $e_{n,1:I_n} | z_n \\sim P(e_n | z_n; \\beta)$: sample all English words from a monolingual topic model (e.g., a unigram model),\n  (c) For each position $j_n = 1, \\ldots, J_n$ in $f_n$,\n    i. $a_{j_n} \\sim P(a_{j_n} | a_{j_n-1}; T)$: sample an alignment link $a_{j_n}$ from a first-order Markov process,\n    ii. $f_{j_n} \\sim P(f_{j_n} | e_n, a_{j_n}, z_n; B)$: sample a foreign word $f_{j_n}$ according to a topic-specific translation lexicon.\n\nUnder an HM-BiTAM model, each sentence-pair consists of a mixture of latent bilingual topics; each topic is associated with a distribution over bilingual word-pairs. Each word $f$ is generated by two hidden factors: a latent topic $z$ drawn from a document-specific distribution over K topics, and the English word $e$ identified by the hidden alignment variable $a$.\n\n3.2 Extracting Bilingual Topics from HM-BiTAM\n\nBecause of the parallel nature of the data, the topics of English and the foreign language will share similar semantic meanings. 
This assumption is captured in our model. As shown in Figure 1(b), both the English and foreign topics are sampled from the same distribution $\\theta$, which is a document-specific topic-weight vector.\n\nAlthough there is an inherent asymmetry in the bilingual topic representation in HM-BiTAM (the monolingual topic representations $\\beta$ are only defined for English, and the foreign topic representations are implicit via the topical translation models), it is not difficult to retrieve the monolingual topic representations of the foreign language via a marginalization over hidden word alignment. For example, the frequency (i.e., unigram) of foreign word $f_w$ under topic $k$ can be computed by\n\n$$P(f_w|k) = \\sum_{e} P(f_w|e, B_k)\\,P(e|\\beta_k). \\tag{3}$$\n\nAs a result, HM-BiTAM can actually be used as a bilingual topic explorer in the LDA-style and beyond. Given paired documents, it can extract the representations of each topic in both languages in a consistent fashion (which is not guaranteed if topics are extracted separately from each language using, e.g., LDA), as well as the lexical mappings under each topic, based on a maximum likelihood or Bayesian principle. 
In Section 5.2, we demonstrate outcomes of this application.\n\nWe expect that, under the HM-BiTAM model, because bilingual statistics from word alignment $a$ are shared effectively across different topics, a word will have far fewer translation candidates due to constraints by the hidden topics; therefore the topic-specific translation lexicons are much smaller and sharper, which gives rise to a more parsimonious and unambiguous translation model.\n\n4 Learning and Inference\nWe sketch a generalized mean-field approximation scheme for inferring latent variables in HM-BiTAM, and a variational EM algorithm for estimating model parameters.\n\n4.1 Variational Inference\nUnder HM-BiTAM, the complete likelihood of a document-pair (F, E) can be expressed as follows:\n\n$$p(F, E, \\theta, \\vec{z}, \\vec{a} | \\alpha, \\beta, T, B) = p(\\theta|\\alpha)\\,P(\\vec{z}|\\theta)\\,P(\\vec{a}|T)\\,P(F|\\vec{a}, \\vec{z}, E, B)\\,P(E|\\vec{z}, \\beta), \\tag{4}$$\n\nwhere $P(\\vec{a}|T) = \\prod_{n=1}^{N} \\prod_{j=1}^{J_n} P(a_{j_n}|a_{j_n-1}; T)$ represents the probability of a sequence of alignment jumps; $P(F|\\vec{a}, \\vec{z}, E, B) = \\prod_{n=1}^{N} \\prod_{j=1}^{J_n} P(f_{j_n}|a_{j_n}, e_n, z_n, B)$ is the document-level translation probability; and $P(E|\\vec{z}, \\beta)$ is the topic-conditional likelihood of the English document based on a topic-dependent unigram as used in LDA. Exact inference under this model is infeasible, as already noted for earlier models related to, but simpler than, this one [10].\n\nTo approximate the posterior $p(\\vec{a}, \\theta, \\vec{z}|F, E)$, we employ a generalized mean field approach and adopt the following factored approximation to the true posterior: $q(\\theta, \\vec{z}, \\vec{a}) = q(\\theta|\\vec{\\gamma})\\,q(\\vec{z}|\\vec{\\phi})\\,q(\\vec{a}|\\vec{\\lambda})$, where $q(\\theta|\\vec{\\gamma})$, $q(\\vec{z}|\\vec{\\phi})$, and $q(\\vec{a}|\\vec{\\lambda})$ are re-parameterized Dirichlet, multinomial, and HMM, respectively, determined by some variational parameters that correspond to the expected sufficient statistics of the dependent variables of each factor [9].\n\nAs is well known in the variational inference literature, solutions to the above variational parameters can be obtained by minimizing the Kullback-Leibler divergence between $q(\\theta, \\vec{z}, \\vec{a})$ and $p(\\theta, \\vec{z}, \\vec{a}|F, E)$, or equivalently, by optimizing the lower-bound of the expected (over $q(\\cdot)$) log-likelihood defined by Eq. (4), via a fixed-point iteration. 
Due to space limits, we forgo a detailed derivation, and directly give the fixed-point equations below:\n\n$$\\hat{\\gamma}_k = \\alpha_k + \\sum_{n=1}^{N} \\phi_{n,k}, \\tag{5}$$\n\n$$\\hat{\\phi}_{n,k} \\propto \\exp\\Big(\\Psi(\\gamma_k) - \\Psi\\big(\\sum_{k'=1}^{K} \\gamma_{k'}\\big)\\Big) \\cdot \\exp\\Big(\\sum_{i=1}^{I_n} \\sum_{j=1}^{J_n} \\lambda_{n,j,i} \\log \\beta_{k,e_{i_n}}\\Big) \\times \\exp\\Big(\\sum_{j,i=1}^{J_n,I_n} \\sum_{f \\in V_F} \\sum_{e \\in V_E} 1(f_{j_n}, f)\\,1(e_{i_n}, e)\\,\\lambda_{n,j,i} \\log B_{f,e,k}\\Big), \\tag{6}$$\n\n$$\\hat{\\lambda}_{n,j,i} \\propto \\exp\\Big(\\sum_{i'=1}^{I_n} \\lambda_{n,j-1,i'} \\log T_{i,i'}\\Big) \\times \\exp\\Big(\\sum_{i''=1}^{I_n} \\lambda_{n,j+1,i''} \\log T_{i'',i}\\Big) \\times \\exp\\Big(\\sum_{f \\in V_F} \\sum_{e \\in V_E} 1(f_{j_n}, f)\\,1(e_{i_n}, e) \\sum_{k=1}^{K} \\phi_{n,k} \\log B_{f,e,k}\\Big) \\times \\exp\\Big(\\sum_{k=1}^{K} \\phi_{n,k} \\log \\beta_{k,e_{i_n}}\\Big), \\tag{7}$$\n\nwhere $1(\\cdot, \\cdot)$ denotes an indicator function, and $\\Psi(\\cdot)$ represents the digamma function.\nThe vector $\\hat{\\phi}_n \\equiv (\\hat{\\phi}_{n,1}, \\ldots, \\hat{\\phi}_{n,K})$ given by Eq. (6) represents the approximate posterior of the topic weights for each sentence-pair $(f_n, e_n)$. The topical information for updating $\\hat{\\phi}_n$ is collected from three aspects: aligned word-pairs weighted by the corresponding topic-specific translation lexicon probabilities, topical distributions of the monolingual English language model, and the smoothing factors from the topic prior.\n\nEquation (7) gives the approximate posterior probability for alignment between the j-th word in $f_n$ and the i-th word in $e_n$, in the form of an exponential model. 
Intuitively, the first two terms represent the messages corresponding to the forward and the backward passes in HMM; the third term represents the emission probabilities, and it can be viewed as a geometric interpolation of the strengths of individual topic-specific lexicons; and the last term provides further smoothing from monolingual topic-specific aspects.\n\nInference of optimum word-alignment One of the translation model\u2019s goals is to infer the optimum word alignment: $a^{*} = \\arg\\max_{a} P(a|F, E)$. The variational inference scheme described above leads to an approximate alignment posterior $q(\\vec{a}|\\vec{\\lambda})$, which is in fact a reparameterized HMM. Thus, extracting the optimum alignment amounts to applying the Viterbi algorithm on $q(\\vec{a}|\\vec{\\lambda})$.\n\n4.2 Variational EM for parameter estimation\n\nTo estimate the HM-BiTAM parameters, which include the Dirichlet hyperparameter $\\alpha$, the transition matrix $T$, the topic-specific monolingual English unigram $\\{\\vec{\\beta}_k\\}$, and the topic-specific translation lexicon $\\{B_k\\}$, we employ a variational EM algorithm which iterates between computing the variational distribution of the hidden variables (the E-step) as described in the previous subsection, and optimizing the parameters with respect to the variational likelihood (the M-step). Here are the update equations for the M-step:\n\n$$\\hat{T}_{i'',i'} \\propto \\sum_{n=1}^{N} \\sum_{j=1}^{J_n} \\lambda_{n,j,i''}\\,\\lambda_{n,j-1,i'}, \\tag{8}$$\n\n$$B_{f,e,k} \\propto \\sum_{n=1}^{N} \\sum_{j=1}^{J_n} \\sum_{i=1}^{I_n} 1(f_{j_n}, f)\\,1(e_{i_n}, e)\\,\\lambda_{n,j,i}\\,\\phi_{n,k}, \\tag{9}$$\n\n$$\\beta_{k,e} \\propto \\sum_{n=1}^{N} \\sum_{i=1}^{I_n} \\sum_{j=1}^{J_n} 1(e_{i_n}, e)\\,\\lambda_{n,j,i}\\,\\phi_{n,k}. \\tag{10}$$\n\nFor updating the Dirichlet hyperparameter $\\alpha$, which is a corpus-level parameter, we resort to gradient ascent as in [7]. 
The overall computational complexity of the model is linear in the number of topics.\n\n5 Experiments\nIn this section, we investigate three main aspects of the HM-BiTAM model, including word alignment, bilingual topic exploration, and machine translation.\n\nTrain              #Doc.    #Sent.    #Tokens (English)   #Tokens (Chinese)\nTreeBank             316      4172            133,598             105,331\nSinorama04          6367    282176         10,321,061          10,027,095\nSinorama02          2373    103252          3,810,664           3,146,014\nChnews.2005         1001     10317            326,347             270,274\nFBIS.BEIJING        6111     99396          4,199,030           3,527,786\nXinHua.NewsStory   17260     98444          3,807,884           3,915,267\nALL               33,428   597,757         22,598,584          20,991,767\n\nTable 1: Training data statistics.\n\nThe training data is a collection of parallel document-pairs, with document boundaries explicitly given. As shown in Table 1, our training corpora are general newswire, covering topics mainly about economics, politics, education and sports. For word-alignment evaluation, our test set consists of 95 document-pairs, with 627 manually-aligned sentence-pairs and 14,769 alignment-links in total, from TIDES\u201901 dryrun data. Word segmentations and tokenizations were fixed manually for optimal word-alignment decisions. This test set contains relatively long sentence-pairs, with an average sentence length of 40.67 words. The long sentences introduce more ambiguities for alignment tasks.\n\nFor testing translation quality, TIDES\u201902 MT evaluation data is used as development data, and ten documents from TIDES\u201904 MT-evaluation are used as the unseen test data. BLEU scores are reported to evaluate translation quality with HM-BiTAM models.\n\n5.1 Empirical Validation\nWord Alignment Accuracy We trained HM-BiTAMs with ten topics using parallel corpora of sizes ranging from 6M to 22.6M words; we used the F-measure, the harmonic mean of precision and recall, to evaluate word-alignment accuracy. 
Following the same logic as for all BiTAMs in [10], we choose the HM-BiTAM in which topics are sampled at word-pair level rather than sentence-pair level. The baseline IBM models were trained using a 1^8 h^5 4^3 scheme (eight iterations for IBM Model-1, five iterations for HMM, and three iterations for IBM Model-4, with deficient EM: the normalization factor is computed using a sampled alignment neighborhood in the E-step). Refined alignments are obtained from both directions of baseline models in the same way as described in [5].\n\nFigure 2: Alignment accuracy (F-measure) of different models (HMM, BiTAM, IBM-4, HM-BiTAM) trained on corpora of different sizes (6M, 11M, and 22.6M words).\n\nFigure 3: Comparison of likelihoods of data under different models, plotted as negative log-likelihood per document. Top: HM-BiTAM (y-axis) vs. IBM Model-4 with deficient EM (x-axis); bottom: HM-BiTAM (y-axis) vs. HMM with forward-backward EM (x-axis).\n\nFigure 2 shows the alignment accuracies of HM-BiTAM, in comparison with those of the baseline HMM, the baseline BiTAM, and IBM Model-4. Overall, HM-BiTAM gives significantly better F-measures than HMM, with absolute margins of 7.56%, 5.72% and 6.91% on training sizes of 6 M, 11 M and 22.6 M words, respectively. 
In HM-BiTAM, two factors contribute to narrowing down the word-alignment decisions: the position and the lexical mapping. The position part is the same as in the baseline HMM, implementing the \u201cproximity-bias\u201d. The emission lexical probability, however, is different: each state is a mixture of topic-specific translation lexicons, whose weights are inferred using document contexts. The topic-specific translation lexicons are sharper and smaller than the global one used in HMM. Thus the improvements of HM-BiTAM over HMM essentially result from the extended topic-admixture lexicons. Not surprisingly, HM-BiTAM also outperforms the baseline BiTAM significantly, because BiTAM captures only the topical aspects and ignores the proximity bias.\n\nNotably, HM-BiTAM also outperforms IBM Model-4 by margins of 3.43%, 3.64% and 2.73%, respectively. Overall, with 22.6 M words, HM-BiTAM outperforms HMM, BiTAM, and IBM-4 significantly, with p = 0.0031, 0.0079, and 0.0121, respectively. IBM Model-4 already integrates the fertility and distortion submodels on top of HMM, which further narrows the word-alignment choices. However, IBM Model-4 does not have a scheme to adjust its lexicon probabilities specific to document topical-context as in HM-BiTAM. In a way, HM-BiTAM wins over IBM-4 by leveraging topic models that capture the document context.\nLikelihood on Training and Unseen Documents Figure 3 shows comparisons of the likelihoods of document-pairs in the training set under HM-BiTAM with those under IBM Model-4 or HMM. Each point in the figure represents one document-pair; the y-coordinate corresponds to the negative log-likelihood under HM-BiTAM, and the x-coordinate gives the counterparts under IBM Model-4 or HMM. 
Overall the likelihoods under HM-BiTAM are significantly better than those under HMM and IBM Model-4, revealing the better modeling power of HM-BiTAM.\n\nWe also applied HM-BiTAM to ten document-pairs selected from MT04, which were not included in the training. These document-pairs contain long sentences and diverse topics. As shown in Table 2, the likelihoods of HM-BiTAM on these unseen data are significantly higher than those of HMM, BiTAM, and IBM Models in every case, confirming that HM-BiTAM indeed offers a better fit and generalizability for the bilingual document-pairs.\n\nPublishers           Genre        IBM-1      HMM        IBM-4      BiTAM      HM-BiTAM\nAgenceFrance(AFP)    news        -3752.94   -3388.72   -3448.28   -3602.28   -3188.90\nAgenceFrance(AFP)    news        -3341.69   -2899.93   -3005.80   -3139.95   -2595.72\nAgenceFrance(AFP)    news        -2527.32   -2124.75   -2161.31   -2323.11   -2063.69\nForeignMinistryPRC   speech      -2313.28   -1913.29   -1963.24   -2144.12   -1669.22\nHongKongNews         speech      -2198.13   -1822.25   -1890.81   -2035      -1423.84\nPeople\u2019s Daily       editorial   -2485.08   -2094.90   -2184.23   -2377.1    -1867.13\nUnited Nation        speech      -2134.34   -1755.11   -1821.29   -1949.39   -1431.16\nXinHua News          news        -2425.09   -2030.57   -2114.39   -2192.9    -1991.31\nXinHua News          news        -2684.85   -2326.39   -2352.62   -2527.78   -2317.47\nZaoBao News          editorial   -2376.12   -2047.55   -2116.42   -2235.79   -1943.25\nAvg. Perplexity                   123.83     60.54      68.41      107.57     43.71\n\nTable 2: Likelihoods of unseen documents under HM-BiTAMs, in comparison with competing models.\n\n5.2 Application 1: Bilingual Topic Extraction\nMonolingual topics: HM-BiTAM facilitates inference of the latent LDA-style representations of topics [1] in both English and the foreign language (i.e., Chinese) from a given bilingual corpus. The English topics (represented by the topic-specific word frequencies) can be directly read off from HM-BiTAM parameters $\\beta$. 
As discussed in \u00a7 3.2, even though the topic-specific distributions of words in the Chinese corpora are not directly encoded in HM-BiTAM, one can marginalize over alignments of the parallel data to synthesize them based on the monolingual English topics and the topic-specific lexical mapping from English to Chinese.\n\nFigure 4 shows five topics, in both English and Chinese, learned via HM-BiTAM. The top-ranked frequent words in each topic exhibit coherent semantic meanings; and there are also consistencies between the word semantics under the same topic indexes across languages. Under HM-BiTAM, the two respective monolingual word-distributions for the same topic are statistically coupled due to sharing of the same topic for each sentence-pair in the two languages. If one merely applies LDA to the corpora in each language separately, such coupling cannot be exploited. This coupling enforces consistency between the topics across languages. However, as with general clustering algorithms, topics in HM-BiTAM do not necessarily carry obvious semantic labels.\n\nEnglish glosses of the top-ranked Chinese words per topic in Figure 4:\n\u201csports\u201d: people, handicapped, sports, career, water, world, region, Xinhua, team member, reporter\n\u201chousing\u201d: house, house, JiuJiang, construction, Macao, Yuan, workers, current, national, province\n\u201cstocks\u201d: Shenzhen, Shen Zhen, Singapore, Yuan, stock, Hong Kong, state-owned, foreign investment, Xinhua, refinancing\n\u201cenergy\u201d: company, gas, two, countries, U.S., reporters, relations, Russian, France, ChongQing\n\u201ctakeover\u201d: countries, ChongQing, factory, TianJin, government, project, national, Shenzhen, take over, buy\n\nFigure 4: Monolingual topics of both languages learned from parallel data. It appears that the English topics (on the left panel) are highly parallel to the Chinese ones (annotated with English gloss, on the right panel).\n\nTopic-Specific Lexicon Mapping: Table 3 shows two examples of topic-specific lexicon mapping learned by HM-BiTAM. Given a topic assignment, a word usually has far fewer translation candidates, and the topic-specific translation lexicons are generally much smaller and sharper. Different topic-specific lexicons emphasize different aspects of translating the same source words, which cannot be captured by the IBM models or HMM. 
This effect can be observed from Table 3.\n\nTopics        \u201cmeet\u201d: Meaning   Probability   \u201cpower\u201d: Meaning      Probability\nTopic-1       sports meeting    0.508544      electric power        0.565666\nTopic-2       to satisfy        0.160218      electricity factory   0.656\nTopic-3       to adapt          0.921168      to be relevant        0.985341\nTopic-4       to adjust         0.996929      strength              0.410503\nTopic-5       to see someone    0.693673      strength              0.997586\nTopic-6       -                 -             -                     -\nTopic-7       to satisfy        0.467555      Electric watt         0.613711\nTopic-8       sports meeting    0.487728      power                 1.0\nTopic-9       -                 -             to generate           0.50457\nTopic-10      to see someone    0.551466      strength              1.0\nIBM Model-1   sports meeting    0.590271      power plant           0.314349\nHMM           sports meeting    0.72204       strength              0.51491\nIBM Model-4   sports meeting    0.608391      strength              0.506258\n\nTable 3: Topic-specific translation lexicons learned by HM-BiTAM. We show the top candidate (TopCand) lexicon mappings of \u201cmeet\u201d and \u201cpower\u201d under ten topics. (The symbol \u201c-\u201d means absence of a significant lexicon mapping under that topic.) Also shown are the semantic meanings of the mapped Chinese words, and the mapping probability $p(f|e, k)$.\n\n5.3 Application 2: Machine Translation\nThe parallelism of topic-assignment between languages modeled by HM-BiTAM, as shown in \u00a7 3.2 and exemplified in Fig. 4, enables a natural way of improving translation by exploiting semantic consistency and contextual coherency more explicitly and aggressively. Under HM-BiTAM, given a source document $D_F$, the predictive probability distribution of candidate translations of every source word, $P(e|f, D_F)$, must be computed by mixing multiple topic-specific translation lexicons according to the topic weights $p(z|D_F)$ determined from monolingual context in $D_F$. 
That is:\n\n$$P(e|f, D_F) \\propto P(f|e, D_F)\\,P(e|D_F) = \\sum_{k=1}^{K} P(f|e, z=k)\\,P(e|z=k)\\,P(z=k|D_F). \\tag{11}$$\n\nWe used $p(e|f, D_F)$ to score the bilingual phrase-pairs in a state-of-the-art GALE translation system trained with 250 M words. We kept all other parameters the same as those used in the baseline. Then decoding of the unseen ten MT04 documents in Table 2 was carried out.\n\nSystems        1-gram   2-gram   3-gram   4-gram   BLEUr4n4\nHiero Sys.     73.92    40.57    23.21    13.84    30.70\nGale Sys.      75.63    42.71    25.00    14.30    32.78\nHM-BiTAM       76.77    42.99    25.42    14.56    33.19\nGround Truth   76.10    43.85    26.70    15.73    34.17\n\nTable 4: Decoding MT04 10-documents. Experiments using the topic assignments inferred from ground truth and the ones inferred via HM-BiTAM; n-gram precisions together with final BLEUr4n4 scores are evaluated.\n\nTable 4 shows the performance of our in-house Hiero system (following [3]), the state-of-the-art Gale-baseline (with a better BLEU score), and our HM-BiTAM model, on the NIST MT04 test set. If the ground-truth translation is used to infer the topic-weights, the improvement is from 32.78 to 34.17 BLEU points. With topical inference from HM-BiTAM using the monolingual source document, improved n-gram precisions in the translation were observed from 1-gram to 4-gram. The largest improved precision is for unigram: from 75.63% to 76.77%. Intuitively, unigrams have potentially more ambiguities for translations than the higher-order n-grams, because the latter already encode contextual information. 
The overall BLEU score of HM-BiTAM improves over the other systems, including the state-of-the-art Gale baseline (from 32.78 to 33.19), a slight improvement with p = 0.043.\n\n6 Discussion and Conclusion\nWe presented a novel framework, HM-BiTAM, for exploring bilingual topics and generalizing over the traditional HMM for improved word-alignment accuracy and translation quality. A variational inference and learning procedure was developed for efficient training and application in translation. We demonstrated significant improvement of word-alignment accuracy over a number of existing systems, and the interesting capability of HM-BiTAM to simultaneously extract coherent monolingual topics from both languages. We also report encouraging improvement of translation quality over current benchmarks; although the margin is modest, it is noteworthy that the current version of HM-BiTAM remains a purely autonomously trained system. Future work includes extensions with more structures for word alignment, such as noun phrase chunking.\n\nReferences\n\n[1] David Blei, Andrew Ng, and Michael I. Jordan. Latent Dirichlet allocation. In Journal of Machine Learning Research, volume 3, pages 1107\u20131135, 2003.\n\n[2] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. In Computational Linguistics, volume 19(2), pages 263\u2013331, 1993.\n\n[3] David Chiang. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL\u201905), pages 263\u2013270, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.\n\n[4] Elena Erosheva, Steve Fienberg, and John Lafferty. Mixed membership models of scientific publications. In Proceedings of the National Academy of Sciences, volume 101 of Suppl. 1, April 6 2004.\n\n[5] Franz J. 
Och and Hermann Ney. The alignment template approach to statistical machine translation. In Computational Linguistics, volume 30, pages 417\u2013449, 2004.\n\n[6] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. In Genetics, volume 155, pages 945\u2013959, 2000.\n\n[7] K. Sj\u00f6lander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I.S. Mian, and D. Haussler. Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, 12, 1996.\n\n[8] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM based word alignment in statistical machine translation. In Proc. The 16th Int. Conf. on Computational Linguistics (Coling\u201996), pages 836\u2013841, Copenhagen, Denmark, 1996.\n\n[9] Eric P. Xing, M.I. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Meek and Kj\u00e6rulff, editors, Uncertainty in Artificial Intelligence (UAI2003), pages 583\u2013591. Morgan Kaufmann Publishers, 2003.\n\n[10] Bing Zhao and Eric P. Xing. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL\u201906), 2006.", "award": [], "sourceid": 188, "authors": [{"given_name": "Bing", "family_name": "Zhao", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}