{"title": "Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages", "book": "Advances in Neural Information Processing Systems", "page_first": 8653, "page_last": 8663, "abstract": "Multilingual topic models can reveal patterns in cross-lingual document collections. However, existing models lack speed and interactivity, which prevents adoption in everyday corpora exploration or quick moving situations (e.g., natural disasters, political instability). First, we propose a multilingual anchoring algorithm that builds an anchor-based topic model for documents in different languages. Then, we incorporate interactivity to develop MTAnchor (Multilingual Topic Anchors), a system that allows users to refine the topic model. We test our algorithms on labeled English, Chinese, and Sinhalese documents. Within minutes, our methods can produce interpretable topics that are useful for specific classification tasks.", "full_text": "Multilingual Anchoring: Interactive Topic Modeling\nand Alignment Across Languages\n\nMichelle Yuan\nUniversity of Maryland\nmyuan@cs.umd.edu\n\nBenjamin Van Durme\nJohns Hopkins University\nvandurme@jhu.edu\n\nJordan Boyd-Graber\nUniversity of Maryland\njbg@umiacs.umd.edu\n\nAbstract\n\nMultilingual topic models can reveal patterns in cross-lingual document collections.\nHowever, existing models lack speed and interactivity, which prevents adoption in\neveryday corpora exploration or quick moving situations (e.g., natural disasters,\npolitical instability). First, we propose a multilingual anchoring algorithm that\nbuilds an anchor-based topic model for documents in different languages. Then,\nwe incorporate interactivity to develop MTAnchor (Multilingual Topic Anchors),\na system that allows users to re\ufb01ne the topic model. We test our algorithms on\nlabeled English, Chinese, and Sinhalese documents.\n
Within minutes, our methods\ncan produce interpretable topics that are useful for speci\ufb01c classi\ufb01cation tasks.\n\n1 Introduction: Exploring multilingual document collections\n\nModeling multilingual topics aids exploration of large corpora across languages [1]. These models\nhelp align topics cross-lingually and uncover latent relationships between languages, such as\nobserving the differences in describing economic issues between English and Spanish speakers [2].\nIncorporating multilingual information also forms better monolingual topics [3].\nMultilingual topic models usually depend on some resource to bridge languages. These resources\ninclude word alignments [4], dictionaries [3, 5], topic alignments in documents [6], or all of the\nabove [7]. Existing multilingual models have several shortcomings: they assume extensive knowledge\nabout languages, preclude human re\ufb01nement, and are slow. Thus, a topic model may not be\nappropriate in emergent situations involving low-resource languages when time is of the essence, e.g.,\nwhen relief workers must triage relief messages in Haitian Creole [8].\nBeyond these practical concerns, adding interactivity to topic modeling allows machine learning\nnon-experts to build models better suited to their needs [9\u201311]. One way to quickly incorporate human\nknowledge into the model is through anchor words [12]. Inference in anchor-based topic models is\ndriven by anchors, which are words that have high probability in one topic and low probability in\nthe remaining topics [13, 14]. The anchoring algorithm scales with the number of unique word types,\nmaking it fast enough for interactive updates.\nWe present two contributions for modeling multilingual topics.\n
First, we develop a multilingual\nanchoring algorithm, which is an extension to anchor-based topic inference for comparable corpora.\u00b9\nSecond, we introduce MTAnchor, a human-in-the-loop system that uses multilingual anchoring\nto align topics and enables users to make further adjustments to the model.\u00b2 Through interaction,\nthe model produces interpretable, low-dimensional representations of documents. These vector\nrepresentations improve intra-lingual or cross-lingual text classi\ufb01cation. The topic model generates\ncoherent topic alignments for comparable corpora because users themselves align topics.\n\n\u00b9 Comparable corpora across languages are collections of documents about the same themes but that are not\ntranslations. Compared to more typical parallel data [15, 16], comparable data are more challenging.\n\n\u00b2 http://github.com/forest-snow/mtanchor_demo.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Anchor-based topic models\n\nA topic model discovers topics, each a distribution over words that evinces a coherent theme [17].\nWell-known methods for constructing topic models are latent Dirichlet allocation [18, LDA] and\nlatent semantic analysis [19, LSA]. Another computationally attractive option is the anchor word\nalgorithm [13] that uses the row-normalized word co-occurrence matrix $\\bar{Q}$, where $\\bar{Q}_{i,j} = p(w_2 = j \\mid w_1 = i)$.\nThe vector $\\bar{Q}_i$ is the $i$th row of $\\bar{Q}$ and represents the conditional distribution of words in\na document given that word $i$ has occurred. An anchor word $s$ appears with high probability in only\none topic, so $\\bar{Q}_s$ resembles a topic\u2019s word distribution in topic models like LDA. For example, if\n\u201cconcealer\u201d is an anchor word for a cosmetics topic, then its conditional distribution will have high\nprobability for cosmetics-related words and low probability for other words.\n
Still, these are not the\ndistributions that typically de\ufb01ne probabilistic topic models: the probability of a word given a topic.\n\n2.1 Anchoring\n\nTo discover topic distributions, anchor word approaches [14] search for coef\ufb01cients that describe\nnon-anchor words\u2019 document contexts with anchor words\u2019 conditional distributions. For example, the word \u201cliner\u201d\nhas meanings that are explained by \u201calbum\u201d in a music topic, \u201cconcealer\u201d in a cosmetics topic, and\n\u201ccarburetor\u201d in an automotive topic. Then, the conditional distribution of \u201cliner\u201d can be expressed\nas a convex combination of the conditional distributions of \u201calbum\u201d, \u201cconcealer\u201d, and \u201ccarburetor\u201d.\nGiven anchor words $s_1, \\ldots, s_K$, the conditional distribution of word $i$ can be approximated as\n\n$$\\bar{Q}_i \\approx \\sum_{k=1}^{K} C_{i,k} \\bar{Q}_{s_k} \\quad \\text{subject to} \\quad \\sum_{k=1}^{K} C_{i,k} = 1 \\text{ and } C_{i,k} \\geq 0. \\tag{1}$$\n\nThe coef\ufb01cient $C_{i,k}$ represents $p(z = k \\mid w = i)$, the probability of topic $k$ given a word $i$. These\ncoef\ufb01cients are recovered using the RecoverL2 algorithm [14], which minimizes the quadratic loss\nbetween $\\bar{Q}_i$ and $\\sum_{k=1}^{K} C_{i,k} \\bar{Q}_{s_k}$. Using Bayes\u2019 rule, we can obtain the standard topic matrix $A$,\n\n$$A_{i,k} = p(w = i \\mid z = k) \\propto p(z = k \\mid w = i) \\, p(w = i) = C_{i,k} \\sum_{j=1}^{V} \\bar{Q}_{i,j}. \\tag{2}$$\n\nFor a large vocabulary size $V$, \ufb01nding these anchor words is a challenge, but understanding the\ngeometric intuition behind the anchoring algorithm can help us select the right words. Points inside a\nconvex hull are expressed as a convex combination of its vertices. If we want to approximate $\\bar{Q}_i$\nas the convex combination of $\\bar{Q}_{s_1}, \\ldots, \\bar{Q}_{s_K}$ (Equation 1), then $\\bar{Q}_{s_1}, \\ldots, \\bar{Q}_{s_K}$ should be the vertices\nof the convex hull of $\\bar{Q}$. However, \ufb01nding the vertices of a $V$-dimensional convex hull is\ntime-consuming [13].\n
Instead, Arora et al. [14] use FastAnchorWords, a greedy approach similar to\nGram-Schmidt orthogonalization, to construct an approximate convex hull of $\\bar{Q}$ and expand it as\nmuch as possible with each choice of anchor word. Other methods include projecting $\\bar{Q}$ to a\nlow-dimensional space and \ufb01nding the vertices of its exact convex hull [20], adding another dimension to\ncapture metadata [21], or \ufb01nding nonparametric anchor words [22].\n\n2.2 Multiword anchoring\n\nFinding topics in anchor-based models is fast, so it can be used in an interactive setting where users\niteratively choose anchor words for every topic [12]. Nevertheless, users may want to choose multiple\nanchor words for a topic, such as selecting both \u201cconcealer\u201d and \u201clipstick\u201d for a cosmetics topic.\nTherefore, Lund et al. [12] propose multiword anchoring: users select a set $\\mathcal{G}_k$ of multiple anchor\nwords for topic $k$. After users select $\\mathcal{G}_1, \\ldots, \\mathcal{G}_K$, $\\bar{Q}$ is augmented so that new rows $\\bar{Q}_{V+1}, \\ldots, \\bar{Q}_{V+K}$\nrepresent these pseudo-anchors in the conditional word co-occurrence space. Lund et al. [12] construct\nthese vectors $\\bar{Q}_{V+k}$ as\n\n$$\\bar{Q}_{V+k,j} = \\left( \\frac{\\sum_{i \\in \\mathcal{G}_k} \\bar{Q}_{i,j}^{-1}}{|\\mathcal{G}_k|} \\right)^{-1}. \\tag{3}$$\n\nThe motivation for using the harmonic mean (Equation 3) is that the function can centralize input\nvalues and ignore large outliers. Finding topics follows the same algorithm as before using single word\nanchors. Instead of modeling $\\bar{Q}_i$ as the convex combination of $\\bar{Q}_{s_1}, \\ldots, \\bar{Q}_{s_K}$, a convex combination\nof $\\bar{Q}_{V+1}, \\ldots, \\bar{Q}_{V+K}$ models $\\bar{Q}_i$ with minimal quadratic loss.\n\nFigure 1: Visualizing the importance of choice in anchor words for approximating conditional\ndistributions. The chosen anchor words are the black dots and their span is the white triangle.\n
On the\nleft, the span of anchor words is small, so the words \u201cmelody\u201d and \u201cliner\u201d are too close together. On\nthe right, the span of anchor words is large, so the conditional distributions of words \u201cmelody\u201d and\n\u201cliner\u201d are approximated more accurately.\n\n3 Bridging languages: How do you say anchor in Chinese?\n\nAnchor-based topic models are well-de\ufb01ned for individual languages, but a multilingual model\nrequires topics that are thematically connected across languages. Discovering two separate sets of\nanchor words does not suf\ufb01ce. In this section, we propose multilingual anchoring as an algorithm to\ncross-lingually link topics and their corresponding anchor words.\nFirst, we can connect anchor words across languages as anchor links. For example, \u201canchor\u201d may be\nlinked to \u201c\u9328(m\u00e1o)\u201d in Chinese under a nautical context. After anchor words are linked, all words in\nthe same topic across languages will form a coherent multilingual topic. A straightforward way to\nlink words across languages is through a dictionary, much as a human would. Just as possessing a\nChinese dictionary does not enable someone to speak Chinese, a dictionary does not magically create\nmultilingual topics. To construct an overall coherent model, anchor links should be carefully selected.\nWe de\ufb01ne these links in more detail. A language $L$ is a set of word types $w$. A bilingual dictionary $B$\nis a subset of the Cartesian product $L^{(1)} \\times L^{(2)}$, where $L^{(1)}, L^{(2)}$ are two different languages. An\nelement $(w^{(1)}, w^{(2)})$ of $B$ represents a dictionary entry where words $w^{(1)} \\in L^{(1)}$ and $w^{(2)} \\in L^{(2)}$\nare translations of each other. While $B$ is a binary relation, it is not necessarily a function. Other\nmultilingual topic models require that the dictionary is a one-to-one correspondence [3, 23, 2].\n
We relax this restriction on $B$ to extract as much information from the dictionary as possible.\nWe could select anchor words $s_1, \\ldots, s_K$ independently for each language by considering all\nwords $w^{(1)} \\in L^{(1)}$ and $w^{(2)} \\in L^{(2)}$ as possible candidates for anchors (e.g., independent runs\nof the anchoring algorithm). Instead, we want to jointly choose anchor words for both languages. First,\nwe use dictionary entries to create links between words. Then, we choose anchor words $s_k^{(1)}$ for\nLanguage 1 and $s_k^{(2)}$ for Language 2 such that $s_k^{(1)}$ and $s_k^{(2)}$ are linked. Through this process, we\nobtain a set of $K$ anchor words for each language and can obtain topics using RecoverL2 [14].\n\n3.1 Multilingual anchoring\n\nIf each topic had only one possible anchor word, our goal of building a coherent multilingual topic\nmodel would fail: any imperfection in the dictionary would scupper the topic model. Fortunately,\nArora et al. [14] assert that there exist many anchor word choices for a topic. Even if we reduce the\npool of candidate anchors, we can still \ufb01nd suitable anchor words for each topic. Recall that anchor\nwords are the vertices of the convex hull of words in the conditional distribution space (Section 2).\nFinding the actual vertices of the convex hulls is too expensive, so FastAnchorWords searches for a set\nof anchors with maximal span. This span should approximate the convex hull of $\\bar{Q}$. Without a large\nenough span, we can never \ufb01nd accurate approximations for words in the conditional distribution\nspace: all words $w$ will have indistinguishable conditional distributions (Figure 1). As a result, every\ntopic will have indistinct word distributions and the resulting topics will be copies of one another.\n\nFigure 2: Selecting anchor links for multilingual anchoring.\n
The purple (blue) area represents the\nconditional distribution space of words in the English (Chinese) corpus. The white triangle designates\nthe space spanned by chosen anchor words. Dashed lines depict anchor links across spaces. Black\npoints denote words already chosen as anchors, white points are unchosen words, and pink stars are\nthe optimal anchors for the current iteration. Multilingual anchors should maximize the area spanned\nby white triangles in both spaces. Annotations in the figure: \u201cBourgeoisie\u201d does not have a Chinese translation, so it cannot be picked as an anchor word even if it is the farthest word from the convex hull. The Chinese translation of \u201ccosmopolitan\u201d is the point farthest away from the Chinese convex hull, but \u201ccosmopolitan\u201d is too close to the English convex hull, thereby eliminating them as anchor word choices. \u201cForest\u201d and its translation \u201c\u68ee\u6797(s\u0113nl\u00edn)\u201d are not the furthest points from their respective convex hulls, but neither is too close. So, they are chosen as the next anchor words.\n\nTo maximize the span of anchor words, FastAnchorWords [14] chooses anchor word $s_k$ such that\n\n$$s_k = \\arg\\max_{w} \\, d\\left( \\operatorname{span}\\left( \\bar{Q}_{s_1}, \\ldots, \\bar{Q}_{s_{k-1}} \\right), \\bar{Q}_w \\right), \\tag{4}$$\n\nwhere $d(P, i)$ is de\ufb01ned as the Euclidean distance from point $i$ to subspace $P$, or the norm of the\nprojection of $i$ onto the orthogonal complement of $P$.\nTo extend the greedy approach to multilingual settings, we need anchor words that can guide topic\ninference in multiple languages. This motivates our approach for linking words with a dictionary.\nBy choosing linked anchor words, the algorithm can align topics cross-lingually so that the aligned\ntopics form one multilingual topic. However, randomly choosing translation pairs as anchor links\nwill not produce coherent multilingual topics. We need multilingual anchors that also inherit the\ngeometric properties of monolingual anchors. So, the span of anchor words should be maximized\nin both languages for optimal topic inference. To clearly state our objective, we de\ufb01ne $P_j^{(l)}$ as the\nsubspace spanned by $j$ chosen anchor words in the conditional distribution space of language $l$,\n\n$$P_j^{(l)} = \\operatorname{span}\\left( \\bar{Q}^{(l)}_{s_1^{(l)}}, \\ldots, \\bar{Q}^{(l)}_{s_j^{(l)}} \\right). \\tag{5}$$\n\nWord $w$ is a good choice of a $k$th anchor if $\\bar{Q}_w$ is far enough from $P_{k-1}^{(l)}$ so that having $\\bar{Q}_w$ as an\nadditional vertex can greatly expand the span of anchors. A word might be a great choice for an anchor\nin one language, but we cannot select it if its translation is a poor choice for the other language\n(Figure 2). We need to pick linked words $w \\in L^{(1)}$ and $v \\in L^{(2)}$ such that $w$ is far from $P_{k-1}^{(1)}$ and $v$\nis also far away from $P_{k-1}^{(2)}$. Then, adding $w$ and $v$ as anchor words can increase the total span of the anchor\nword set in both languages. Using this intuition, we maximize the lower bound on the distance from\nanchor words to $P_{k-1}^{(1)}$ and $P_{k-1}^{(2)}$. We select anchor words $w$ and $v$ such that\n\n$$s_k^{(1)}, s_k^{(2)} = \\arg\\max_{w, v} \\, \\min\\left\\{ d\\left( P_{k-1}^{(1)}, \\bar{Q}^{(1)}_w \\right), d\\left( P_{k-1}^{(2)}, \\bar{Q}^{(2)}_v \\right) \\right\\} \\quad \\text{subject to} \\quad (w, v) \\in B. \\tag{6}$$\n\nWe greedily select anchors $s_k^{(1)} \\in L^{(1)}$, $s_k^{(2)} \\in L^{(2)}$ such that Equation 6 is satis\ufb01ed on every\niteration $k$. Words with multiple translations are elegantly addressed: if an anchor word $w$ is\npicked already, then it is not likely to be picked again. The algorithm expands both convex hulls\nsimultaneously with each iteration. Indeed, more translations aid our anchor search because there\nwill be more linked anchors to choose from. Even if the algorithm chooses anchor words similar in\nmeaning within the same language, interactivity can help remove duplicate topics (Section 3.2). After\npicking a set of anchor words for each language, multilingual anchoring follows FastAnchorWords\n(Section 2.1). Topic matrices $A^{(1)}$ and $A^{(2)}$ are separately recovered (Equations 1, 2). These matrices\nare the output of multilingual anchoring. In the next sections, we show how MTAnchor further\nupdates $A^{(1)}$ and $A^{(2)}$ based on human feedback.\n\nFigure 3: The user interface for exploring topics in English and Chinese documents. Anchor words\nare in the center, while the most likely words for each topic are on the left and right sides of the\ninterface. The user can drag words from the side and add them as anchor words. When the user\nhovers over \u201c\u4e9e\u7a2e(y\u00e0zh\u01d2ng)\u201d, its translation, \u201csubspecies\u201d, appears at the bottom of the screen.\nWhen the user presses on the word, all occurrences of it and its translation are highlighted in yellow.\nUsers can type words in the \u201cSearch words\u201d box to \ufb01nd which words are in the vocabulary. These\nfeatures help the user explore topics in an unfamiliar language.\n\nLacking dictionary entries. If dictionary entries are scarce, then we cannot constrain the anchor\nwords to only be words from the dictionary. So, we independently \ufb01nd anchor words for each language\nusing RecoverL2. This reduction to monolingual settings resembles other cross-lingual models:\nJointLDA reduces to LDA and PTLDA reduces to TLDA when there are no dictionary entries [3, 7].\n\nPredicting labels from topics.\n
Multilingual anchoring is an unsupervised method, but the topic\ndistribution acts as a low-dimensional representation for each document [24\u201326]. To infer the topic\ndistribution of documents, we pass in the topic matrices as inputs into variational inference [18],\nwhere topic variational parameter \u03b2 is \ufb01xed and only document variational parameter \u03b3 is \ufb01tted.\nThen, we train a linear SVM on the topic distributions of documents [27] to classify document labels.\n\n3.2 Interactive topic alignment\n\nMultilingual anchoring uses translations to \ufb01nd anchor words that can lead to better topics for both\nlanguages. However, we cannot completely rely on dictionary entries to construct the topic model.\nIn reality, translations may not be available, could be a poor \ufb01t for the dataset, or might be wrong.\nIn addition to problems with the dictionary, the data may be too noisy, or the anchoring algorithm\nmay return a topic model unsuited for our needs (e.g., if a user needs to separate news from opinion and\nthe topic model puts them together). Thus, we incorporate interactivity into MTAnchor so that we\ncan extract linguistic and cultural knowledge from humans.\nFirst, MTAnchor takes in comparable corpora and a bilingual dictionary as inputs. Next, it uses\nmultilingual anchoring (Section 3.1) to \ufb01nd sets of anchor words for each language. After the\nalgorithm recovers topic matrices, the interface shows information about the topic model. The user\ncan press on the red \u201cX\u201d to delete any incoherent or duplicate topics (Figure 3). The user can also\nadd new topics by pressing on \u201cAdd Topics\u201d. The interface will create a new blank row beneath the\nexisting topics. Then, the user can add words as anchors to the new topic. These features are similar\nto the ones used for interactively modeling monolingual topics [12].\nOnce the user \ufb01nishes choosing anchor words for each topic, they press \u201cUpdate Topics\u201d. This\nis a signal for MTAnchor to retrieve new anchor words from the interface and run multiword\nanchoring (Section 2.2). The algorithm approximates $\\bar{Q}_w$ for every word $w$ in the vocabulary and\nthen recomputes the topic matrices for each language. When MTAnchor \ufb01nds new topics, the user\ncan see the updated topics on the interface. At this point, anchors no longer have to be linked by\ndictionary entries because MTAnchor does not select anchors based on Equation 6. After the initial\nalignment, users de\ufb01ne anchors and customize the topic model to their own needs.\n\nTable 1: Comparison of multilingual topic modeling methods. Multilingual anchoring scores higher\nin classi\ufb01cation accuracy and topic coherence than MCTA. MTAnchor does as well as multilingual\nanchoring on average, but a few users can achieve the best results for every metric.\n\nDataset | Method | Accuracy EN-I | Accuracy ZH-I/SI-I | Accuracy EN-C | Accuracy ZH-C/SI-C | Coherence EN-I | Coherence ZH-I/SI-I | Coherence EN-E | Coherence ZH-E/SI-E\nWikipedia (EN-ZH) | Multilingual anchoring | 69.49% | 71.24% | 50.37% | 47.76% | 0.141 | 0.178 | 0.084 | 0.128\nWikipedia (EN-ZH) | MTAnchor (maximum) | 80.71% | 75.33% | 57.62% | 54.54% | 0.195 | 0.198 | 0.103 | 0.147\nWikipedia (EN-ZH) | MTAnchor (median) | 69.49% | 71.44% | 50.27% | 47.22% | 0.141 | 0.178 | 0.084 | 0.129\nWikipedia (EN-ZH) | MCTA | 51.56% | 33.35% | 23.24% | 39.79% | 0.126 | 0.085 | 0.000 | 0.037\nAmazon (EN-ZH) | Multilingual anchoring | 59.79% | 61.10% | 51.73% | 53.20% | 0.069 | 0.061 | 0.031 | 0.045\nAmazon (EN-ZH) | MCTA | 49.53% | 50.64% | 50.27% | 49.49% | -0.028 | 0.019 | 0.017 | 0.011\nLORELEI (EN-SI) | Multilingual anchoring | 20.78% | 32.65% | 24.49% | 24.68% | 0.077 | 0.000 | 0.025 | n/a\nLORELEI (EN-SI) | MCTA | 12.99% | 26.53% | 4.08% | 15.58% | 0.132 | 0.000 | 0.036 | n/a\n\n4 Experiments\n\nThe \ufb01rst dataset consists of Wikipedia articles: 11,043 in English and 10,135 in Chinese.\n
We shorten the articles to contain no more than three sections. We lemmatize the English articles using WordNet\nLemmatizer [28] and segment the Chinese articles using Stanford CoreNLP [29]. For both languages,\nthe articles fall under one of six categories: \ufb01lm, music, animals, politics, religion, and food.\nAnother dataset consists of Amazon reviews: 53,558 in English and 53,160 in Chinese (mostly from\nTaiwan) [30]. Each review has a rating, ranging from one to \ufb01ve. Since about half of the reviews\nhave a rating of \ufb01ve, we change the classi\ufb01cation task to a binary problem by labeling reviews with a\nrating of \ufb01ve as \u201c1\u201d and the rest as \u201c0\u201d. For the Wikipedia and Amazon datasets, the training-test\nsplit is set to 80:20. For the Chinese-English dictionary, we use entries from MDBG.\u00b3\nTo test low-resource languages, we use data from the LORELEI Sinhalese language pack [31]. These\nlanguage packs are created to develop technologies that can process data in low-resource languages.\nIn the pack, only a small subset of documents are labeled based on need type.\u2074 So, we treat the\nclassi\ufb01cation task as a semi-supervised problem. There are eight possible labels: evacuation, food\nsupply, search/rescue, utilities, infrastructure, medical assistance, shelter, and water supply [32]. Out\nof the 1,100 (4,790) English (Sinhalese) documents, only 77 (49) of them have labels. For each\nlanguage, half of the labeled documents are in the training set and the other half are in the test set.\nFor the Sinhalese-English dictionary, we use entries from the LORELEI Sinhalese language pack.\nWe run experiments to evaluate three methods: multilingual anchoring, MTAnchor, and MCTA\n(Multilingual Cultural-common Topic Analysis) [33]. We choose MCTA as a baseline because it is\na recent work on multilingual topic models that has readily available code and aligns topics using a\nbilingual dictionary.\n
We train multilingual anchoring and MCTA models with twenty topics. For\nMTAnchor, we initially show users twenty topics, but the \ufb01nal number of topics is their choice. All\nmethods are implemented in Python and run on a 2.3 GHz Intel Core i5 processor.\n\n\u00b3 https://www.mdbg.net/chinese/dictionary?page=cc-cedict.\n\u2074 Documents in LORELEI language pack have multiple need types, but we have simpli\ufb01ed the classi\ufb01cation\ntask by assigning only the \ufb01rst label to each document.\n\nFigure 4: Classi\ufb01cation accuracy over time until MCTA converges. For the Wikipedia dataset,\nmultilingual anchoring converges within 5 minutes, but MCTA takes 5 hours and 18 minutes to\nconverge. Multilingual anchoring outperforms MCTA in speed and classi\ufb01cation accuracy.\n\nThe data for the MTAnchor user study are the English-Chinese Wikipedia articles. We invite twenty\nparticipants on Amazon Mechanical Turk (MTurk) to partake in the study. Each user is given thirty\nminutes to interact with the interface.\u2075 MTAnchor scales with the number of unique word types,\nrather than number of documents or number of words in the documents, so updates to the system take\nno longer than seven seconds on average. We only approve HITs from workers who have completed\nthe task for the \ufb01rst time. After a worker \ufb01nishes the task, the interface provides a unique code for\nthem to enter on MTurk. These rules ensure fair assessment of workers\u2019 interaction with MTAnchor.\n\n4.1 Evaluating multilingual topics\n\nIdeally, topic models should have topics that are interpretable and useful as classi\ufb01cation features.\nSo, we primarily base evaluation on two measures: classi\ufb01cation accuracy and topic coherence.\nMeasuring topic coherence considers both intrinsic and extrinsic scores [34]. The difference between\nthe two is the reference corpus.\u2076 The intrinsic score uses the trained corpus itself, whereas the extrinsic\nscore uses an external, larger dataset.\n
The Sinhalese extrinsic coherence scores are not available\nbecause a large reference corpus cannot be formed for low-resource languages. By measuring both,\nwe can evaluate the model\u2019s interpretability within a local and global context.\nWe evaluate these metrics separately for each language: English (EN), Chinese (ZH), and\nSinhalese (SI). To classify labels from topics, we use the same procedure as described in Section 3.1.\nThen, we measure intra-lingual (I) and cross-lingual accuracy (C) with F1 scores. Intra-lingual\naccuracy refers to the percentage of documents classi\ufb01ed correctly using a classi\ufb01er trained on documents\nin the same language. Cross-lingual accuracy refers to the percentage of documents classi\ufb01ed correctly\nusing a classi\ufb01er trained on documents in a different language (testing the algorithm\u2019s ability to\ngeneralize). For topic coherence, we use the NPMI (normalized pointwise mutual information) variant\nof automated topic interpretability scores over the \ufb01fteen most probable words in a topic [34]. For\nintrinsic scores (I), we use the trained corpus itself as the reference corpus. For extrinsic scores (E),\nwe use 2.2M English Wikipedia articles and 1.1M Chinese Wikipedia articles.\nDuring the user study, we hold out 100 documents as a development set for each corpus. Each time\nthe user updates topics, the interface shows classi\ufb01cation accuracy on the development set. When the\nuser submits their \ufb01nal anchor words, we evaluate their topics on the test set.\n\n\u2075 Synopsis of user instructions: \u201cThere are 11,000 English Wikipedia articles and 10,000 Chinese Wikipedia\narticles, which belong to one of six categories: \ufb01lm, music, animals, politics, religion, food.\n
Your goal is to \ufb01nd\ntopics that can help classify documents within 30 minutes.\u201d\n\n\u2076 Measuring topic coherence requires a reference corpus to sample lexical probabilities.\n\n[Figure 4 plot: F1 score versus time (min) for MCTA, MTAnchor (max), and multilingual anchoring; panels show training on Chinese or English and testing on Chinese or English.]\n\nFigure 5: Classi\ufb01cation accuracy of each participant in the MTAnchor user study over time. Each\nplot indicates the language of topics that the classi\ufb01er is trained on and the language of topics that the\nclassi\ufb01er is tested on. The black horizontal line denotes multilingual anchoring score (no interactive\nupdates). Each colored line represents a different user interaction and shows the \ufb02uctuation in scores\non development set (left). Each colored point represents the \ufb01nal classi\ufb01cation score on the test set;\nthe point\u2019s x-coordinate indicates total duration of user\u2019s session (right).\n\n4.2 Results\n\nIn experiments, multilingual anchoring converges much faster than MCTA (Figure 4). We compare\nscores across experiments for multilingual anchoring, MTAnchor, and MCTA, but only report the\nmaximum and median scores from MTAnchor user experiments (Table 1). For English-Chinese\ndatasets, multilingual anchoring performs better than MCTA in all metrics. For the English-Sinhalese\nLORELEI dataset, topics from multilingual anchoring are more useful for classi\ufb01cation tasks but are\nless coherent than MCTA topics.\nIn every metric, the MTAnchor maximum score across all users is higher than scores from other\nmethods (Table 1). The MTAnchor median score across all users is approximately the same as that of\nmultilingual anchoring for all metrics. A few users outperform multilingual anchoring by spending\nmore time interacting with the model (Figure 5).\n
Within thirty minutes, a user can improve topic\ncoherence and reach up to a 0.40 increase in any one of the classi\ufb01cation scores.\n\n5 Related work and discussion\n\nPrior work on multilingual topic models mainly follows a generative approach. The Polylingual\nTopic Model [1] assumes that documents are topically aligned to track topic trends across languages.\nJointLDA [3] makes use of a bilingual dictionary and introduces \u201cconcepts\u201d as a way to connect\nwords from different languages. By optimizing over cross-lingual corpora, the model learns better monolingual\nmodels than LDA does when trained only on monolingual data. The Polylingual Tree-based\nTopic Model [7] builds tree priors to incorporate word correlation and document alignment\ninformation. MCTA [33] is another generative, multilingual model, but uses dictionary entries to\ncapture \u201ccultural-common\u201d topics.\nMultilingual anchoring is a spectral approach to modeling multilingual topics. The algorithm\nconverges much faster than generative methods (Figure 4) and resulting topics form better vector\nrepresentations for documents (Table 1). An advantage of anchoring over generative models is its\nrobustness and practicality [14]. Generative methods need long documents to correctly estimate\ntopic-word distributions, but anchoring handles documents of any size [13]. This is evident in models\nbuilt on the Amazon dataset, which contains reviews with only one to three sentences. The health\ntopic for multilingual anchoring is more interpretable than that of MCTA (Table 2).\nArora et al. [14] observe that more speci\ufb01c words appear in the top words of anchor-based topics.\nThis is clearly shown in the LORELEI experiments; a topic from MCTA has general words like \u201chelp\u201d\nand \u201cneed\u201d, while a topic from multilingual anchoring has speci\ufb01c words like \u201caranayake\u201d and\n\u201cnbro\u201d (Table 2).\n
\n\n[Figure 5 plot: F1 score versus time (min); panels by training language (Chinese, English) and evaluation set (Dev: Chinese, Dev: English, Test: Chinese, Test: English).]\n\nTable 2: Top seven words of sample English and Chinese topics are shown with anchors bolded.\nTopics from multilingual anchoring and MTAnchor are more relevant to document labels, thereby\nmaking them more useful as features for classification.\n\nDataset    Method                  Topic\nWikipedia  MCTA                    dog san movie mexican fighter novel california\n                                   \u4e3b\u6f14 \u6539\u7de8 \u672c \u5c0f\u8aaa \u62cd\u651d \u89d2\u8272 \u6230\u58eb\n           Multilingual anchoring  adventure daughter bob kong hong robert movie\n                                   \u4e3b\u6f14 \u6539\u7de8 \u672c\u7247 \u98fe\u6f14 \u5192\u96aa \u8b1b\u8ff0 \u7de8\u5287\n           MTAnchor                kong hong movie office martial box reception\n                                   \u4e3b\u6f14 \u6539\u7de8 \u98fe\u6f14 \u672c\u7247 \u6f14\u54e1 \u7de8\u5287 \u8b1b\u8ff0\nAmazon     MCTA                    woman food eat person baby god chapter\n                                   \u4f86\u8ca8 \u9802\u9802 \u6c34 \u8033\u6a5f \u8ca8\u7269 \u5f35\u5091 \u5091 \u540c\u6a23\n           Multilingual anchoring  eat diet food recipe healthy lose weight\n                                   \u5065\u5eb7 \u5e6b \u5403 \u8eab\u9ad4 \u5168\u9762 \u540c\u4e8b \u4e2d\u91ab\nLORELEI    MCTA                    help need floodrelief please families needed victim\n           Multilingual anchoring  aranayake warning landslide site missing nbro areas\n\nBoth topics are about the 2016 Sri Lankan floods, but the topic from MCTA cannot\nspecify the \u201cneed\u201d type of documents. So, accuracy is higher when using topics from multilingual\nanchoring to classify documents. However, LORELEI experiments show that multilingual anchoring\ntopics are less interpretable than MCTA topics. 
This might be caused by the obscure top topic words.\nAranayake is a Sri Lankan town and \u201cnbro\u201d stands for National Building Research Organization.\nThese words may have lowered coherence because they do not co-occur frequently with other top\ntopic words. In this case, using MTAnchor can possibly increase topic coherence.\nIn the user study, a few participants create topics that are more applicable for specific tasks. In one\nexperiment, a user finds the topic with anchor words \u201cadventure\u201d and \u201c\u5192\u96aa(m\u00e0oxi\u01cen)\u201d too vague.\nThe user knows that the task is to classify Wikipedia articles into one of six categories, so they add\nmovie-related terms as anchors, like \u201cmovie\u201d, \u201c\u6f14\u54e1(y\u01cenyu\u00e1n)\u201d, and \u201c\u7de8\u5287(bi\u0101nj\u00f9)\u201d. Afterward,\ntheir topics improve significantly in classification accuracy and coherence. Other participants do not\nsignificantly change the topic model through interactive updates. More work can look into improving\nMTAnchor so that updates change topic distributions more drastically.\nInterestingly, the scores for English topics increase considerably after user interaction compared to\nChinese topics (Table 1). The participants are anonymous MTurk workers, so we are not aware of\ntheir language skills. We believe that workers are most likely fluent in English because the MTurk\nwebsite is only available in English. If this holds true, then it can explain why the English topics\nhave much higher scores than the Chinese ones. It also shows that people can improve topic models\nwith prior knowledge, which supports the need for human-in-the-loop algorithms. In the future, it\nwould be interesting to observe how language fluency affects the quality of multilingual topics.\n\n6 Conclusion\n\nWe present spectral and interactive topic models for multilingual document collections. 
The goal is\nto bridge the language gap using a multitude of resources: a dictionary, corpora, statistical models,\nand human input. A model that relies entirely on one resource is impractical for use in many\nsettings, especially for low-resource situations. Multilingual anchoring can work with or without\nlabel supervision. Dictionary entries can be scarce or not fully accurate. People can use MTAnchor\nwithout a deep knowledge of topic modeling or machine learning. The method\u2019s versatility and speed\nmake it an alternative to models like neural networks, which need a preponderance of labeled data.\nFuture work can focus on understanding the effect of human input on multilingual topic models and\naccurately reflecting their feedback in cross-lingual representations.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their insightful and constructive comments. Additionally, we\nthank Leah Findlater, Jeff Lund, Thang Nguyen, Shi Feng, Mozhi Zhang, Weiwei Yang, Eric Wallace,\nand Manasij Venkatesh for their helpful feedback. This work was supported in part by the JHU\nHuman Language Technology Center of Excellence (HLTCOE) and Raytheon BBN Technologies,\nby DARPA award HR0011-15-C-0113. Any opinions, findings, conclusions, or recommendations\nexpressed here are those of the authors and do not necessarily reflect the view of the sponsors.\n\nReferences\n[1] Mimno, D., H. M. Wallach, J. Naradowsky, et al. Polylingual topic models. In Proceedings of\nEmpirical Methods in Natural Language Processing. 2009.\n\n[2] Guti\u00e9rrez, E. D., E. Shutova, P. Lichtenstein, et al. Detecting cross-cultural differences using a\nmultilingual topic model. Transactions of the Association for Computational Linguistics, 2016.\n\n[3] Jagarlamudi, J., H. Daum\u00e9. Extracting multilingual topics from unaligned comparable corpora.\nIn Proceedings of the European Conference on Information Retrieval. 2010.\n\n[4] Zhao, B., E. P. Xing. 
BiTAM: Bilingual topic admixture models for word alignment. In\nProceedings of International Conference on Computational Linguistics. 2006.\n\n[5] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised\nlatent Dirichlet allocation. In Proceedings of Empirical Methods in Natural Language\nProcessing. 2010.\n\n[6] Ni, X., J.-T. Sun, J. Hu, et al. Mining multilingual topics from Wikipedia. In Proceedings of the\nWorld Wide Web Conference. 2009.\n\n[7] Hu, Y., K. Zhai, V. Eidelman, et al. Polylingual tree-based topic models for translation domain\nadaptation. In Proceedings of the Association for Computational Linguistics. 2014.\n\n[8] Morrow, N., N. Mock, A. Papendieck, et al. Independent evaluation of the Ushahidi Haiti\nproject. Development Information Systems International, 2011.\n\n[9] Choo, J., C. Lee, C. K. Reddy, et al. Utopian: User-driven topic modeling based on interactive\nnonnegative matrix factorization. IEEE transactions on visualization and computer graphics,\n2013.\n\n[10] Hu, Y., J. Boyd-Graber, B. Satinoff, et al. Interactive topic modeling. Machine Learning, 2014.\n\n[11] Lee, T. Y., A. Smith, K. Seppi, et al. The human touch: How non-expert users perceive, interpret,\nand fix topic models. International Journal of Human-Computer Studies, 2017.\n\n[12] Lund, J., C. Cook, K. Seppi, et al. Tandem anchoring: A multiword anchor approach for\ninteractive topic modeling. In Proceedings of the Association for Computational Linguistics.\n2017.\n\n[13] Arora, S., R. Ge, A. Moitra. Learning topic models\u2013going beyond SVD. In Foundations of\nComputer Science (FOCS). 2012.\n\n[14] Arora, S., R. Ge, Y. Halpern, et al. A practical algorithm for topic modeling with provable\nguarantees. In Proceedings of the International Conference of Machine Learning. 2013.\n\n[15] Mauro, C., G. Christian, F. Marcello. 
Wit3: Web inventory of transcribed and translated talks.\nIn Proceedings of the European Association for Machine Translation. 2012.\n\n[16] Graff, D. UN Parallel Text, 1994. https://catalog.ldc.upenn.edu/LDC94T4A.\n\n[17] Boyd-Graber, J., Y. Hu, D. Mimno. Applications of topic models. Foundations and Trends\u00ae in\nInformation Retrieval, 2017.\n\n[18] Blei, D. M., A. Y. Ng, M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\nResearch, 2003.\n\n[19] Landauer, T. K., P. W. Foltz, D. Laham. An introduction to latent semantic analysis. Discourse\nprocesses, 1998.\n\n[20] Lee, M., D. Mimno. Low-dimensional embeddings for interpretable anchor-based topic inference.\nIn Proceedings of Empirical Methods in Natural Language Processing. 2014.\n\n[21] Nguyen, T., J. Boyd-Graber, J. Lund, et al. Is your anchor going up or down? fast and accurate\nsupervised topic models. In Conference of the North American Chapter of the Association for\nComputational Linguistics. 2015.\n\n[22] Yurochkin, M., A. Guha, X. Nguyen. Conic scan-and-cover algorithms for nonparametric topic\nmodeling. In Proceedings of Advances in Neural Information Processing Systems. 2017.\n\n[23] Boyd-Graber, J., D. M. Blei. Multilingual topic models for unaligned text. In Proceedings of\nUncertainty in Artificial Intelligence. 2009.\n\n[24] Bengio, Y., A. Courville, P. Vincent. Representation learning: A review and new perspectives.\nIEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.\n\n[25] Xiao, M., Y. Guo. A novel two-step method for cross language representation learning. In\nProceedings of Advances in Neural Information Processing Systems. 2013.\n\n[26] Rastogi, P., B. Van Durme, R. Arora. Multiview LSA: Representation learning via generalized\nCCA. In Conference of the North American Chapter of the Association for Computational\nLinguistics. 2015.\n\n[27] Fan, R.-E., K.-W. Chang, C.-J. Hsieh, et al. 
LIBLINEAR: A library for large linear classification.\nJournal of Machine Learning Research, 2008.\n\n[28] Bird, S., E. Klein, E. Loper. Natural language processing with Python: analyzing text with the\nnatural language toolkit. O\u2019Reilly Media, Inc., 2009.\n\n[29] Manning, C., M. Surdeanu, J. Bauer, et al. The Stanford CoreNLP natural language processing\ntoolkit. In Proceedings of the Association for Computational Linguistics. 2014.\n\n[30] Constant, N., C. Davis, C. Potts, et al. The pragmatics of expressive content: Evidence from\nlarge corpora. Sprache und Datenverarbeitung, 2009.\n\n[31] Strassel, S., J. Tracey. LORELEI language packs: Data, tools, and resources for technology\ndevelopment in low resource languages. In Language Resources and Evaluation Conference.\n2016.\n\n[32] Strassel, S., A. Bies, J. Tracey. Situational awareness for low resource languages: the LORELEI\nsituation frame annotation task. In Exploitation of Social Media for Emergency Relief and\nPreparedness. 2017.\n\n[33] Shi, B., W. Lam, L. Bing, et al. Detecting common discussion topics across culture from news\nreader comments. In Proceedings of the Association for Computational Linguistics. 2016.\n\n[34] Lau, J. H., D. Newman, T. Baldwin. Machine reading tea leaves: Automatically evaluating topic\ncoherence and topic model quality. In Proceedings of the European Chapter of the Association\nfor Computational Linguistics. 2014.\n\n[35] Hao, S., M. J. Paul, J. Boyd-Graber. Lessons from the bible on modern topics: Multilingual topic\nmodel evaluation on low-resource languages. In Conference of the North American Chapter of\nthe Association for Computational Linguistics. 2018.\n\n[36] Nguyen, T., Y. Hu, J. Boyd-Graber. Anchors regularized: Adding robustness and extensibility\nto scalable topic-modeling algorithms. In Proceedings of the Association for Computational\nLinguistics. 
2014.", "award": [], "sourceid": 5230, "authors": [{"given_name": "Michelle", "family_name": "Yuan", "institution": "University of Maryland, College Park"}, {"given_name": "Benjamin", "family_name": "Van Durme", "institution": "Johns Hopkins University"}, {"given_name": "Jordan", "family_name": "Ying", "institution": "University of Maryland"}]}