{"title": "Sphere Embedding: An Application to Part-of-Speech Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 1567, "page_last": 1575, "abstract": "Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks.", "full_text": " \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\nSphere Embedding : \n\nAn Application to Part-of -Speech Induction \n\nYariv Maron \n\nMichael Lamar \n\nGonda Brain Research Center \n\nDepartment of Mathematics and Computer Science \n\nBar-Ilan University \n\nRamat-Gan 52900, Israel \n\nsyarivm@yahoo.com \n\nSaint Louis University \n\nSt. Louis, MO 63103, USA \n\nmlamar@slu.edu \n\n \n \n \n \n \n \n \n\nElie Bienenstock \n\nDivision of Applied Mathematics \nAnd Department of Neuroscience \n\nBrown University \n\nProvidence, RI 02912, USA \n\nelie@brown.edu \n\n \n\n \n\n \n \n \n \n\nAbstract \n\nMotivated by an application to unsupervised part-of-speech tagging, we \npresent an algorithm for the Euclidean embedding of large sets of \ncategorical data based on co-occurrence statistics. We use the CODE model \nof Globerson et al. but constrain the embedding to lie on a hig h-\ndimensional unit sphere. This constraint allows for efficient optimization, \neven in the case of large datasets and high embedding dimensionality . 
Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. \n\n1 Introduction \n\nThe embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. \n\nThe problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. \n\nUsing a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects \"left word\" and \"right word\" based on their co-occurrence statistics, i.e., based on bigram frequencies. 
We are interested in modeling the statistical interactions between left words and right words, as relevant to POS tagging, rather than their joint distribution. Indeed, modeling the joint distribution directly results in models that do not handle rare words well. We use the CODE (Co-Occurrence Data Embedding) model of [3], where statistical interaction is modeled as the negative exponential of the Euclidean distance between the embedded points. This embedding model incorporates the marginal probabilities, or unigram frequencies, in a way that results in appropriate handling of both frequent and rare words. \n\nThe size of the dataset (number of points to embed) and the embedding dimensionality are several-fold larger than in the applications studied in [3], making the optimization methods used by these authors impractical. Instead, we use a simple and intuitive stochastic-gradient procedure. Importantly, in order to handle both the large dataset and the relatively high dimensionality of the embedding needed for this application, we constrain the embedding to lie on the unit sphere. We therefore refer to this method as Spherical CODE, or S-CODE. The spherical constraint causes the regularization term\u2014the partition function\u2014to be nearly constant and also makes the stochastic gradient ascent smoother; this allows a several-fold computational improvement, and yields excellent performance. After convergence of the embedding model, we use a k-means algorithm to cluster all the words of the corpus, based on their embeddings. The induced POS labels are evaluated using the standard setting for this task, yielding state-of-the-art tagging performance. \n\n2 Methods \n\n2.1 Model \n\nWe represent a bigram, i.e., an ordered pair of adjacent words in the corpus, as joint random variables (X,Y), each taking values in W, the set of word types occurring in the corpus. 
Since X and Y, the first and second words in a bigram, play different roles, we build a heterogeneous model, i.e., use two embedding functions, φ and ψ. Both map W into S, the unit sphere in the r-dimensional Euclidean space. \n\nWe use p for the word-type frequencies: p(x) is the number of word tokens of type x divided by the total number of tokens in the corpus. We refer to p as the empirical marginal distribution, or unigram frequency. We use p(x,y) for the empirical joint distribution of X and Y, i.e., the distribution of bigrams (X,Y). Because our ultimate goal is the clustering of word types for POS tagging, we want the embedding to be insensitive to the marginals: two word types with similar context distributions should be mapped to neighboring points in S even if their unigram frequencies are very different. We therefore use the marginal-marginal model of [3], defined by: \n\nP(x,y) = (1/Z) p(x) p(y) exp(-d^2(x,y)), (1) \n\nd(x,y) = ||φ(x) - ψ(y)||, (2) \n\nZ = Σ_{x,y} p(x) p(y) exp(-d^2(x,y)). (3) \n\nThe log-likelihood, ℓ, of the corpus of bigrams is the expected value, under the empirical bigram distribution, of the log of the model bigram probability: \n\nℓ = Σ_{x,y} p(x,y) log P(x,y) = -Σ_{x,y} p(x,y) d^2(x,y) - log Z + Σ_{x,y} p(x,y) log(p(x) p(y)). (4) \n\nThe model is parameterized by 2\u00d7|W| points on the unit sphere S in r dimensions: {φ(w)} and {ψ(w)}, w in W. These points are initialized randomly, i.e., independently and uniformly on S. \n\nTo maximize the likelihood, we use a gradient-ascent approach. The gradient of the log likelihood is as follows (observe that the last term in (4) does not depend on the model, hence does not contribute to the gradient): \n\n∂ℓ/∂φ(x) = 2 Σ_y p(x,y) (ψ(y) - φ(x)) - (2/Z) Σ_y p(x) p(y) exp(-d^2(x,y)) (ψ(y) - φ(x)), (5) \n\n∂ℓ/∂ψ(y) = 2 Σ_x p(x,y) (φ(x) - ψ(y)) - (2/Z) Σ_x p(x) p(y) exp(-d^2(x,y)) (φ(x) - ψ(y)). (6) \n\nFor sufficiently large problems such as POS tagging of a large corpus, computing the partition function, Z, after each gradient step or even once every fixed number of steps can be impractical. 
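The model and its log-likelihood (Eqs. 1-4) can be sketched numerically as follows. This is an illustrative reimplementation, not the authors' code; array names and shapes are our own, and it assumes a vocabulary small enough that the full |W| x |W| distance matrix fits in memory:

```python
import numpy as np

def model_and_loglik(phi, psi, p_x, p_y, p_joint):
    """Marginal-marginal CODE model (Eqs. 1-4).

    phi, psi : |W| x r arrays of unit vectors (the two embeddings).
    p_x, p_y : empirical unigram distributions.
    p_joint  : empirical bigram distribution (|W| x |W|).
    Returns model bigram probabilities, Z, and the log-likelihood.
    """
    # squared Euclidean distances between every phi(x) and psi(y)
    d2 = ((phi[:, None, :] - psi[None, :, :]) ** 2).sum(axis=-1)
    unnorm = p_x[:, None] * p_y[None, :] * np.exp(-d2)
    Z = unnorm.sum()          # partition function (Eq. 3)
    P = unnorm / Z            # model probabilities (Eq. 1)
    mask = p_joint > 0        # skip unseen bigrams in the expectation
    loglik = (p_joint[mask] * np.log(P[mask])).sum()   # Eq. 4
    return P, Z, loglik

# toy example: 4 word types embedded on the circle (r = 2)
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 2)); phi /= np.linalg.norm(phi, axis=1, keepdims=True)
psi = rng.normal(size=(4, 2)); psi /= np.linalg.norm(psi, axis=1, keepdims=True)
p = np.full(4, 0.25)
joint = np.full((4, 4), 1 / 16)
P, Z, ll = model_and_loglik(phi, psi, p, p, joint)
```

Note that Z is at most 1 here, since exp(-d^2) is at most 1 and the marginals sum to 1; this is the quantity the sphere constraint will later allow us to treat as a constant.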
Instead, it turns out (see Discussion) that, thanks to the sphere constraint, we can approximate this dynamic variable, Z, using a constant, Ẑ, which arises from a coarse approximation in which all pairs of embedded variables are distributed uniformly and independently on the sphere. Thus, we replace φ(x) and ψ(y) by U and V, i.i.d. uniform on S, and take our estimate Ẑ to be the expected value of the resulting random variable exp(-||U - V||^2): \n\nẐ = E[exp(-||U - V||^2)]. (7) \n\nNumerical evaluation of (7) yields Ẑ ≈ 0.1456 for the 25-dimensional sphere. An even coarser approximation can be obtained by noting that, for large r, the random variable ||U - V||^2 is fairly peaked around 2 (the random variable ||U - V||^2 - 2 is close to a Student's t with r degrees of freedom, compressed by a factor of √r/2). This yields the estimate Ẑ ≈ e^{-2} ≈ 0.135. \n\nFor the present application, we find that performance does not suffer from using a constant Ẑ rather than recomputing Z often during gradient ascent. It is also fairly robust to the choice of Ẑ. We observe only minor changes in performance for Ẑ ranging over [0.1, 0.5]. \n\nWe use sampling to compute a stochastic approximation of the gradient. To implement the first sum in (5) and (6) \u2212 representing an attraction force between the embeddings of the words in a bigram \u2212 we sample bigrams from the empirical joint p(x,y). Given a sample (x1, y1), only the φ(x1) and ψ(y1) parameter vectors are updated. The partial updates that emerge from these two sums are: \n\nφ(x1) ← φ(x1) + η (ψ(y1) - φ(x1)), (8) \n\nψ(y1) ← ψ(y1) + η (φ(x1) - ψ(y1)), (9) \n\nwhere η is the step size. In order to speed up the convergence process, we use a learning rate that decreases as word types are repeatedly observed. If C_w is the number of times word type w has been previously encountered, we use: \n\nη(w) = η0 · f(C_w), (10) \n\nwhere f is a smoothly decreasing function. The model is very robust to the choice of the function f(C), as long as it decreases smoothly. This modified learning rate also reduces the variability of the tagging accuracy, while slightly increasing its mean. 
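Equation (7) is easy to check by simulation. The sketch below (our own, with an illustrative sample size) draws i.i.d. uniform points on the sphere by normalizing Gaussian vectors and averages exp(-||U - V||^2):

```python
import numpy as np

def z_hat_estimate(r, n_samples=200_000, seed=0):
    """Monte Carlo evaluation of Eq. (7): E[exp(-||U - V||^2)]
    for U, V i.i.d. uniform on the unit sphere in r dimensions."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_samples, r))
    v = rng.normal(size=(n_samples, r))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # normalized Gaussian
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # vectors are uniform on S
    d2 = ((u - v) ** 2).sum(axis=1)   # ||U - V||^2, concentrated near 2
    return float(np.exp(-d2).mean())

z25 = z_hat_estimate(25)   # close to 0.1456 for the 25-dimensional sphere
```

As r grows, the estimate approaches the coarser value e^{-2} ≈ 0.135 quoted above, since ||U - V||^2 concentrates at 2.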
The second sum in (5) and in (6) \u2212 representing a repulsion force \u2212 involves not the empirical joint but the product of the empirical marginals. Thus, the complete update is: \n\nφ(x2) ← φ(x2) - η (1/Ẑ) exp(-d^2(x2,y2)) (ψ(y2) - φ(x2)), (11) \n\nψ(y2) ← ψ(y2) - η (1/Ẑ) exp(-d^2(x2,y2)) (φ(x2) - ψ(y2)), (12) \n\nwhere (x1, y1) is sampled from the joint p(x,y), and x2 and y2 are sampled from the marginal p independently from each other and independently from x1 and y1. After each step, the updated vectors are projected back onto the sphere S. \n\nAfter convergence, for any word w, we have two embedded vectors, φ(w) and ψ(w). These vectors are concatenated to form a single geometric description of word type w. The collection of all these vectors is then clustered using a weighted k-means clustering algorithm: in each iteration, a cluster\u2019s centroid is updated as the weighted mean of its currently assigned constituent vectors, with the weight of the vector for word w equal to p(w). The number of clusters chosen depends on whether evaluation is to be done against the PTB45 or the PTB17 tagset (see below, Section 2.2).1 \n\n2.2 Evaluation and data \n\nThe resulting assignment of cluster labels to word types is used to label the corpus. The standard practice for evaluating the performance of the induced labels is to either map them to the gold-standard tags, or to use an information-theoretic measure. We use the three evaluation criteria that are most common in the recent literature. The first criterion maps each cluster to the POS tag that it best matches according to the hand-annotated labels. The match is determined by finding the tag that is most frequently assigned to any token of any word type in the cluster. Because the criterion is free to assign several clusters to the same POS tag, this evaluation technique is called many-to-one mapping, or MTO. 
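One full stochastic update (the attraction of Eqs. 8-9, the repulsion of Eqs. 11-12, and the projection back onto the sphere) can be sketched as below. This is our own illustrative code, not the authors' implementation: phi and psi are stored as |W| x r arrays, and z_hat is the constant standing in for the partition function:

```python
import numpy as np

def s_code_step(phi, psi, x1, y1, x2, y2, eta, z_hat):
    """One stochastic-gradient step of S-CODE.

    (x1, y1) is a bigram sampled from the empirical joint;
    x2 and y2 are sampled independently from the empirical marginals.
    """
    # attraction: pull phi(x1) and psi(y1) toward each other (Eqs. 8, 9)
    diff = psi[y1] - phi[x1]
    phi[x1] += eta * diff
    psi[y1] -= eta * diff
    # repulsion: push phi(x2) and psi(y2) apart, scaled by exp(-d^2)/z_hat
    diff = psi[y2] - phi[x2]
    coef = eta * np.exp(-(diff ** 2).sum()) / z_hat
    phi[x2] -= coef * diff
    psi[y2] += coef * diff
    # project every updated vector back onto the unit sphere
    for mat, idx in ((phi, x1), (psi, y1), (phi, x2), (psi, y2)):
        mat[idx] /= np.linalg.norm(mat[idx])

# toy usage: 5 word types, r = 3
rng = np.random.default_rng(1)
phi = rng.normal(size=(5, 3)); phi /= np.linalg.norm(phi, axis=1, keepdims=True)
psi = rng.normal(size=(5, 3)); psi /= np.linalg.norm(psi, axis=1, keepdims=True)
s_code_step(phi, psi, x1=0, y1=2, x2=3, y2=4, eta=0.1, z_hat=0.1456)
```

The projection step is what keeps every parameter vector on S, which in turn is what justifies treating Z as the constant z_hat.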
Once the map is constructed, the accuracy score is obtained as the fraction of all tokens whose inferred tag under the map matches the hand-annotated tag. \n\nThe second criterion, 1-to-1 mapping, is similar to the first, but the mapping is restricted from assigning multiple clusters to a single tag; hence it is called one-to-one mapping, or 1-to-1. Most authors construct the 1-to-1 mapping greedily, assigning maximal-score label-to-tag matches first; some authors, e.g. [15], use the optimal map. Once the map is constructed, the accuracy is computed just as in MTO. The third criterion, variation of information, or VI, is a map-free information-theoretic metric [9, 2]. \n\nWe note that we and other authors found the most reliable criterion for comparing unsupervised POS taggers to be MTO. However, we include all three criteria for completeness. \n\nWe use the Wall Street Journal part of the Penn Treebank [8] (1,173,766 tokens). We ignore capitalization, leaving 43,766 word types, to compare performance with other models consistently. Evaluation is done against the full tag set (PTB45), and against a coarse tag set (PTB17) [12]. For PTB45 evaluation, we use either 45 or 50 clusters, in order for our results to be comparable to all recent works. For PTB17 evaluation, we use 17 clusters, as do all other authors. \n\n1 Source code is available at the author\u2019s website: faculty.biu.ac.il/~marony. \n\n3 Results \n\nFigure 1 shows the model performance when evaluated with several measures. MTO17 and MTO50 refer to the number of tokens tagged correctly under the many-to-1 mapping for the PTB17 and PTB45 tagsets respectively. The type-accuracy curves use the same mapping and tagsets, but record the fraction of word types whose inferred tag matches their \"modal\" annotated tag, i.e., the annotated tag co-occurring most frequently with this word type. 
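The many-to-one criterion described above can be made concrete in a few lines of code. This is an illustrative sketch with toy labels; a real evaluation runs over all corpus tokens:

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """MTO evaluation: map each induced cluster to its most frequent
    gold tag, then score token-level accuracy under that map.

    induced, gold : equal-length sequences of per-token labels.
    """
    counts = defaultdict(Counter)
    for c, t in zip(induced, gold):
        counts[c][t] += 1
    # several clusters may map to the same tag (hence "many-to-one")
    mapping = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}
    correct = sum(mapping[c] == t for c, t in zip(induced, gold))
    return correct / len(gold)

induced = [0, 0, 1, 1, 1, 2]
gold = ["DT", "DT", "NN", "NN", "VB", "NN"]
acc = many_to_one_accuracy(induced, gold)   # clusters 0->DT, 1->NN, 2->NN
```

In this toy run, 5 of the 6 tokens are tagged correctly under the induced map. The 1-to-1 variant differs only in constraining `mapping` to be injective.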
We also show the scaled log likelihood, to illustrate its convergence. These results were produced using a constant, pre-computed, Ẑ. Using this constant value allows the model to run in a matter of minutes rather than the hours or days required by HMMs and MRFs. \n\nFigure 1: Scores against number of iterations (bigram updates). Scores are averaged over 10 sessions, and shown with 1-std error bars. MTO17 is the Many-to-1 tagging accuracy score based on 17 induced labels mapped to 17 tags. MTO50 is the Many-to-1 score based on 50 induced labels mapped to 45 tags. Type Accuracy 17 (50) is the average accuracy per word type, where the gold-standard tag of a word type is the modal annotated tag of that type (see text). All runs used Ẑ = 0.154, r = 25. \n\nFigure 2: Comparison of models with different dimensionalities: r = 2, 5, 10, 25. MTO17 is the Many-to-1 score based on 17 induced labels mapped to PTB17 tags. \n\nFigure 2 shows the model performance for different dimensionalities r. As r increases, so does the performance. Unlike previous applications of CODE [3] (which often emphasize visualization of data and thus require a low dimension), this unsupervised POS-tagging application benefits from high values of r. Larger values of r cause both the tagging accuracy to improve and the variability during convergence to decrease. 
| Model | Many-to-1 PTB17 | Many-to-1 PTB45-45 | Many-to-1 PTB45-50 | 1-to-1 PTB17 | 1-to-1 PTB45-45 | 1-to-1 PTB45-50 | VI PTB17 | VI PTB45-45 | VI PTB45-50 | \n| S-CODE (Z=0.1456) | 73.8 (0.5) | 68.8 (0.16) | 70.4 (0.5) | 52.2 | 50.0 | 50.0 | 2.93 | 3.46 | 3.46 | \n| S-CODE (Z=0.3) | 74.5 (0.2) | 68.6 (0.16) | 71.5 (0.6) | 54.9 | 48.7 | 48.8 | 2.80 | 3.38 | 3.39 | \n| LDC | 75.1 (0.04) | 68.1 (0.2) | 71.2 (0.06) | 59.3 | \u2013 | 48.3 | \u2013 | \u2013 | \u2013 | \n| Brown | \u2013 | 67.8 | 70.5 | \u2013 | 50.1 | 51.3 | \u2013 | 3.47 | 3.45 | \n| HMM-EM | 64.7 | 62.1 | \u2013 | 43.1 | 40.5 | \u2013 | \u2013 | 3.86 | \u2013 | \n| HMM-VB | 63.7 | 60.5 | \u2013 | 51.4 | 46.1 | \u2013 | \u2013 | 3.44 | \u2013 | \n| HMM-GS | 67.4 | 66.0 | \u2013 | 44.6 | 49.9 | \u2013 | \u2013 | 3.46 | \u2013 | \n| HMM-Sparse(32) | 70.2 | 68.2 | \u2013 | 49.5 | 52.8 | \u2013 | 4.48 | 4.04 | \u2013 | \n| VEM (10^-1,10^-1) | 65.4 | 54.6 | \u2013 | 44.5 | 46.0 | \u2013 | 4.28 | \u2013 | \u2013 | \n\nTable 1: Comparison to other models, under three different evaluation measures. S-CODE uses r = 25 dimensions. It was run 10 times, each with 12\u00b710^6 update steps. LDC is from [15]; Brown shows the best results from [14] and website mentioned therein; HMM-EM, HMM-VB and HMM-GS show the best results from [2]; HMM-Sparse(32) and VEM show the best results from [5]. The numbers in parentheses are standard deviations. For the VI criterion, lower values are better. PTB45-45 maps 45 induced labels to 45 tags, while PTB45-50 maps 50 induced labels to 45 tags. \n\nTable 1 compares our model, S-CODE, to previous state-of-the-art approaches. Under the Many-to-1 criterion, which we find to be the most appropriate of the three for the evaluation of unsupervised POS taggers, S-CODE is superior to HMM results, and scores comparably to [15], the highest-performing model to date on this task. \n\nWe find that the model is very robust to the choice of Ẑ within the range 0.1 to 0.5. 
This robustness lends promise for the usefulness of this method for other applications in which the partition function is impractical to compute. This point is discussed further in the next section. \n\n4 Discussion \n\nThe problem of embedding heterogeneous categorical data (X,Y) based on their co-occurrence statistics may be formulated as the task of finding a pair of maps φ and ψ such that, for any pair (x,y), the distance between the images of x and y reflects the statistical interaction between them. Such embeddings have been used mostly for the purpose of visualization and exploratory data analysis. Here we demonstrate that embedding can be successfully applied to a well-studied computational-linguistics task, achieving state-of-the-art performance. \n\n4.1 S-CODE v. CODE \n\nThe approach proposed here, S-CODE, is a variant of the CODE model of [3]. In the task at hand, the sets X and Y to be embedded are large (43K), making most conventional embedding approaches, including CODE (as implemented in [3]), impractical. As explained below, S-CODE overcomes the large-dataset challenge by constraining the maps to lie on the unit sphere. It uses stochastic gradient ascent to maximize the likelihood of the model. \n\nThe gradient of the log-likelihood w.r.t. a given φ(x) includes two components, each with a simple intuitive meaning. The first component embodies an attraction force, pulling φ(x) toward ψ(y) in proportion to the empirical joint p(x,y). The second component, the gradient of the regularization term, -log Z, embodies a repulsion force; it keeps the solution away from the trivial state where all x's and y's are mapped to the same point, and more generally attempts to keep Z small. The repulsion force pushes φ(x) away from ψ(y) in proportion to the product of the empirical marginals p(x) and p(y), and is scaled by (1/Z) exp(-d^2(x,y)). The computational complexity of Z, the partition function, is O(|W|^2 r). 
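The O(|W|^2 r) cost of Z comes from the pairwise distance computation over all word-type pairs. A direct implementation (our own sketch, on a toy vocabulary) makes this concrete and also exhibits the extreme values Z can take on degenerate configurations:

```python
import numpy as np

def partition_function(phi, psi, p_x, p_y):
    """Direct O(|W|^2 r) computation of Z over all word-type pairs."""
    d2 = ((phi[:, None, :] - psi[None, :, :]) ** 2).sum(axis=-1)
    return float((p_x[:, None] * p_y[None, :] * np.exp(-d2)).sum())

# degenerate configurations on the unit circle (r = 2), uniform marginals
p = np.full(3, 1 / 3)
e = np.tile([1.0, 0.0], (3, 1))           # all word types at the same point
z_max = partition_function(e, e, p, p)    # coincident maps: d^2 = 0, Z = 1
z_min = partition_function(e, -e, p, p)   # antipodal maps: d^2 = 4, Z = e^-4
```

With the |W| of this paper (about 44K word types), the |W| x |W| distance matrix alone is why recomputing Z at every step is impractical, and why the constant approximation matters.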
In the application studied here, the use of the spherical constraint of S-CODE has two important consequences. First, it makes the computation of Z unnecessary. Indeed, when using the spherical constraint, we observed that Z, when actually computed and updated every 10^6 steps, does not deviate much from its initial value. For example, for r = 25, Z rises smoothly from 0.145 to 0.182. Note that the absolute minimum of Z\u2014obtained for a φ that maps all of W to a single point on S and a ψ that maps all of W to the opposite point\u2014is e^{-4}; the absolute maximum of Z, obtained for φ and ψ that map all of W to the same point, is 1. We also observed that replacing Z, in the update algorithm, by any constant in the range [0.1, 0.5] does not dramatically alter the behavior of the model. We nevertheless note that larger values of Ẑ tend to yield a slightly higher performance of the POS tagger built from the model. Note that the only effect of changing Ẑ in the stochastic gradient algorithm is to change the relative strength of the attraction and repulsion terms. \n\nWe compared the performance of S-CODE with CODE. The original CODE implementation [3] could not support the size of our data set. To overcome this limitation, we used the stochastic-gradient method described above, but without projecting to the sphere. This required us to compute the partition function, which is highly computationally intensive. We therefore computed the partition function only once every q update steps (where one update step is the sampling of one bigram). We found that for q = 10^5 the partition function and likelihood changed smoothly enough and converged, and the embeddings yielded tagging performances that did not differ significantly from those obtained with S-CODE. The second important consequence of imposing the spherical constraint is that it makes the stochastic gradient-ascent procedure markedly smoother. 
As a result, a relatively large step size can be used, achieving convergence and excellent tagging performance in about 10 minutes of computation time on a desktop machine. CODE requires a smaller step size as well as the recomputation of the partition function, and, as a result, computation time in this application was 6 times longer than with S-CODE. \n\nWhen gauging the applicability of S-CODE to different large-scale embedding problems, one should try to gain some understanding of why the spherical constraint stabilizes the partition function, and whether Z will stabilize around the same value for other problems. The answer to the first question appears to be that the regularization term is not so strong as to prevent clusters from forming\u2014this is demonstrated by the excellent performance of the model when used for POS tagging\u2014yet it is strong enough to enforce a fairly uniform distribution of these clusters on the sphere\u2014resulting in a fairly stable value of Z. One may reasonably conjecture that this behavior will generalize to other problems. To answer the second question, we note that the order of magnitude of Z is essentially set by the coarsest of the two estimates derived in Section 2, namely Ẑ ≈ e^{-2} ≈ 0.135, and that this estimate is problem-independent. As a result, S-CODE is, in principle, applicable to datasets of much larger size than the present problem. The computational complexity of the algorithm is O(Nr), and the memory requirement is O(|W|r), where N is the number of word tokens and |W| is the number of word types. In contrast, and as mentioned above, CODE, even in our stochastic-gradient version, is considerably more computationally intensive; it would clearly be completely impractical for much larger datasets. \n\n4.2 Comparison to other POS induction models \n\nEven though embedding models have been studied extensively, they are not widely used for POS tagging (see however [18]). For the unsupervised POS tagging task, HMMs have until recently dominated the field. Here we show that an embedding model substantially outperforms HMMs, and achieves the same level of performance as the best distributional-only model to date [15]. Models that use features, e.g. morphological, achieve higher tagging precision [11, 14]. Incorporating features into S-CODE can easily be done, either directly or in a two-step approach as in [14]; this is left for future work. \n\nOne of the widely-acknowledged challenges in applying HMMs to the unsupervised POS tagging problem is that these models do not afford a convenient vehicle for modeling an important sparseness property of natural languages, namely the fact that any given word type admits of only a small number of POS tags\u2014often only one (see in particular [7, 2, 4]). In contrast, the approach presented here maps each word type to a single point in the embedding space. Hence, it assigns a single tag to each word type, like a number of other recent approaches [15, 16, 17]. These approaches are incapable of disambiguating, i.e., of assigning different tags to the same word depending on context, as in \"I long to see a long movie.\" HMMs are, in principle, capable of doing so, but at the cost of over-parameterization. In view of the superior performance of S-CODE and of other type-level approaches, it appears that under-parameterization might be the better choice for this task. \n\nAnother difference between our model and HMMs previously applied to this problem is that our model is symmetric, thereby modeling right and left context distributions. 
In contrast, HMMs are asymmetric in that they typically model a left-to-right transition and would find a different solution if a right-to-left transition were modeled. We argue that using both distributions in a symmetric way better captures the important linguistic information. In the past, left and right distributions were extracted by factoring the bigram matrix and using the left and right eigenvectors. Such a linear method does not handle rare words well. Instead, we choose to learn the ratio p(x,y)/(p(x) p(y)). This approach allows words with similar contexts but different unigram frequencies to be embedded near each other. \n\nLike HMMs, CODE provides a model of the distribution of the data at hand. S-CODE departs slightly from this framework. Since it does not use the exact partition function in the stochastic gradient ascent procedure\u2014and was actually found to perform best when replacing Z, in the update rule, by a constant that is substantially larger than the true value of Z\u2014it only approximately converges to a local maximum of a likelihood function. In future work, and as a more radical deviation from the CODE model, one may then give up altogether modeling the distribution of X and Y, instead relying on a heuristically motivated objective function of sphere-constrained embeddings φ and ψ, to be maximized. Preliminary studies using a number of alternative functional forms for the regularization term yielded promising results. \n\nAlthough S-CODE and LDC [15] achieve essentially the same level of performance on taggings that induce 17, 45, or 50 labels (Table 1), S-CODE proves superior for the induction of very fine-grained taggings. Thus, we compared the performances of S-CODE and LDC on the task of inducing 300 labels. Under the MTO criterion, LDC achieved 80.9% (PTB45) and 87.9% (PTB17). S-CODE significantly outperformed it, with 83.5% (PTB45) and 89.8% (PTB17). 
The appeal of S-CODE lies not only in its strong performance on the unsupervised POS tagging problem, but also in its simplicity, its robustness, and its mathematical grounding. The mathematics underlying CODE, as developed in [3], are intuitive and relatively simple. Modeling the joint probability of word type co-occurrence through distances between Euclidean embeddings, without relying on discrete categories or states, is a novel and promising approach for POS tagging. The spherical constraint introduced here permits the approximation of the partition function by a constant, which is the key to the efficiency of the algorithm for large datasets. The stochastic-gradient procedure produces two competing forces with intuitive meaning, familiar from the literature on learning in generative models. While the accuracy and computational efficiency of S-CODE is matched by the recent LDC algorithm [15], S-CODE is more robust, showing very little change in performance over a wide range of implementation choices. We expect that this improved robustness will allow S-CODE to be easily and successfully applied to other large-scale tasks, both linguistic and non-linguistic. \n\nReferences \n\n[1] Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 59\u201366. \n\n[2] Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 344\u2013352. \n\n[3] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. 2007. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265\u20132295. \n\n[4] Sharon Goldwater and Tom Griffiths. 2007. 
A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744\u2013751. \n\n[5] Jo\u00e3o V. Gra\u00e7a, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs. parameter sparsity in latent variable models. In Neural Information Processing Systems Conference (NIPS). \n\n[6] Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 320\u2013327. \n\n[7] Mark Johnson. 2007. Why doesn\u2019t EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296\u2013305. \n\n[8] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330. \n\n[9] Marina Meil\u0103. 2003. Comparing clusterings by the variation of information. In Bernhard Sch\u00f6lkopf and Manfred K. Warmuth, editors, COLT 2003: The Sixteenth Annual Conference on Learning Theory, volume 2777 of Lecture Notes in Computer Science, pages 173\u2013187. Springer. \n\n[10] Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326. \n\n[11] Taylor Berg-Kirkpatrick, Alexandre Bouchard-C\u00f4t\u00e9, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 582\u2013590. \n\n[12] Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. 
In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL\u201905), pages 354\u2013362. \n\n[13] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319\u20132323. \n\n[14] Christos Christodoulopoulos, Sharon Goldwater and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 575\u2013584. \n\n[15] Michael Lamar, Yariv Maron and Elie Bienenstock. 2010. Latent-descriptor clustering for unsupervised POS induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 799\u2013809. \n\n[16] Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2010. Simple type-level unsupervised POS tagging. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 853\u2013861. \n\n[17] Michael Lamar, Yariv Maron, Mark Johnson, Elie Bienenstock. 2010. SVD and clustering for unsupervised POS tagging. In Proceedings of the ACL 2010 Conference Short Papers, pages 215\u2013219. \n\n[18] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML 2008), pages 160\u2013167. \n", "award": [], "sourceid": 1196, "authors": [{"given_name": "Yariv", "family_name": "Maron", "institution": null}, {"given_name": "Michael", "family_name": "Lamar", "institution": ""}, {"given_name": "Elie", "family_name": "Bienenstock", "institution": ""}]}