{"title": "Structured Embedding Models for Grouped Data", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 261, "abstract": "Word embeddings are a powerful approach for analyzing language, and exponential family embeddings (EFE) extend them to other types of data. Here we develop structured exponential family embeddings (S-EFE), a method for discovering embeddings that vary across related groups of data. We study how the word usage of U.S. Congressional speeches varies across states and party affiliation, how words are used differently across sections of the ArXiv, and how the co-purchase patterns of groceries can vary across seasons. Key to the success of our method is that the groups share statistical information. We develop two sharing strategies: hierarchical modeling and amortization. We demonstrate the benefits of this approach in empirical studies of speeches, abstracts, and shopping baskets. We show how SEFE enables group-specific interpretation of word usage, and outperforms EFE in predicting held-out data.", "full_text": "Structured Embedding Models for Grouped Data\n\nMaja Rudolph\nColumbia Univ.\n\nmaja@cs.columbia.edu\n\nFrancisco Ruiz\n\nUniv. of Cambridge\n\nColumbia Univ.\n\nSusan Athey\nStanford Univ.\n\nDavid Blei\n\nColumbia Univ.\n\nAbstract\n\nWord embeddings are a powerful approach for analyzing language, and exponential\nfamily embeddings (EFE) extend them to other types of data. Here we develop\nstructured exponential family embeddings (S-EFE), a method for discovering\nembeddings that vary across related groups of data. We study how the word\nusage of U.S. Congressional speeches varies across states and party af\ufb01liation,\nhow words are used differently across sections of the ArXiv, and how the co-\npurchase patterns of groceries can vary across seasons. Key to the success of our\nmethod is that the groups share statistical information. 
We develop two sharing strategies: hierarchical modeling and amortization. We demonstrate the benefits of this approach in empirical studies of speeches, abstracts, and shopping baskets. We show how S-EFE enables group-specific interpretation of word usage, and outperforms EFE in predicting held-out data.

1 Introduction

Word embeddings (Bengio et al., 2003; Mikolov et al., 2013d,c,a; Pennington et al., 2014; Levy and Goldberg, 2014; Arora et al., 2015) are unsupervised learning methods for capturing latent semantic structure in language. Word embedding methods analyze text data to learn distributed representations of the vocabulary that capture its co-occurrence statistics. These representations are useful for reasoning about word usage and meaning (Harris, 1954; Rumelhart et al., 1986). Word embeddings have also been extended to data beyond text (Barkan and Koenigstein, 2016; Rudolph et al., 2016), such as items in a grocery store or neurons in the brain. Exponential family embeddings (EFE) is a probabilistic perspective on embeddings that encompasses many existing methods and opens the door to bringing expressive probabilistic modeling (Bishop, 2006; Murphy, 2012) to the problem of learning distributed representations.
We develop structured exponential family embeddings (S-EFE), an extension of EFE for studying how embeddings can vary across groups of related data. We will study several examples: in U.S. Congressional speeches, word usage can vary across states or party affiliations; in scientific literature, the usage patterns of technical terms can vary across fields; in supermarket shopping data, co-purchase patterns of items can vary across seasons of the year. We will see that S-EFE discovers a per-group embedding representation of objects.
While the naive approach of fitting an individual embedding model for each group would typically suffer from lack of data, especially in groups for which fewer observations are available, we develop two methods that can share information across groups.
Figure 1a illustrates the kind of variation that we can capture. We fit an S-EFE to ArXiv abstracts grouped into different sections, such as computer science (cs), quantitative finance (q-fin), and nonlinear sciences (nlin). S-EFE results in a per-section embedding of each term in the vocabulary. Using the fitted embeddings, we illustrate similar words to the word INTELLIGENCE. We can see that how INTELLIGENCE is used varies by field: in computer science the most similar words include ARTIFICIAL and AI; in finance, similar words include ABILITIES and CONSCIOUSNESS.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 appears here. Panel (b) annotates the two constructions of the group-specific embeddings: hierarchical, ρ_v^(s) ∼ N(ρ_v^(0), σ_ρ² I), and amortized, ρ_v^(s) = f_s(ρ_v^(0)).]

Figure 1: (a) INTELLIGENCE is used differently across the ArXiv sections. Words with the closest embedding to the query are listed for 5 sections. (The embeddings were obtained by fitting an amortized S-EFE.) The method automatically orders the sections along the horizontal axis by their similarity in the usage of INTELLIGENCE. See Section 3 for additional details. (b) Graphical representation of S-EFE for data in S categories. The embedding vectors ρ_v^(s) are specific to each group, and the context vectors α_v are shared across all categories.

In more detail, embedding methods posit two representation vectors for each term in the vocabulary: an embedding vector and a context vector.
(We use the language of text for concreteness; as we mentioned, EFE extend to other types of data.) The idea is that the conditional probability of each observed word depends on the interaction between the embedding vector and the context vectors of the surrounding words. In S-EFE, we posit a separate set of embedding vectors for each group but a shared set of context vectors; this ensures that the embedding vectors are in the same space.
We propose two methods to share statistical strength among the embedding vectors. The first approach is based on hierarchical modeling (Gelman et al., 2003), which assumes that the group-specific embedding representations are tied through a global embedding. The second approach is based on amortization (Dayan et al., 1995; Gershman and Goodman, 2014), which considers that the individual embeddings are the output of a deterministic function of a global embedding representation. We use stochastic optimization to fit large data sets.
Our work relates closely to two threads of research in the embedding literature. One is embedding methods that study how language evolves over time (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016; Rudolph and Blei, 2017; Bamler and Mandt, 2017; Yao et al., 2017). Time can be thought of as a type of "group", though with evolutionary structure that we do not consider. The second thread is multilingual embeddings (Klementiev et al., 2012; Mikolov et al., 2013b; Ammar et al., 2016; Zou et al., 2013); our approach is different in that most words appear in all groups and we are interested in the variations of the embeddings across those groups.
Our contributions are thus as follows. We introduce the S-EFE model, extending EFE to grouped data. We present two techniques to share statistical strength among the embedding vectors, one based on hierarchical modeling and one based on amortization.
We carry out a thorough experimental study on two text databases, ArXiv papers by section and U.S. Congressional speeches by home state and political party. Using Poisson embeddings, we study market basket data from a large grocery store, grouped by season. On all three data sets, S-EFE outperforms EFE in terms of held-out log-likelihood. Qualitatively, we demonstrate how S-EFE discovers which words are used most differently across U.S. states and political parties, and show how word usage changes in different ArXiv disciplines.

2 Model Description

In this section, we develop structured exponential family embeddings (S-EFE), a model that builds on exponential family embeddings (EFE) (Rudolph et al., 2016) to capture semantic variations across groups of data. In embedding models, we represent each object (e.g., a word in text, or an item in shopping data) using two sets of vectors, an embedding vector and a context vector. In this paper, we are interested in how the embeddings vary across groups of data, and for each object we want to learn a separate embedding vector for each group. Having a separate embedding for each group allows us to study how the usage of a word like INTELLIGENCE varies across categories of the ArXiv, or which words are used most differently by U.S. Senators depending on which state they are from and whether they are Democrats or Republicans.
The S-EFE model extends EFE to grouped data, by having the embedding vectors be specific for each group, while sharing the context vectors across all groups.
We review the EFE model in Section 2.1. We then formalize the idea of sharing the context vectors in Section 2.2, where we present two approaches to build a hierarchical structure over the group-specific embeddings.

2.1 Background: Exponential Family Embeddings

In exponential family embeddings, we have a collection of objects, and our goal is to learn a vector representation of these objects based on their co-occurrence patterns.
Let us consider a dataset represented as a (typically sparse) matrix X, where columns are datapoints and rows are objects. For example, in text, each column corresponds to a location in the text, and each entry x_vi is a binary variable that indicates whether word v appears at location i.
In EFE, we represent each object v with two sets of vectors, embedding vectors ρ_v[i] and context vectors α_v[i], and we posit a probability distribution of data entries x_vi in which these vectors interact. The definition of the EFE model requires three ingredients: a context, a conditional exponential family, and a parameter sharing structure. We next describe these three components.
Exponential family embeddings learn the vector representation of objects based on the conditional probability of each observation, conditioned on the observations in its context. The context c_vi gives the indices of the observations that appear in the conditional probability distribution of x_vi. The definition of the context varies across applications. In text, it corresponds to the set of words in a fixed-size window centered at location i.
Given the context c_vi and the corresponding observations x_{c_vi} indexed by c_vi, the distribution for x_vi is in the exponential family,

x_vi | x_{c_vi} ∼ ExpFam(t(x_vi), η_v(x_{c_vi})),   (1)

with sufficient statistics t(x_vi) and natural parameter η_v(x_{c_vi}). The parameter vectors interact in the conditional probability distributions of each observation x_vi as follows.
The embedding vectors ρ_v[i] and the context vectors α_v[i] are combined to form the natural parameter,

η_v(x_{c_vi}) = g( ρ_v[i]^⊤ Σ_{(v′,i′) ∈ c_vi} α_{v′}[i′] x_{v′i′} ),   (2)

where g(·) is the link function. Exponential family embeddings can be understood as a bank of generalized linear models (GLMs). The context vectors are combined to give the covariates, and the "regression coefficients" are the embedding vectors. In Eq. 2, the link function g(·) plays the same role as in GLMs and is a modeling choice. We use the identity link function.
The third ingredient of the EFE model is the parameter sharing structure, which indicates how the embedding vectors are shared across observations. In the standard EFE model, we use ρ_v[i] ≡ ρ_v and α_v[i] ≡ α_v for all columns of X. That is, each unique object v has a shared representation across all instances.
The objective function. In EFE, we maximize the objective function, which is given by the sum of the log-conditional likelihoods in Eq. 1. In addition, we add an ℓ2-regularization term (we use the notation of the log Gaussian pdf) over the embedding and context vectors, yielding

L = log p(α) + log p(ρ) + Σ_{v,i} log p(x_vi | x_{c_vi}; α, ρ).   (3)

Note that maximizing the regularized conditional likelihood is not equivalent to maximum a posteriori. Rather, it is similar to maximization of the pseudo-likelihood in conditionally specified models (Arnold et al., 2001; Rudolph et al., 2016).

2.2 Structured Exponential Family Embeddings

Here, we describe the S-EFE model for grouped data.
In text, some examples of grouped data are Congressional speeches grouped into political parties or scientific documents grouped by discipline. Our goal is to learn group-specific embeddings from data partitioned into S groups, i.e., each instance i is associated with a group s_i ∈ {1, ..., S}. The S-EFE model extends EFE to learn a separate set of embedding vectors for each group.
To build the S-EFE model, we impose a particular parameter sharing structure over the set of embedding and context vectors. We posit a structured model in which the context vectors are shared across groups, i.e., α_v[i] ≡ α_v (as in the standard EFE model), but the embedding vectors are only shared at the group level, i.e., for an observation i belonging to group s_i, ρ_v[i] ≡ ρ_v^(s_i). Here, ρ_v^(s) denotes the embedding vector corresponding to group s. We show a graphical representation of the S-EFE in Figure 1b.
Sharing the context vectors α_v has two advantages. First, the shared structure reduces the number of parameters, while the resulting S-EFE model is still flexible enough to capture how differently words are used across different groups, as ρ_v^(s) is allowed to vary.[1] Second, it has the important effect of uniting all embedding parameters in the same space, as the group-specific vectors ρ_v^(s) need to agree with the components of α_v. While one could learn a separate embedding model for each group, as has been done for text grouped into time slices (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016), this approach would require ad-hoc postprocessing steps to align the embeddings.[2]
When there are S groups, the S-EFE model has S times as many embedding vectors as the standard embedding model. This may complicate inferences about the group-specific vectors, especially for groups with less data.
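To make the sharing structure concrete, here is a minimal NumPy sketch (not the authors' TensorFlow implementation; sizes and helper names are hypothetical) of how a group-specific embedding vector interacts with the shared context vectors to form the natural parameter of Eq. 2 with the identity link:

```python
import numpy as np

# Toy sizes (hypothetical, not the paper's settings): L objects, K dims, S groups.
L, K, S = 1000, 25, 3
rng = np.random.default_rng(0)

alpha = rng.normal(scale=0.1, size=(L, K))   # context vectors, shared across groups
rho = rng.normal(scale=0.1, size=(S, L, K))  # per-group embedding vectors rho[s, v]

def natural_parameter(v, s, context_idx, context_vals):
    """Natural parameter for x_{vi} in group s: the group-specific embedding
    interacts with the sum of the shared context vectors, weighted by the
    observed context values (identity link)."""
    context_sum = (context_vals[:, None] * alpha[context_idx]).sum(axis=0)
    return float(rho[s, v] @ context_sum)

# Word v = 7 observed in group 1, with binary context words 3, 12, 40.
eta = natural_parameter(7, 1, np.array([3, 12, 40]), np.ones(3))
```

Because `alpha` is shared, the per-group vectors `rho[s]` all live in the same space, which is what makes them comparable across groups.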
Additionally, an object v may appear with very low frequency in a particular group. Thus, the naive approach of building the S-EFE model without additional structure may be detrimental to the quality of the embeddings, especially for small-sized groups. To address this problem, we propose two different methods to tie the individual ρ_v^(s) together, sharing statistical strength among them. The first approach consists in a hierarchical embedding structure. The second approach is based on amortization. In both methods, we introduce a set of global embedding vectors ρ_v^(0), and impose a particular structure to generate ρ_v^(s) from ρ_v^(0).
Hierarchical embedding structure. Here, we impose a hierarchical structure that allows sharing statistical strength among the per-group variables. For that, we assume that each ρ_v^(s) ∼ N(ρ_v^(0), σ_ρ² I), where σ_ρ² is a fixed hyperparameter. Thus, we replace the EFE objective function in Eq. 3 with

L_hier = log p(α) + log p(ρ^(0)) + Σ_s log p(ρ^(s) | ρ^(0)) + Σ_{v,i} log p(x_vi | x_{c_vi}; α, ρ),   (4)

where the ℓ2-regularization term now applies only to α_v and the global vectors ρ_v^(0).
Fitting the hierarchical model involves maximizing Eq. 4 with respect to α_v, ρ_v^(0), and ρ_v^(s). We note that we have not reduced the number of parameters to be inferred; rather, we tie them together through a common prior distribution. We use stochastic gradient ascent to maximize Eq. 4.
Amortization. The idea of amortization has been applied in the literature to develop amortized inference algorithms (Dayan et al., 1995; Gershman and Goodman, 2014).
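Before turning to amortization, the hierarchical tying term of Eq. 4 can be sketched as follows (a minimal NumPy illustration with hypothetical sizes, not the authors' code):

```python
import numpy as np

# Hypothetical sizes; sigma_rho is the fixed hyperparameter of the hierarchy.
L, K, S, sigma_rho = 500, 10, 4, 1.0
rng = np.random.default_rng(1)
rho0 = rng.normal(size=(L, K))                      # global embeddings rho^(0)
rho = rho0 + rng.normal(scale=0.1, size=(S, L, K))  # per-group embeddings rho^(s)

def hierarchical_log_prior(rho, rho0, sigma):
    """sum_s log p(rho^(s) | rho^(0)) up to an additive constant: a Gaussian
    penalty that shrinks each group's embeddings toward the global ones."""
    return -0.5 * np.sum((rho - rho0) ** 2) / sigma**2

reg = hierarchical_log_prior(rho, rho0, sigma_rho)
```

Adding this term to the sum of log-conditional likelihoods and ascending its gradient jointly in `rho`, `rho0`, and the context vectors is exactly the structure of the hierarchical objective: groups with little data are pulled toward the global embedding, while data-rich groups can move away from it.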
The main insight behind amortization is to reuse inferences about past experiences when presented with a new task, leveraging the accumulated knowledge to quickly solve the new problem. Here, we use amortization to control the number of parameters of the S-EFE model. In particular, we set the per-group embeddings ρ_v^(s) to be the output of a deterministic function of the global embedding vectors, ρ_v^(s) = f_s(ρ_v^(0)). We use a different function f_s(·) for each group s, and we parameterize them using neural networks, similarly to other works on amortized inference (Korattikara et al., 2015; Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014). Unlike standard uses of amortized inference, in S-EFE the input to the functions f_s(·) is unobserved and must be estimated together with the parameters of the functions f_s(·).
Depending on the architecture of the neural networks, the amortization can significantly reduce the number of parameters in the model (as compared to the non-amortized model), while still having the flexibility to model different embedding vectors for each group. The number of parameters in the S-EFE model is KL(S + 1), where S is the number of groups, K is the dimensionality of the embedding vectors, and L is the number of objects (e.g., the vocabulary size). With amortization, we reduce the number of parameters to 2KL + SP, where P is the number of parameters of the neural network.

[1] Alternatively, we could share the embedding vectors ρ_v and have group-specific context vectors α_v^(s). We did not explore that avenue and leave it for future work.
[2] Another potential advantage of the proposed parameter sharing structure is that, when the context vectors are held fixed, the resulting objective function is convex, by the convexity properties of exponential families (Wainwright and Jordan, 2008).
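As a sketch of this construction and of the parameter counting, here is a minimal NumPy illustration (the weight values are random placeholders; the sizes are borrowed from the ArXiv experiment in Section 3 for concreteness):

```python
import numpy as np

# Vocabulary L, embedding dim K, hidden units H, groups S (ArXiv settings).
L, K, H, S = 15000, 100, 25, 19
rng = np.random.default_rng(2)
rho0 = rng.normal(scale=0.1, size=(L, K))    # global embeddings rho^(0)
W1 = rng.normal(scale=0.1, size=(S, H, K))   # per-group weights phi^(s)
W2 = rng.normal(scale=0.1, size=(S, K, H))

def f_ffnet(rho0_v, s):
    # feed-forward amortization: W2 tanh(W1 rho0_v)
    return W2[s] @ np.tanh(W1[s] @ rho0_v)

def f_resnet(rho0_v, s):
    # residual amortization: add the global vector back in
    return rho0_v + W2[s] @ np.tanh(W1[s] @ rho0_v)

# Parameter counts quoted in the text: P = 2KH per group.
P = 2 * K * H
n_s_efe = K * L * (S + 1)        # separate embeddings per group, plus contexts
n_amortized = 2 * K * L + S * P  # global embeddings, contexts, and S networks
```

With these sizes, `n_amortized` is roughly a tenth of `n_s_efe`, which is the reduction the text refers to.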
Since typically L ≫ P, this corresponds to a significant reduction in the number of parameters, even when P scales linearly with K.
In the amortized S-EFE model, we need to introduce a new set of parameters φ^(s) ∈ R^P for each group s, corresponding to the neural network parameters. Given these, the group-specific embedding vectors ρ_v^(s) are obtained as

ρ_v^(s) = f_s(ρ_v^(0)) = f(ρ_v^(0); φ^(s)).   (5)

We compare two architectures for the function f_s(·): fully connected feed-forward neural networks and residual networks (He et al., 2016). For both, we consider one hidden layer with H units. Hence, the network parameters φ^(s) are two weight matrices,

φ^(s) = {W_1^(s) ∈ R^{H×K}, W_2^(s) ∈ R^{K×H}},   (6)

i.e., P = 2KH parameters. The neural network takes as input the global embedding vector ρ_v^(0), and it outputs the group-specific embedding vectors ρ_v^(s). The mathematical expression for ρ_v^(s) for a feed-forward neural network and a residual network is respectively given by

ρ_v^(s) = f_ffnet(ρ_v^(0); φ^(s)) = W_2^(s) tanh(W_1^(s) ρ_v^(0)),   (7)

ρ_v^(s) = f_resnet(ρ_v^(0); φ^(s)) = ρ_v^(0) + W_2^(s) tanh(W_1^(s) ρ_v^(0)),   (8)

where we have considered the hyperbolic tangent nonlinearity. The main difference between both network architectures is that the residual network focuses on modeling how the group-specific embedding vectors ρ_v^(s)
That is, if all weights were set to 0, the\nfeed-forward network would output 0, while the residual network would output the global vector \u03c1(0)\nv\nfor all groups.\nThe objective function under amortization is given by\n\nv differ from the global vectors \u03c1(0)\n\nLamortiz = log p(\u03b1) + log p(\u03c1(0)) +\n\nlog p\n\nxvi\n\n.\n\n(9)\n\n(cid:88)\n\n(cid:16)\n\n(cid:17)\n(cid:12)(cid:12) xcvi; \u03b1, \u03c1(0), \u03c6\n\nv,i\n\nWe maximize this objective with respect to \u03b1v, \u03c1(0)\nv , and \u03c6(s) using stochastic gradient ascent. We\nimplement the hierarchical and amortized S-EFE models in TensorFlow (Abadi et al., 2015), which\nallows us to leverage automatic differentiation.3\nExample: structured Bernoulli embeddings for grouped text data. Here, we consider a set\nof documents broken down into groups, such as political af\ufb01liations or scienti\ufb01c disciplines. We\ncan represent the data as a binary matrix X and a set of group indicators si. Since only one word\ncan appear in a certain position i, the matrix X contains one non-zero element per column. In\nembedding models, we ignore this one-hot constraint for computational ef\ufb01ciency, and consider\nthat the observations are generated following a set of conditional Bernoulli distributions (Mikolov\net al., 2013c; Rudolph et al., 2016). Given that most of the entries in X are zero, embedding models\ntypically downweigh the contribution of the zeros to the objective function. Mikolov et al. (2013c)\nuse negative sampling, which consists in randomly choosing a subset of the zero observations. 
This corresponds to a biased estimate of the gradient in a Bernoulli exponential family embedding model (Rudolph et al., 2016).
The context c_vi is given at each position i by the set of surrounding words in the document, according to a fixed-size window.

[3] Code is available at https://github.com/mariru/structured_embeddings

Table 1: Group structure and size of the three corpora analyzed in Section 3.

                  data    embedding of  groups  grouped by        size
ArXiv abstracts   text    15k terms     19      subject areas     15M words
Senate speeches   text    15k terms     83      home state/party  20M words
Shopping data     counts  5.5k items    12      months            0.5M trips

Example: structured Poisson embeddings for grouped shopping data. EFE and S-EFE extend to applications beyond text, and we use S-EFE to model supermarket purchases broken down by month. For each market basket i, we have access to the month s_i in which that shopping trip happened. Now, the rows of the data matrix X index items, while columns index shopping trips. Each element x_vi denotes the number of units of item v purchased at trip i. Unlike text, each column of X may contain more than one non-zero element. The context c_vi corresponds to the set of items purchased in trip i, excluding v.
In this case, we use the Poisson conditional distribution, which is more appropriate for count data. In Poisson S-EFE, we also downweigh the contribution of the zeros in the objective function, which provides better results because it allows the inference to focus on the positive signal of the actual purchases (Rudolph et al., 2016; Mikolov et al., 2013c).

3 Empirical Study

In this section, we describe the experimental study. We fit the S-EFE model on three datasets and compare it against the EFE (Rudolph et al., 2016).
Our quantitative results show that sharing the context vectors provides better results, and that amortization and hierarchical structure give further improvements.
Data. We apply the S-EFE on three datasets: ArXiv papers, U.S. Senate speeches, and purchases on supermarket grocery shopping data. We describe these datasets below, and we provide a summary of the datasets in Table 1.
ArXiv papers: This dataset contains the abstracts of papers published on the ArXiv under 19 different tags between April 2007 and June 2015. We treat each tag as a group and fit S-EFE with the goal of uncovering which words have the strongest shift in usage. We split the abstracts into training, validation, and test sets, with proportions of 80%, 10%, and 10%, respectively.
Senate speeches: This dataset contains U.S. Senate speeches from 1994 to mid 2009. In contrast to the ArXiv collection, it is a transcript of spoken language. We group the data by the state of origin of the speaker and his or her party affiliation. Only affiliations with the Republican and Democratic Party are considered. As a result, there are 83 groups (Republicans from Alabama, Democrats from Alabama, Republicans from Arkansas, etc.). Some of the state/party combinations are not available in the data, as some of the 50 states have only had Senators with the same party affiliation. We split the speeches into training (80%), validation (10%), and testing (10%).
Grocery shopping data: This dataset contains the purchases of 3,206 customers. The data covers a period of 97 weeks. After removing low-frequency items, the data contains 5,590 unique items at the UPC (Universal Product Code) level. We split the data into training, test, and validation sets, with proportions of 90%, 5%, and 5%, respectively.
The training data contains 515,867 shopping trips and 5,370,623 purchases in total.
For the text corpora, we fix the vocabulary to the 15k most frequent terms and remove all words that are not in the vocabulary. Following Mikolov et al. (2013c), we additionally remove each word with probability 1 − √(10⁻⁵/f_v), where f_v is the word frequency. This downsamples especially the frequent words and speeds up training. (Sizes reported in Table 1 are the number of words remaining after preprocessing.)
Models. Our goal is to fit the S-EFE model on these datasets. For the text data, we use the Bernoulli distribution as the conditional exponential family, while for the shopping data we use the Poisson distribution, which is more appropriate for count data.
On each dataset, we compare four approaches based on S-EFE with two EFE (Rudolph et al., 2016) baselines. All are fit using stochastic gradient descent (SGD) (Robbins and Monro, 1951). In particular, we compare the following methods:

• A global EFE model, which cannot capture group structure.
• Separate EFE models, fitted independently on each group.
• (this paper) S-EFE without hierarchical structure or amortization.
• (this paper) S-EFE with hierarchical group structure.
• (this paper) S-EFE, amortized with a feed-forward neural network (Eq. 7).
• (this paper) S-EFE, amortized using a residual network (Eq. 8).

Experimental setup and hyperparameters. For text we set the dimension of the embeddings to K = 100, the number of hidden units to H = 25, and we experiment with two context sizes, 2 and 8.[4] In the shopping data, we use K = 50 and H = 20, and we randomly truncate the context of baskets larger than 20 to reduce their size to 20. For both methods, we use 20 negative samples.
For all methods, we subsample minibatches of data in the same manner. Each minibatch contains subsampled observations from all groups, and each group is subsampled proportionally to its size. For text, the words subsampled from within a group are consecutive, and for shopping data the observations are sampled at the shopping trip level. This sampling scheme reduces the bias from imbalanced group sizes. For text, we use a minibatch size of N/10000, where N is the size of the corpus, and we run 5 passes over the data; for the shopping data we use N/100 and run 50 passes. We use the default learning rate setting of TensorFlow for Adam[5] (Kingma and Ba, 2015).
We use the standard initialization schemes for the neural network parameters. The weights are drawn from a uniform distribution bounded at ±√6/√(K + H) (Glorot and Bengio, 2010). For the embeddings, we try 3 initialization schemes and choose the best one based on validation error. In particular, these schemes are: (1) all embeddings are drawn from the Gaussian prior implied by the regularizer; (2) the embeddings are initialized from a global embedding; (3) the context vectors are initialized from a global embedding and held constant, while the embedding vectors are drawn randomly from the prior. Finally, for each method we choose the regularization variance from the set {100, 10, 1, 0.1}, also based on validation error.
Runtime. We implemented all methods in TensorFlow. On the Senate speeches, the runtime of S-EFE is 4.3 times slower than the runtime of global EFE, hierarchical S-EFE is 4.6 times slower than the runtime of global EFE, and amortized S-EFE is 3.3 times slower than the runtime of global EFE. (The Senate speeches have the most groups and hence the largest difference in runtime between methods.)
Evaluation metric. We evaluate the fits by held-out pseudo (log-)likelihood.
For each model, we compute the test pseudo log-likelihood, according to the exponential family distribution used (Bernoulli or Poisson). For each test entry, a better model will assign higher probability to the observed word or item, and lower probability to the negative samples. This is a fair metric because the competing methods all produce conditional likelihoods from the same exponential family.[6] To make results comparable, we train and evaluate all methods with the same number of negative samples (20). The reported held-out likelihoods give equal weight to the positive and negative samples.
Quantitative results. We show the test pseudo log-likelihood of all methods in Table 2 and report that our method outperforms the baseline in all experiments. We find that S-EFE with either hierarchical structure or amortization outperforms the competing methods based on standard EFE. This is because the global EFE ignores per-group variations, whereas the separate EFE cannot share information across groups. The results of the global EFE baseline are better than fitting separate EFE (the other baseline), but unlike the other methods the global EFE cannot be used for the exploratory analysis of variations across groups. Our results show that using a hierarchical S-EFE is always better than using the simple S-EFE model or fitting a separate EFE on each group. The hierarchical structure helps, especially for the Senate speeches, where the data is divided into many groups.

[4] To save space we report results for context size 8 only. Context size 2 shows the same relative performance.
[5] Adam needs to track a history of the gradients for each parameter that is being optimized.
One advantage of reducing the number of parameters with amortization is that it results in a reduced computational overhead for Adam (as well as for other adaptive stepsize schedules).

[6] Since we hold out chunks of consecutive words, usually both a word and its context are held out. For all methods we have to use the words in the context to compute the conditional likelihoods.

Table 2: Test log-likelihood on the three considered datasets. S-EFE consistently achieves the highest held-out likelihood. The competing methods are the global EFE, which cannot capture group variations, and the separate EFE, which cannot share information across groups.

                                       ArXiv papers     Senate speeches   Shopping data
Global EFE (Rudolph et al., 2016)      −2.176 ± 0.005   −2.239 ± 0.002    −0.772 ± 0.000
Separate EFE (Rudolph et al., 2016)    −2.500 ± 0.012   −2.915 ± 0.004    −0.807 ± 0.002
S-EFE                                  −2.287 ± 0.007   −2.645 ± 0.002    −0.770 ± 0.001
S-EFE (hierarchical)                   −2.170 ± 0.003   −2.217 ± 0.001    −0.767 ± 0.000
S-EFE (amortiz+feedf)                  −2.153 ± 0.004   −2.484 ± 0.002    −0.774 ± 0.000
S-EFE (amortiz+resnet)                 −2.120 ± 0.004   −2.249 ± 0.002    −0.762 ± 0.000

Among the amortized S-EFE models we developed, at least amortization with residual networks outperforms the base S-EFE. The advantage of residual networks over feed-forward neural networks is consistent with the results reported by He et al. (2016).
While both hierarchical S-EFE and amortized S-EFE share information about the embedding of a particular word across groups (through the global embedding ρ_v^(0)), amortization additionally ties the embeddings of all words within a group (through learning the neural network of that group).
We hypothesize that for the Senate speeches, which are split into many groups, this is too strong a modeling constraint, while it helps in all other experiments.
Structured embeddings reveal a spectrum of word usage. We have motivated S-EFE with the example that the usage of INTELLIGENCE varies by ArXiv category (Figure 1a). We now explain how, for each term, the per-group embeddings place the groups on a spectrum. For a specific term v, we take its embedding vectors {ρ_v^(s)} for all groups s and project them onto a one-dimensional space using the first component of principal component analysis (PCA). This gives a one-dimensional summary of how close the embeddings of v are across groups. Such a comparison is possible because S-EFE shares the context vectors, which grounds the embedding vectors in a joint space.
The spectrum for the word INTELLIGENCE along its first principal component is the horizontal axis in Figure 1a. The dots are the projections of the group-specific embeddings for that word. (The embeddings come from a fitted S-EFE with feed-forward amortization.) We can see that, in an unsupervised manner, the method has placed the groups related to physics at one end of the spectrum, while computer science, statistics, and math are at the other end.
To give additional intuition about the usage of INTELLIGENCE at different locations on the spectrum, we have listed the 8 most similar words for the groups computer science (cs), quantitative finance (q-fin), math (math), statistics (stat), and nonlinear sciences (nlin). Word similarities are computed using cosine distance in the embedding space. Even though their embeddings are relatively close to each other on the spectrum, the model has the flexibility to capture high variability in the lists of similar words.
Exploring group variations with structured embeddings.
The result of the S-EFE also allows us to investigate which words have the highest deviation from their average usage for each group. For example, in the Congressional speeches there are many terms that we do not expect the Senators to use differently (e.g., most stopwords). We might, however, want to ask a question like "which words do Republicans from Texas use most differently from other Senators?" By suggesting an answer, our method can guide an exploratory data analysis. For each group s (state/party combination), we compute the top 3 words in argsort_v ( || ρ_v^(s) - (1/S) Σ_{t=1}^{S} ρ_v^(t) || ) from within the top 1k words.
Table 3 shows a summary of our findings (the full table is in the Appendix). According to the S-EFE (with residual network amortization), Republican Senators from Texas use BORDER and the phrase OUR COUNTRY in different contexts than other Senators.
Some of these variations are probably influenced by term frequency, as we expect Democrats from Washington to talk about WASHINGTON more frequently than Senators from other states. But we argue that our method provides more insight than a frequency-based analysis, as it is also sensitive to the context in which a word appears. For example, WASHINGTON might in some groups be used more often in

              TEXAS          FLORIDA        IOWA           WASHINGTON
Republicans   border         medicaid       agriculture
              our country    prescription   farmers
              iraq           medicare       food
Democrats                    bankruptcy     prescription   washington
                             water          drug           energy
                             waste          drugs          oil

Table 3: List of the three most different words for different groups for the Congressional speeches. S-EFE uncovers which words are used most differently by Republican Senators (red) and Democratic Senators (blue) from different states.
The complete table is in the Appendix.

the context of PRESIDENT and GEORGE, while in others it might appear in the context of DC and CAPITAL, or it may refer to the state.

4 Discussion

We have presented several structured extensions of EFE for modeling grouped data. The hierarchical S-EFE captures variations in word usage across groups while sharing statistical strength between them through a hierarchical prior. Amortization is an effective way to reduce the number of parameters in the hierarchical model. The amortized S-EFE leverages the expressive power of neural networks to reduce the number of parameters, while still having the flexibility to capture variations between the embeddings of each group. Below are practical guidelines for choosing a S-EFE.
How can I fit embeddings that vary across groups of data? To capture variations across groups, never fit a separate embedding model for each group. We recommend at least sharing the context vectors, as all the S-EFE models do. This ensures that the latent dimensions of the embeddings are aligned across groups. Beyond sharing context vectors, we also recommend sharing statistical strength between the embedding vectors. In this paper we have presented two ways to do so: hierarchical modeling and amortization.
Should I use a hierarchical prior or amortization? The answer depends on how many groups the data contain. In our experiments, the hierarchical S-EFE works better when there are many groups; with fewer groups, the amortized S-EFE works better.
The advantage of the amortized S-EFE is that it has fewer parameters than the hierarchical model, while still having the flexibility to capture across-group variations. The global embeddings in an amortized S-EFE play two roles: they capture the semantic similarities of the words, and they also serve as the input to the amortization networks.
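The parameter savings can be made concrete with a back-of-the-envelope count (all numbers below are hypothetical, not the settings used in the experiments):

```python
# Rough parameter counts for the two sharing strategies.
# V, K, S, and H are hypothetical; they are not the paper's settings.
V = 10_000   # vocabulary size
K = 100      # embedding dimension
S = 50       # number of groups
H = 100      # hidden width of each (assumed one-hidden-layer) amortization net

shared = V * K + V * K                      # global embeddings + context vectors
hierarchical = shared + S * V * K           # plus a full embedding matrix per group
amortized = shared + S * (K * H + H * K)    # plus one small network per group

print(hierarchical, amortized)              # -> 52000000 3000000
assert amortized < hierarchical
```

Under these illustrative numbers, the per-group networks cost far less than per-group embedding matrices, which is why amortization scales more gracefully in the number of parameters.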
Thus, the global embeddings of words with similar patterns of across-group variation need to lie in regions of the embedding space that lead to similar modifications by the amortization network. As the number of groups in the data increases, these two roles become harder to balance. We hypothesize that this is why the amortized S-EFE performs better when there are fewer groups.
Should I use feed-forward or residual networks? To amortize a S-EFE, we recommend residual networks. They perform better than feed-forward networks in all of our experiments. While a feed-forward network has to output the entire meaning of a word in the group-specific embedding, a residual network only needs the capacity to model how the group-specific embedding differs from the global embedding.

Acknowledgements

We thank Elliott Ash and Suresh Naidu for the helpful discussions and for sharing the Senate speeches. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA PPAML FA8750-14-2-0009, DARPA SIMPLEX N66001-15-C-4032, the Alfred P. Sloan Foundation, and the John Simon Guggenheim Foundation. Francisco J. R. Ruiz is supported by the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., and Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Arnold, B. C., Castillo, E., Sarabia, J. M., et al. (2001). Conditionally specified distributions: an introduction (with comments and a rejoinder by the authors). Statistical Science, 16(3):249–274.

Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. (2015).
Rand-walk: A latent variable model approach to word embeddings. arXiv preprint arXiv:1502.03520.

Bamler, R. and Mandt, S. (2017). Dynamic word embeddings. In International Conference on Machine Learning.

Barkan, O. and Koenigstein, N. (2016). Item2vec: Neural item embedding for collaborative filtering. In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, pages 1–6. IEEE.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, Heidelberg.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC.

Gershman, S. J. and Goodman, N. D. (2014). Amortized inference in probabilistic reasoning. In Proceedings of the Thirty-Sixth Annual Conference of the Cognitive Science Society.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256.

Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3):146–162.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.

Kim, Y., Chiu, Y.-I., Hanaki, K., Hegde, D., and Petrov, S. (2014). Temporal analysis of language through neural language models. arXiv preprint arXiv:1405.3515.

Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization.
In International Conference on Learning Representations.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words.

Korattikara, A., Rathod, V., Murphy, K. P., and Welling, M. (2015). Bayesian dark knowledge. In Advances in Neural Information Processing Systems.

Kulkarni, V., Al-Rfou, R., Perozzi, B., and Skiena, S. (2015). Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635. ACM.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Neural Information Processing Systems, pages 2177–2185.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. ICLR Workshop Proceedings. arXiv:1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, pages 3111–3119.

Mikolov, T., Yih, W.-T., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In International Conference on Machine Learning.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation.
In Conference on Empirical Methods on Natural Language Processing, volume 14, pages 1532–1543.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Rudolph, M. and Blei, D. (2017). Dynamic Bernoulli embeddings for language evolution. arXiv preprint arXiv:1703.08052.

Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. (2016). Exponential family embeddings. In Advances in Neural Information Processing Systems, pages 478–486.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Yao, Z., Sun, Y., Ding, W., Rao, N., and Xiong, H. (2017). Discovery of evolving semantics through dynamic word embedding learning. arXiv preprint arXiv:1703.00607.

Zou, W. Y., Socher, R., Cer, D. M., and Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In EMNLP, pages 1393–1398.