{"title": "Topic-Partitioned Multinetwork Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 2807, "page_last": 2815, "abstract": "We introduce a joint model of network content and context designed for exploratory analysis of email networks via visualization of topic-specific communication patterns. Our model is an admixture model for text and network attributes which uses multinomial distributions over words as mixture components for explaining text and latent Euclidean positions of actors as mixture components for explaining network attributes. We validate the appropriateness of our model by achieving state-of-the-art performance on a link prediction task and by achieving semantic coherence equivalent to that of latent Dirichlet allocation. We demonstrate the capability of our model for descriptive, explanatory, and exploratory analysis by investigating the inferred topic-specific communication patterns of a new government email dataset, the New Hanover County email corpus.", "full_text": "Topic-Partitioned Multinetwork Embeddings\n\nPeter Krafft\u2217\n\nCSAIL\nMIT\n\npkrafft@mit.edu\n\nJuston Moore\u2020, Bruce Desmarais\u2021, Hanna Wallach\u2020\n\n\u2020Department of Computer Science, \u2021Department of Political Science\n\n\u2020{jmoore, wallach}@cs.umass.edu\n\nUniversity of Massachusetts Amherst\n\u2021desmarais@polsci.umass.edu\n\nAbstract\n\nWe introduce a new Bayesian admixture model intended for exploratory analy-\nsis of communication networks\u2014speci\ufb01cally, the discovery and visualization of\ntopic-speci\ufb01c subnetworks in email data sets. 
Our model produces principled visualizations of email networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model\u2019s ability to discover and visualize topic-specific communication patterns using a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models.\n\n1 Introduction\n\nThe structures of organizational communication networks are critical to collaborative problem solving [1]. Although it is seldom possible for researchers to directly observe complete organizational communication networks, email data sets provide one means by which they can at least partially observe and reason about them. As a result\u2014and especially in light of their rich textual detail, existing infrastructure, and widespread usage\u2014email data sets hold the potential to answer many important scientific and practical questions within the organizational and social sciences. While some questions may be answered by studying the structure of an email network as a whole, other, more nuanced, questions can only be answered at finer levels of granularity\u2014specifically, by studying topic-specific subnetworks. For example, breaks in communication (or duplicated communication) about particular topics may indicate a need for some form of organizational restructuring. 
In order to facilitate the study of these kinds of questions, we present a new Bayesian admixture model intended for discovering and summarizing topic-specific communication subnetworks in email data sets.\n\n\u2217Work done at the University of Massachusetts Amherst\n\nFigure 1: Our model partitions an observed email network (left) into topic-specific subnetworks (right) by associating each author\u2013recipient edge in the observed network with a single topic.\n\nThere are a number of probabilistic models that incorporate both network and text data. Although some of these models are specifically for email networks (e.g., McCallum et al.\u2019s author\u2013recipient\u2013topic model [2]), most are intended for networks of documents, such as web pages and the links between them [3] or academic papers and their citations [4]. In contrast, an email network is more naturally viewed as a network of actors exchanging documents, i.e., actors are associated with nodes while documents are associated with edges. In other words, an email network defines a multinetwork in which there may be multiple edges (one per email) between any pair of actors. Perhaps more importantly, much of the recent work on modeling networks and text has focused on tasks such as predicting links or detecting communities. Instead, we take a complementary approach and focus on exploratory analysis. Specifically, our goal is to discover and visualize topic-specific subnetworks. Rather than taking a two-stage approach in which subnetworks are discovered using one model and visualized using another, we present a single probabilistic model that partitions an observed email network into topic-specific subnetworks while simultaneously producing a visual representation of each subnetwork. 
If network modeling and visualization are undertaken separately, the resultant visualizations may not directly reflect the model and its relationship to the observed data. Rather, these visualizations provide a view of the model and the data seen through the lens of the visualization algorithm and its associated assumptions, so any conclusions drawn from such visualizations can be biased by artifacts of the visualization algorithm. Producing principled visualizations of networks, i.e., visualizations that have precise interpretations in terms of an associated network model and its relationship to the observed data, remains an open challenge in statistical network modeling [5]. Addressing this open challenge was a primary objective in the development of our new model.\n\nIn order to discover and visualize topic-specific subnetworks, our model must associate each author\u2013recipient edge in the observed email network with a topic, as shown in Figure 1. Our model draws upon ideas from latent Dirichlet allocation (LDA) [6] to identify a set of corpus-wide topics of communication, as well as the subset of topics that best describe each observed email. We model network structure using an approach similar to that of Hoff et al.\u2019s latent space model (LSM) [7] so as to facilitate visualization. Given an observed network, LSM associates each actor in the network with a point in K-dimensional Euclidean space. For any pair of actors, the smaller the distance between their points, the more likely they are to interact. If K = 2 or K = 3, these interaction probabilities, collectively known as a \u201ccommunication pattern\u201d, can be directly visualized in 2- or 3-dimensional space via the locations of the actor-specific points. Our model extends this idea by associating a K-dimensional Euclidean space with each topic. 
Observed author\u2013recipient edges are explicitly associated with topics via the K-dimensional topic-specific communication patterns.\n\nIn the next section, we present the mathematical details of our new model and outline a corresponding inference algorithm. We then introduce a new email data set: the New Hanover County (NHC) email network. Although our model is intended for exploratory analysis, we test our modeling assumptions via three validation tasks. In Section 4.1, we show that our model achieves better link prediction performance than three state-of-the-art network models. We also demonstrate that our model is capable of inferring topics that are as coherent as those inferred using LDA. Together, these experiments indicate that our model is an appropriate model of network structure and that modeling this structure does not compromise topic quality. As a final validation experiment, we show that synthetic data generated using our model possesses similar network statistics to those of the NHC email network. In Section 4.4, we showcase our model\u2019s ability to discover and visualize topic-specific communication patterns using the NHC network. We give an extensive analysis of these communication patterns and demonstrate that they provide accessible visualizations of email-based collaboration while possessing precise, meaningful interpretations within the mathematical framework of our model. These findings lead us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models.\n\n2 Topic-Partitioned Multinetwork Embeddings\n\nIn this section, we present our new probabilistic generative model (and associated inference algorithm) for communication networks. 
For concreteness, we frame our discussion of this model in terms of email data, although it is generally applicable to any similarly-structured communication data. The generative process and graphical model are provided in the supplementary materials.\n\nA single email, indexed by d, is represented by a set of tokens w^{(d)} = {w^{(d)}_n}_{n=1}^{N^{(d)}} that comprise the text of that email, an integer a^{(d)} \u2208 {1, ..., A} indicating the identity of that email\u2019s author, and a set of binary variables y^{(d)} = {y^{(d)}_r}_{r=1}^{A} indicating whether each of the A actors in the network is a recipient of that email. For simplicity, we assume that authors do not send emails to themselves (i.e., y^{(d)}_r = 0 if r = a^{(d)}). Given a real-world email data set D = {(w^{(d)}, a^{(d)}, y^{(d)})}_{d=1}^{D}, our model permits inference of the topics expressed in the text of the emails, a set of topic-specific K-dimensional embeddings (i.e., points in K-dimensional Euclidean space) of the A actors in the network, and a partition of the full communication network into a set of topic-specific subnetworks.\n\nAs in LDA [6], a \u201ctopic\u201d t is characterized by a discrete distribution over V word types with probability vector \u03c6^{(t)}. A symmetric Dirichlet prior with concentration parameter \u03b2 is placed over \u03a6 = {\u03c6^{(1)}, ..., \u03c6^{(T)}}. To capture the relationship between the topics expressed in an email and that email\u2019s recipients, each topic t is also associated with a \u201ccommunication pattern\u201d: an A \u00d7 A matrix of probabilities P^{(t)}. Given an email about topic t, authored by actor a, element p^{(t)}_{ar} is the probability of actor a including actor r as a recipient of that email. 
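As an illustrative sketch (ours, not the authors' code), the data representation and topic parameters described above might look as follows in Python; the sizes A, V, and T and all identifiers are assumptions made for the example:

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the paper's data set).
A = 5      # number of actors in the network
V = 1000   # number of word types in the vocabulary
T = 10     # number of topics

def make_email(tokens, author, recipients):
    """Build one email record (w^(d), a^(d), y^(d)).

    tokens     : list of word-type indices comprising the email's text
    author     : index a^(d) of the email's author
    recipients : indices of actors receiving the email
    """
    y = np.zeros(A, dtype=int)
    y[list(recipients)] = 1
    y[author] = 0  # authors never send email to themselves
    return {"w": np.asarray(tokens, dtype=int), "a": author, "y": y}

email = make_email(tokens=[3, 17, 17, 42], author=0, recipients=[1, 4])

# Topic parameters: phi[t] is a distribution over the V word types,
# drawn from a symmetric Dirichlet prior with concentration beta.
beta = 0.1
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.full(V, beta / V), size=T)
```

Each record mirrors the triple (w^{(d)}, a^{(d)}, y^{(d)}), with the self-edge constraint y^{(d)}_r = 0 for r = a^{(d)} enforced on construction.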
Inspired by LSM [7], each communication pattern P^{(t)} is represented implicitly via a set of A points in K-dimensional Euclidean space S^{(t)} = {s^{(t)}_a}_{a=1}^{A} and a scalar bias term b^{(t)} such that p^{(t)}_{ar} = p^{(t)}_{ra} = \u03c3(b^{(t)} \u2212 \u2016s^{(t)}_a \u2212 s^{(t)}_r\u2016), with s^{(t)}_a \u223c N(0, \u03c3^2_1 I) and b^{(t)} \u223c N(\u00b5, \u03c3^2_2).^1 If K = 2 or K = 3, this representation enables each topic-specific communication pattern to be visualized in 2- or 3-dimensional space via the locations of the points associated with the A actors. It is worth noting that the dimensions of each K-dimensional space have no inherent meaning. In isolation, each point s^{(t)}_a conveys no information; however, the distance between any two points has a precise and meaningful interpretation in the generative process. Specifically, the recipients of any email associated with topic t are more likely to be those actors near to the email\u2019s author in the Euclidean space corresponding to that topic.\n\nEach email, indexed by d, has a discrete distribution over topics \u03b8^{(d)}. A symmetric Dirichlet prior with concentration parameter \u03b1 is placed over \u0398 = {\u03b8^{(1)}, ..., \u03b8^{(D)}}. Each token w^{(d)}_n is associated with a topic assignment z^{(d)}_n, such that z^{(d)}_n \u223c \u03b8^{(d)} and w^{(d)}_n \u223c \u03c6^{(t)} for z^{(d)}_n = t. Our model does not include a distribution over authors; the generative process is conditioned upon their identities. The email-specific binary variables y^{(d)} = {y^{(d)}_r}_{r=1}^{A} indicate the recipients of email d and thus the presence (or absence) of email-specific edges from author a^{(d)} to each of the A \u2212 1 other actors. Consequently, there may be multiple edges (one per email) between any pair of actors, and D defines a multinetwork over the entire set of actors. We assume that the complete multinetwork comprises T topic-specific subnetworks. 
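The latent-space parameterization of a communication pattern can be sketched as follows; this is an illustrative reimplementation of the stated formula p^{(t)}_{ar} = \u03c3(b^{(t)} \u2212 \u2016s^{(t)}_a \u2212 s^{(t)}_r\u2016), with all names ours:

```python
import numpy as np

def communication_pattern(s, b):
    """Edge probabilities p_ar = sigmoid(b - ||s_a - s_r||) for one topic.

    s : (A, K) array of latent actor positions s^(t)
    b : scalar bias term b^(t)
    Returns an (A, A) matrix; the diagonal (self-edges) is zeroed out.
    """
    # Pairwise Euclidean distances ||s_a - s_r|| via broadcasting.
    dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    p = 1.0 / (1.0 + np.exp(-(b - dist)))  # logistic function of b - distance
    np.fill_diagonal(p, 0.0)               # authors never email themselves
    return p

rng = np.random.default_rng(1)
s = rng.normal(size=(4, 2))  # A = 4 actors embedded in K = 2 dimensions
P = communication_pattern(s, b=1.0)
```

Because each probability depends only on the bias and a pairwise distance, the resulting matrix is symmetric (p_ar = p_ra), and actors embedded near one another are the likeliest recipients of email about that topic.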
In other words, each y^{(d)}_r is associated with some topic t and therefore with topic-specific communication pattern P^{(t)} such that y^{(d)}_r \u223c Bern(p^{(t)}_{ar}) for a^{(d)} = a. A natural way to associate each y^{(d)}_r with a topic would be to draw a topic assignment from \u03b8^{(d)} in a manner analogous to the generation of z^{(d)}_n; however, as outlined by Blei and Jordan [8], this approach can result in the undesirable scenario in which one subset of topics is associated with tokens, while another (disjoint) subset is associated with edges. Additionally, models of annotated data that possess this exchangeable structure tend to exhibit poor generalization [3, 8]. A better approach, advocated by Blei and Jordan, is to draw a topic assignment for each y^{(d)}_r from the empirical distribution over topics defined by z^{(d)}. By definition, the set of topics associated with edges will therefore be a subset of the topics associated with tokens. One way of simulating this generative process is to associate each y^{(d)}_r with a position n = 1, ..., max(1, N^{(d)}) and therefore with the topic assignment z^{(d)}_n at that position^2 by drawing a position assignment x^{(d)}_r \u223c U(1, ..., max(1, N^{(d)})) for each y^{(d)}_r. This indirect procedure ensures that y^{(d)}_r \u223c Bern(p^{(t)}_{ar}) for a^{(d)} = a, x^{(d)}_r = n, and z^{(d)}_n = t, as desired.\n\n^1The function \u03c3(\u00b7) is the logistic function, while the function \u2016\u00b7\u2016 is the l2-norm.\n^2Emails that do not contain any text (i.e., N^{(d)} = 0) convey information about the frequencies of communication between their authors and recipients. 
As a result, we do not omit such emails from D; instead, we augment each one with a single, \u201cdummy\u201d topic assignment z^{(d)}_1 for which there is no associated token w^{(d)}_1.\n\n2.1 Inference\n\nFor real-world data D = {(w^{(d)}, a^{(d)}, y^{(d)})}_{d=1}^{D}, the tokens W = {w^{(d)}}_{d=1}^{D}, authors A = {a^{(d)}}_{d=1}^{D}, and recipients Y = {y^{(d)}}_{d=1}^{D} are observed, while \u03a6, \u0398, S = {S^{(t)}}_{t=1}^{T}, B = {b^{(t)}}_{t=1}^{T}, Z = {z^{(d)}}_{d=1}^{D}, and X = {x^{(d)}}_{d=1}^{D} are unobserved. Dirichlet\u2013multinomial conjugacy allows \u03a6 and \u0398 to be marginalized out [9], while typical values for the remaining unobserved variables can be sampled from their joint posterior distribution using Markov chain Monte Carlo methods. In this section, we outline a Metropolis-within-Gibbs sampling algorithm that operates by sequentially resampling the value of each latent variable (i.e., s^{(t)}_a, b^{(t)}, z^{(d)}_n, or x^{(d)}_r) from its conditional posterior.\n\nSince z^{(d)}_n is a discrete random variable, new values may be sampled directly using\n\nP(z^{(d)}_n = t | w^{(d)}_n = v, W_{\\d,n}, A, Y, S, B, Z_{\\d,n}, X, \u03b1, \u03b2) \u221d (N^{(t|d)}_{\\d,n} + \u03b1/T) \u00b7 ((N^{(v|t)}_{\\d,n} + \u03b2/V) / (N^{(t)}_{\\d,n} + \u03b2)) \u00b7 \u220f_{r: x^{(d)}_r = n} (p^{(t)}_{a^{(d)}r})^{y^{(d)}_r} (1 \u2212 p^{(t)}_{a^{(d)}r})^{1 \u2212 y^{(d)}_r} for N^{(d)} > 0, and \u221d (N^{(t|d)}_{\\d,n} + \u03b1/T) \u00b7 ((N^{(v|t)}_{\\d,n} + \u03b2/V) / (N^{(t)}_{\\d,n} + \u03b2)) \u00b7 \u220f_{r: r \u2260 a^{(d)}} (p^{(t)}_{a^{(d)}r})^{y^{(d)}_r} (1 \u2212 p^{(t)}_{a^{(d)}r})^{1 \u2212 y^{(d)}_r} otherwise, where subscript \u201c\\d, n\u201d denotes a quantity excluding data from position n in email d. Count N^{(t)} is the total number of tokens in W assigned to topic t by Z, of which N^{(v|t)} are of type v and N^{(t|d)} belong to email d. 
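For the N^{(d)} > 0 case, the conditional above multiplies an LDA-style count term by the edge likelihood of the recipients currently assigned to position n. It can be sketched as follows (an illustrative, unoptimized implementation; the count arrays are assumed to already exclude position n of email d, and all identifiers are ours):

```python
import numpy as np

def z_posterior(n, author, y, x, N_td, N_vt, N_t, P, alpha, beta, V):
    """Normalized conditional P(z^(d)_n = t | ...) over all topics t.

    n      : token position being resampled
    author : index a^(d) of email d's author
    y      : (A,) binary recipient indicators y^(d)_r
    x      : (A,) position assignments x^(d)_r for email d
    N_td   : (T,) topic counts within email d, excluding position n
    N_vt   : (T,) per-topic counts of the word type at position n, excl. it
    N_t    : (T,) total token counts per topic, excluding position n
    P      : (T, A, A) topic-specific recipient probabilities p^(t)_{ar}
    """
    T = P.shape[0]
    lda_term = (N_td + alpha / T) * (N_vt + beta / V) / (N_t + beta)
    edge_term = np.ones(T)
    for r in np.flatnonzero(x == n):   # recipients whose position is n
        if r == author:
            continue
        p = P[:, author, r]            # p^(t)_{a^(d) r} for every topic t
        edge_term *= np.where(y[r] == 1, p, 1.0 - p)
    probs = lda_term * edge_term
    return probs / probs.sum()         # normalize before sampling a topic
```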
New values for discrete random variable x^{(d)}_r may be sampled directly using\n\nP(x^{(d)}_r = n | A, Y, S, B, z^{(d)}_n = t, Z_{\\d,n}) \u221d (p^{(t)}_{a^{(d)}r})^{y^{(d)}_r} (1 \u2212 p^{(t)}_{a^{(d)}r})^{1 \u2212 y^{(d)}_r}.\n\nNew values for continuous random variables s^{(t)}_a and b^{(t)} cannot be sampled directly from their conditional posteriors, but may instead be obtained using the Metropolis\u2013Hastings algorithm. With a non-informative prior over s^{(t)}_a (i.e., s^{(t)}_a \u223c N(0, \u221e)), the conditional posterior over s^{(t)}_a is\n\nP(s^{(t)}_a | A, Y, S^{(t)}_{\\a}, b^{(t)}, Z, X) \u221d \u220f_{r: r \u2260 a} (p^{(t)}_{ar})^{N^{(1|a,r,t)} + N^{(1|r,a,t)}} (1 \u2212 p^{(t)}_{ar})^{N^{(0|a,r,t)} + N^{(0|r,a,t)}},\n\nwhere count N^{(1|a,r,t)} = \u2211_{d=1}^{D} 1(a^{(d)} = a) 1(y^{(d)}_r = 1) (\u2211_{n=1}^{N^{(d)}} 1(x^{(d)}_r = n) 1(z^{(d)}_n = t)).^3 Counts N^{(1|r,a,t)}, N^{(0|a,r,t)}, and N^{(0|r,a,t)} are defined similarly. Likewise, with an improper, non-informative prior over b^{(t)} (i.e., b^{(t)} \u223c N(0, \u221e)), the conditional posterior over b^{(t)} is\n\nP(b^{(t)} | A, Y, S^{(t)}, Z, X) \u221d \u220f_{a=1}^{A} \u220f_{r: r < a} (p^{(t)}_{ar})^{N^{(1|a,r,t)} + N^{(1|r,a,t)}} (1 \u2212 p^{(t)}_{ar})^{N^{(0|a,r,t)} + N^{(0|r,a,t)}}.
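A random-walk Metropolis step for a single position s^{(t)}_a can be sketched as follows (illustrative only; we assume the edge and non-edge counts N^{(1|a,r,t)} and N^{(0|a,r,t)} have been tabulated into arrays N1 and N0, and the symmetric Gaussian proposal with a fixed step size is our choice, not a detail taken from the paper):

```python
import numpy as np

def log_post_s(s_a, a, s, b, N1, N0):
    """Log of the (unnormalized) conditional posterior for s^(t)_a.

    N1, N0 : (A, A) arrays of directed edge / non-edge counts for topic t
    """
    ll = 0.0
    for r in range(s.shape[0]):
        if r == a:
            continue
        p = 1.0 / (1.0 + np.exp(-(b - np.linalg.norm(s_a - s[r]))))
        ll += (N1[a, r] + N1[r, a]) * np.log(p)
        ll += (N0[a, r] + N0[r, a]) * np.log1p(-p)
    return ll

def mh_step_s(a, s, b, N1, N0, rng, step=0.1):
    """One Metropolis-Hastings update of actor a's position (in place)."""
    proposal = s[a] + step * rng.normal(size=s.shape[1])  # symmetric proposal
    log_accept = (log_post_s(proposal, a, s, b, N1, N0)
                  - log_post_s(s[a], a, s, b, N1, N0))
    if np.log(rng.random()) < log_accept:
        s[a] = proposal  # accept; otherwise keep the current position
    return s
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of conditional posteriors; an analogous scalar update applies to the bias term b^{(t)}.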