{"title": "Diffusion Maps for Textual Network Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 7587, "page_last": 7597, "abstract": "Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices.\nRather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.", "full_text": "Diffusion Maps for Textual Network Embedding\n\nXinyuan Zhang, Yitong Li, Dinghan Shen, Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\n{xy.zhang, yitong.li, dinghan.shen, lcarin}@duke.edu\n\nDuke University\n\nDurham, NC 27707\n\nAbstract\n\nTextual network embedding leverages rich text information associated with the\nnetwork to learn low-dimensional vectorial representations of vertices. Rather\nthan using typical natural language processing (NLP) approaches, recent research\nexploits the relationship of texts on the same edge to graphically embed text. How-\never, these models neglect to measure the complete level of connectivity between\nany two texts in the graph. 
We present diffusion maps for textual network embed-\nding (DMTE), integrating global structural information of the graph to capture\nthe semantic relatedness between texts, with a diffusion-convolution operation\napplied on the text inputs. In addition, a new objective function is designed to ef\ufb01-\nciently preserve the high-order proximity using the graph diffusion. Experimental\nresults show that the proposed approach outperforms state-of-the-art methods on\nthe vertex-classi\ufb01cation and link-prediction tasks.\n\n1\n\nIntroduction\n\nLearning effective vectorial embeddings to rep-\nresent text can lead to improvements in many\nnatural language processing (NLP) tasks. How-\never, most text embedding models do not em-\nbed the semantic relatedness between different\ntexts. Graphical text networks address this prob-\nlem by adding edges between correlated text\nvertices. For example, paper citation networks\ncontain rich textual information and the citation\nrelationships provide structural information to\nre\ufb02ect the similarity between papers. Graphical\ntext embedding naturally extends the problem\nto network embedding (NE), mapping vertices\nof a graph into a low-dimensional space. The\nlearned representations containing structure and textual information can be used as features for\nnetwork tasks, such as vertex classi\ufb01cation [22], link prediction [14], and tag recommendation [31].\nLearning network embeddings is a challenging research problem, due to the sparsity, non-linearity\nand high dimensionality of the graph data.\nIn order to exploit textual information associated with each vertex, some NE models [13, 33, 19, 26]\nembed texts with a variety of NLP approaches, ranging from bag-of-words models to deep neural\nmodels. However these text embedding methods fail to consider the semantic distance indicated\nfrom the graph. 
In [30, 24] it was recently proposed to simultaneously embed two texts on the same\nedge using a mutual-attention mechanism. But in real-world sparse networks, it is intuitive that two\nconnected vertices do not necessarily share more similarities than two unconnected vertices. Figure 1\npresents three examples from the DBLP dataset. By aligning dynamic index and multi-dimensional,\nthe sentences of vertex A and vertex C are closer to each other than either is to the sentence of their\ncommon first neighbor, vertex B. The relatedness between two vertices that are not linked by an edge cannot be preserved by\nonly capturing the local pairwise proximity.\n\nFigure 1: Three sentences from the DBLP dataset. Vertices A and C are second neighbors, i.e., two vertices that\nare not on the same edge but share at least one common\nneighbor (vertex B). The alignment words are colored.\n[Figure content: three paper titles, labeled A, B, and C and joined by two citation edges: The K-D-B-Tree: A Search Structure For Large Multidimensional Dynamic Indexes; Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data; Efficiently Processing Queries on Interval-and-Value Tuples in Relational Databases.]\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe propose a flexible approach for textual network embedding, including global structural information\nwithout increasing model complexity. Global structure information serves to capture the long-distance\nrelationship between two texts, incorporating connection paths within different steps. The diffusion-convolution\noperation [2] is employed to build a latent representation of the graph-structured text\ninputs, by scanning a diffusion map across each vertex. The graph diffusion, comprised of a\nnormalized adjacency matrix and its power series, provides the probability of random walks from\none vertex to another within a certain number of steps in the graph. 
The idea is to measure the level\nof connectivity between any two texts when considering all paths between them. In this study, we\nconsider text-based information networks, but our model can be \ufb02exibly extended to other types of\ncontent.\nWe further use the graph diffusion to redesign the objective function, capturing high-order proximity.\nUnlike some NE models [27], that calculate the probability of vertex vi being generated by vj, we\npreserve high-order proximity by calculating the probability of vertex vi given the diffusion map\nof vj. Compared to GraRep [5], the proposed objective function is more computationally ef\ufb01cient,\nespecially for large-scale networks, because it does not need matrix factorization during training.\nThis objective function is able to scale to directed or undirected, and weighted or unweighted graphs.\nTo demonstrate the effectiveness of our model, we focus on two common tasks in analysis of textual\ninformation networks: (i) multi-label classi\ufb01cation, where we predict the labels of each text; and (ii)\nlink prediction, where we predict the existence of an edge given a pair of vertices. The experiments\nare conducted on several real-world datasets of information networks. Experimental results show\nthat the DMTE model outperforms all other methods considered. The superiority of the proposed\napproach indicates that the diffusion process helps to incorporate long-distance relationship between\ntexts and thus to achieve more informative textual network embeddings.\n\n2 Related Work\n\nText Embedding Many existing methods embed text messages into a vector space for various NLP\ntasks. Early approaches include bag-of-words models or topic models [4]. 
The Skip-gram model [16],\nwhich learns distributed word vectors by utilizing word co-occurrences in a local context, has been\nfurther extended to the document level via a paragraph vector [13] to learn text latent representations.\nTo exploit the internal structure of text, more-complicated text embedding models have emerged,\nadopting deep neural network architectures. For example, convolutional neural networks (CNNs)\n[10, 6, 34] have been considered to apply a convolution kernel over different positions of the text,\nfollowed by max-pooling to obtain a \ufb01xed-length vectorial representation. Recursive neural tensor\nnetworks (RNTNs) [25] have applied a tensor-based composition function over parse trees to obtain\nsentence representations. LSTM-based recurrent neural networks (RNNs) [12] capture long-term\ndependencies in the text, using long short-term memory cells. However, deep neural architectures\nusually assume the availability of a large dataset, unrealistic for many information networks. When\nthe data size is small, some methods [18, 9] avoid over-\ufb01tting by simply averaging embeddings of\neach word in the text, achieving competitive empirical results.\n\nNetwork Embedding Earlier works including IsoMap [29], LLE [21], and Laplacian Eigenmaps\n[3] transform feature vectors of vertices into an af\ufb01nity graph, and then solve for the leading\neigenvectors as the embedding. Recent NE models focus on learning the vectorial representation of\nexisting networks. For example, DeepWalk [20] uses the Skip-gram model [16] on vertex sequences\ngenerated by truncated random walks, learning vertex embeddings. In node2vec [8], the random walk\nstrategy of DeepWalk is modi\ufb01ed for multi-scale representation learning. 
To exploit the distance\nbetween vertices, LINE [27] designed objective functions to preserve the first-order and second-order\nproximity, while [5] integrates global structure information by expanding the proximity into k-order.\nIn [32] deep models are employed to capture the nonlinear network structure. However, all these\nmethods only consider structural information of the network, without leveraging rich heterogeneous\ninformation associated with vertices; this may result in less informative representations, especially\nwhen the edges are sparse.\n\nTo address this issue, some recent works combine structure and content information to learn better\nembeddings. For example, TADW [33] shows that DeepWalk is equivalent to matrix factorization, and\ntext features can be incorporated into the framework. TriDNR [19] uses information from structure,\ncontent and labels in a coupled neural network architecture, to learn the vertex representation. CENE\n[26] integrates text modeling and structure modeling by regarding the content information as a special\nkind of vertex. CANE [30] learns two embedding vectors for each vertex where the context-aware\ntext embedding is obtained using a mutual attention mechanism. However, none of these methods\ntakes into account the similarities of context influenced by global structural information.\n\n3 Problem Definition\nDefinition 1. A textual information network is G = (V, E, T ), where $V = \\{v_i\\}_{i=1,\\cdots,N}$ is the\nset of vertices, $E = \\{e_{i,j}\\}_{i,j=1}^{N}$ is the set of edges, and $T = \\{t_i\\}_{i=1,\\cdots,N}$ is the set of texts associated\nwith vertices. Each edge ei,j has a weight si,j representing the relationship between vertices vi and\nvj. If vi and vj are not linked, si,j = 0. If there exists an edge between vi and vj, si,j = 1 for an\nunweighted graph, and si,j > 0 for a weighted graph. A path is a sequence of edges that connect two\nvertices. 
The text of vertex vi, ti, is comprised of a word sequence $\\langle w_1, \\cdots, w_{|t_i|} \\rangle$.\nDefinition 2. Let $S \\in \\mathbb{R}^{N \\times N}$ be the adjacency matrix of a graph whose entry $s_{i,j} \\geq 0$ is the weight\nof edge ei,j. The transition matrix $P \\in \\mathbb{R}^{N \\times N}$ is obtained by normalizing rows of S to sum to one,\nwith $p_{i,j}$ representing the transition probability from vertex vi to vertex vj within one step. Then an\nh-step transition matrix can be computed as P to the h-th power, i.e., $P^h$. The entry $p^h_{i,j}$ refers to\nthe transition probability from vertex vi to vertex vj within exactly h steps.\n\nDefinition 3. A network embedding aims to learn a low-dimensional vector $v_i \\in \\mathbb{R}^d$ for vertex\n$v_i \\in V$, where $d \\ll |V|$ is the dimension of the embedding. The embedding matrix V for the\ncomplete graph is the concatenation of $\\{v_1, v_2, \\cdots, v_N\\}$. The distance between vertices on the\ngraph and context similarity should be preserved in the representation space.\n\nDefinition 4. The diffusion map of vertex vi is $u_i$, the i-th row of the diffusion embedding matrix\nU, which maps from vertices and their embeddings to the results of a diffusion process that begins at\nvertex vi. U is computed by\n\n$$U = \\sum_{h=0}^{H-1} \\lambda_h P^h V, \\qquad (1)$$\n\nwhere $\\lambda_h$ is the importance coefficient that typically decreases as the value of h increases. The\nhigh-order proximity in the network is preserved in diffusion maps.\n\n4 Method\n\nWe employ a diffusion process to build long-distance semantic relatedness in text embeddings, and\nglobal structural information in the objective function. To incorporate both the structure and textual\ninformation of the network, we adopt two types of embeddings, $v_i^s$ and $v_i^t$, for each vertex vi, as\nproposed in [30]. The structure-based embedding vector $v_i^s$ is obtained by feeding the i-th row\nof a learned structure embedding table $E_s \\in \\mathbb{R}^{N \\times d_s}$ into a function. 
The text-based embedding\nvector $v_i^t$ is obtained by applying the diffusion convolutional operation on the text inputs (see Section\n4.2). Here dimensions of the structure embedding and the text embedding satisfy ds + dt = d. The\nembedding of vertex vi is simply the concatenation of $v_i^t$ and $v_i^s$, i.e., $v_i = v_i^t \\oplus v_i^s$. In this work, vi\nis learned by an unsupervised approach, and it can be used directly as a feature vector of vertex vi for\nvarious tasks. The objective function consists of four parts, which measure both the structure and text\nembeddings. The high-order proximity is preserved during training without increasing computational\ncomplexity. The entire framework for textual network embedding is illustrated in Figure 3, where\neach vertex is associated with a text.\n\n4.1 Diffusion Process\n\nInitially the network only has a few active vertices, due to sparsity. Through the diffusion\nprocess, information is delivered from active vertices to inactive ones by filling information gaps\nbetween vertices [1]; vertices may be connected\nby indirect, multi-step paths. This process is\nanalogous to molecular diffusion in a fluid,\nwhere particles move from high-concentration\nareas to low-concentration areas. We introduce\nthe transition matrix P and its power series for\nthe diffusion process. The directed graph with\nfour vertices and normalized weights in Figure 2\nshows the smoothing effect of higher powers of\nP in the diffusion process. The original graph only\nhas edges e1,2, e1,3, e3,4 and e1,4, while the information gaps between other vertices are not depicted. The diffusion process can smooth the whole\ngraph with the higher order of P, so that indirect relationships, such as (n2, n4), can be connected\n(via a multi-step diffusion process). As we can see from Figure 2(b), the fourth-order diffusion\ngraph is fully connected. 
The number associated with each edge represents the transition probability\nfrom one vertex to another within exactly 4 steps. The network will be stable when information is\neventually evenly distributed.\n\nFigure 2: A simple example of diffusion process\nin a directed graph.\n\n4.2 Text Embedding\nA word sequence $t = \\langle w_1, \\cdots, w_{|t|} \\rangle$ is mapped into a set of dt-dimensional real-valued vectors\n$\\langle w_1, \\cdots, w_{|t|} \\rangle$ by looking up the word embedding matrix Ew. Here $E_w \\in \\mathbb{R}^{|w| \\times d_t}$ is initialized\nrandomly, and learned during training, and |w| is the vocabulary size of the dataset. We can obtain a\nsimple text representation $x_i \\in \\mathbb{R}^{d_t}$ of vertex vi by taking the average of word vectors. Although\nthe word order is not preserved in such a representation, taking the average of word embeddings\ncan avoid over-fitting efficiently, especially when the data size is small [23]. Given the fixed-length\nvectors of each text, the input texts can be represented by matrix $X \\in \\mathbb{R}^{N \\times d_t}$, where the i-th row is\nxi.\n\n$$x = \\frac{1}{|t|} \\sum_{i=1}^{|t|} w_i, \\qquad X = x_1 \\oplus x_2 \\oplus \\cdots \\oplus x_N. \\qquad (2)$$\n\nAlternatively, we can use the bi-directional LSTM [7] which processes a text from both directions to\ncapture long-term dependencies. Text inputs are represented by the mean of all hidden states.\n\n$$\\overrightarrow{h}_i = \\mathrm{LSTM}(w_i, h_{i-1}), \\qquad \\overleftarrow{h}_i = \\mathrm{LSTM}(w_i, h_{i+1}), \\qquad (3)$$\n\n$$x = \\frac{1}{|t|} \\sum_{i=1}^{|t|} (\\overrightarrow{h}_i \\oplus \\overleftarrow{h}_i), \\qquad X = x_1 \\oplus x_2 \\oplus \\cdots \\oplus x_N. \\qquad (4)$$\n\nHowever, in this text representation matrix for both approaches, the embeddings are completely\nindependent, without leveraging the semantic relatedness indicated from the graph. 
To address this\nissue, we employ the diffusion convolutional operator [2] to measure the level of connectivity between\nany two texts in the network.\n\nFigure 3: An illustration of our framework for textual network embedding.\n\nLet $P^* \\in \\mathbb{R}^{N \\times H \\times N}$ be a tensor containing H hops of power series of P, i.e., the concatenation of\n$\\{P^0, P^1, \\cdots, P^{H-1}\\}$. $V_t^* \\in \\mathbb{R}^{N \\times H \\times d}$ is the tensor version of the text embedding representation,\nafter the diffusion convolutional operation. The activation $V_t^{*(i,j,k)}$ for vertex i, hop j, and feature k\nis given by\n\n$$V_t^{*(i,j,k)} = f\\Big(W^{(j,k)} \\cdot \\sum_{n=1}^{N} P^{*(i,j,n)} X^{(n,k)}\\Big), \\qquad (5)$$\n\nwhere $W \\in \\mathbb{R}^{H \\times d}$ is the weight matrix and f is a nonlinear differentiable function. The activations\ncan be expressed equivalently using tensor notation\n\n$$V_t^* = f(W \\odot P^* X), \\qquad (6)$$\n\nwhere $\\odot$ represents element-wise multiplication. This tensor representation considers all paths\nbetween two texts in the network, and thus includes long-distance semantic relationships. 
With longer\npaths discounted more than shorter paths, the text embedding matrix Vt is given by\n\n$$V_t = \\sum_{h=0}^{H-1} \\lambda_h V_t^{*(:,h,:)}. \\qquad (7)$$\n\nThrough the diffusion process, text representations, i.e., rows of Vt, are not embedded independently.\nWith the whole graph being smoothed, indirect relationships between texts that are not on the same\nedge can be considered to learn embeddings.\n\n4.3 Objective Function\n\nGiven the set of edges E, the goal of DMTE is to maximize the following overall objective function:\n\n$$\\mathcal{L} = \\sum_{e \\in E} \\mathcal{L}(e) = \\sum_{e \\in E} \\big[ \\alpha_{tt} \\mathcal{L}_{tt}(e) + \\alpha_{ss} \\mathcal{L}_{ss}(e) + \\alpha_{st} \\mathcal{L}_{st}(e) + \\alpha_{ts} \\mathcal{L}_{ts}(e) \\big], \\qquad (8)$$\n\nwhere $\\alpha_{tt}$, $\\alpha_{ss}$, $\\alpha_{st}$, and $\\alpha_{ts}$ control the weight of corresponding objectives. The overall objective\nconsists of four parts: $\\mathcal{L}_{tt}(e)$ denotes the objective for text embeddings, $\\mathcal{L}_{ss}(e)$ denotes the objective\nfor structure embeddings, $\\mathcal{L}_{st}(e)$ and $\\mathcal{L}_{ts}(e)$ denote the objectives that consider both structure and\ntext embeddings to map them into the same representation space. We assume the network is directed,\nsince an undirected edge can be considered as two opposite-directed edges with equal weights. 
Then\neach objective measures the log-likelihood of generating vi conditioned on vj, where vi and vj\nare on the same directed edge:\n\n$$\\mathcal{L}_{tt}(e) = s_{i,j} \\log p(v_i^t | v_j^t) = s_{i,j} \\log \\frac{\\exp(v_i^t \\cdot v_j^t)}{\\sum_{v_k^t \\in V_t} \\exp(v_k^t \\cdot v_j^t)}, \\qquad (9)$$\n\n$$\\mathcal{L}_{ss}(e) = s_{i,j} \\log p(v_i^s | u_j^s) = s_{i,j} \\log \\frac{\\exp(v_i^s \\cdot u_j^s)}{\\sum_{v_k^s \\in V_s} \\exp(v_k^s \\cdot u_j^s)}, \\qquad (10)$$\n\n$$\\mathcal{L}_{st}(e) = s_{i,j} \\log p(v_i^s | v_j^t) = s_{i,j} \\log \\frac{\\exp(v_i^s \\cdot v_j^t)}{\\sum_{v_k^s \\in V_s} \\exp(v_k^s \\cdot v_j^t)}, \\qquad (11)$$\n\n$$\\mathcal{L}_{ts}(e) = s_{i,j} \\log p(v_i^t | u_j^s) = s_{i,j} \\log \\frac{\\exp(v_i^t \\cdot u_j^s)}{\\sum_{v_k^t \\in V_t} \\exp(v_k^t \\cdot u_j^s)}. \\qquad (12)$$\n\nNote that $p(\\cdot | u_j^s)$ computes the probability conditioned on the diffusion map of vertex vj, and $p(\\cdot | v_j^t)$\ncomputes the probability conditioned on the text embedding of vertex vj. Compared to using $v_j^s$ to\ncompute the conditional probability, the diffusion map $u_j^s$ utilizes both local information and global\nrelations of vertex vj in the graph. We use $v_j^t$ instead of the diffusion map $u_j^t$ because the global\nstructural information is included during text embedding, with the diffusion convolutional operation.\nMoreover the high-order proximity is preserved without using matrix factorization, which may be\ncomputationally inefficient for large-scale networks.\n\n4.4 Optimization\n\nOptimizing (8) is computationally expensive, since the conditional probability requires the summation\nover the entire vertex set. In [17] negative sampling was proposed to solve this problem. For each\nedge ei,j, we sample multiple negative edges according to a noise distribution. 
Then during\ntraining the log conditional probability log p(vi|vj) can be replaced by\n\n$$\\log \\sigma(v_i \\cdot v_j) + \\sum_{k=1}^{K} \\mathbb{E}_{v_k \\sim P_n(v)} [\\log \\sigma(-v_k \\cdot v_j)], \\qquad (13)$$\n\nwhere $\\sigma(\\cdot)$ is the sigmoid function, K is the number of negative samples, and $P_n(v) \\propto d_v^{3/4}$ is the\ndistribution of vertices with $d_v$ being the out-degree of vertex v. All parameters are jointly trained.\nAdam [11] is adopted for stochastic optimization. In each step, Adam samples a mini-batch of edges\nand then updates the model parameters.\n\n5 Experiments\n\nWe evaluate the proposed method for the multi-label classification and link prediction tasks. We\ndesign four versions of DMTE in our experiments: (i) DMTE without diffusion process; (ii) DMTE\nwith text embedding only; (iii) DMTE with bidirectional LSTM (Bi-LSTM); (iv) DMTE with\nword average embedding (WAvg). In DMTE without diffusion process, the diffusion convolutional\noperation is not added on top of the text inputs, i.e., the text embedding matrix Vt is directly replaced\nby X in Eq. 2. In DMTE with text embedding only, the embedding of vertex vi is only $v_i^t$ instead of\nthe concatenation of $v_i^t$ and $v_i^s$. In DMTE with Bi-LSTM, the input text embedding matrix X is\nobtained using Eq. 4. In DMTE with WAvg, the input text embedding matrix X is obtained using\nEq. 2. We compare the four versions of the DMTE model with seven competitive network embedding\nalgorithms. Experimental results for multi-label classification are evaluated by Macro-F1 scores and\nexperimental results for link prediction are evaluated by Area Under the Curve (AUC).\n\nDatasets We conduct experiments on three real-world datasets: DBLP, Cora, and Zhihu.\n\u2022 DBLP [28] is a citation network that consists of bibliography data in computer science. In our\nexperiments, 60744 papers are collected in 4 research areas: database, data mining, artificial intelligence,\nand computer vision. 
The network has 52890 edges indicating the citation relationship\nbetween papers.\n\u2022 Cora [15] is a citation network that consists of 2277 machine learning papers in 7 classes. The\nnetwork has 5214 edges indicating the citation relationship between papers.\n\u2022 Zhihu [26] is a Q&A based community social network in China. In our experiments, 10000\nactive users are collected as vertices, with 43894 edges indicating the relationships between them. The descriptions\nof the topics they are interested in are used as text information.\n\nBaselines The following baselines are compared with our DMTE model:\n\u2022 Structure-Based Methods: DeepWalk [20], LINE [27], node2vec [8].\n\u2022 Structure and Text Combined Methods: TADW [33], TriDNR [19], CENE [26], CANE [30].\nEvaluation and Parameter Settings For link prediction, we evaluate the performance with AUC,\nwhich is widely used for a ranking list. Since the testing set only contains existing edges as positive\ninstances, we randomly sample the same number of non-existing edges as negative instances. Positive\nand negative edges are ranked according to a prediction function and AUC is employed to measure\nthe probability that vertices on a positive edge are more similar than those on a negative edge. The\nexperiment for each training ratio is executed 10 times and the mean AUC scores are reported, where\na higher value indicates better performance.\nFor multi-label classification, we evaluate the performance with Macro-F1 scores. We first learn\nembeddings with all edges and vertices in an unsupervised way. Once the vertex embeddings are\nobtained, we feed them into a classifier. 
The experiment for each training ratio is executed 10 times\nand the mean Macro-F1 scores are reported, where a higher value indicates better performance.\n\nTable 1: AUC scores for link prediction on Cora.\n\n% of edges  15%  25%  35%  45%  55%  65%  75%  85%  95%\nDeepWalk  56.0  63.0  70.2  75.5  80.1  85.2  85.3  87.8  90.3\nLINE  55.0  58.6  66.4  73.0  77.6  82.8  85.6  88.4  89.3\nnode2vec  55.9  62.4  66.1  75.0  78.7  81.6  85.9  87.3  88.2\nTADW  86.6  88.2  90.2  90.8  90.0  93.0  91.0  93.4  92.7\nTriDNR  85.9  88.6  90.5  91.2  91.3  92.4  93.0  93.6  93.7\nCENE  72.1  86.5  84.6  88.1  89.4  89.2  93.9  95.0  95.9\nCANE  86.8  91.5  92.2  93.9  94.6  94.9  95.6  96.6  97.7\nDMTE (w/o diffusion)  87.4  91.2  92.0  93.2  93.9  94.6  95.5  95.9  96.7\nDMTE (text only)  82.6  84.0  85.7  87.3  89.1  91.1  92.0  92.9  94.2\nDMTE (Bi-LSTM)  86.3  88.2  90.7  92.7  94.1  94.8  96.0  97.3  98.1\nDMTE (WAvg)  91.3  93.1  93.7  95.0  96.0  97.1  97.4  98.2  98.8\n\nTable 2: AUC scores for link prediction on Zhihu.\n\n% of edges  15%  25%  35%  45%  55%  65%  75%  85%  95%\nDeepWalk  56.6  58.1  60.1  60.0  61.8  61.9  63.3  63.7  67.8\nLINE  52.3  55.9  59.9  60.9  64.3  66.0  67.7  69.3  71.1\nnode2vec  54.2  57.1  57.3  58.3  58.7  62.5  66.2  67.6  68.5\nTADW  52.3  54.2  55.6  57.3  60.8  62.4  65.2  63.8  69.0\nTriDNR  53.8  55.7  57.9  59.5  63.0  64.6  66.0  67.5  70.3\nCENE  56.2  57.4  60.3  63.0  66.3  66.0  70.2  69.8  73.8\nCANE  56.8  59.3  62.9  64.5  68.9  70.4  71.4  73.6  75.4\nDMTE (w/o diffusion)  56.2  58.4  61.3  64.0  68.5  69.7  71.5  73.3  75.1\nDMTE (text only)  55.9  57.2  58.8  61.6  65.3  67.6  69.5  71.0  74.1\nDMTE (Bi-LSTM)  56.3  60.3  64.9  69.8  73.2  76.4  78.7  80.3  82.2\nDMTE (WAvg)  58.4  63.2  67.5  71.6  74.0  76.7  78.5  79.8  81.5\n\nWe set the embedding dimension d to 200 with ds and dt both equal to 100. 
The number of hops\nH is set to 4 and the importance coefficients $\\lambda_h$ are tuned for different datasets and different tasks\nwith $\\lambda_0 > \\lambda_1 > \\cdots > \\lambda_{H-1}$. \u03b1tt, \u03b1ss, \u03b1ts, and \u03b1st are set to 1, 1, 0.3 and 0.3 respectively. The\nnumber of negative samples K is set to 1 to speed up the training process. The word embedding\nmatrix Ew, the structure embedding table Es, and the diffusion weight matrix W are all randomly\ninitialized with a truncated Gaussian distribution. All models are implemented in TensorFlow using an\nNVIDIA Titan X GPU with 12 GB memory.\n\n5.1 Link Prediction\n\nGiven a pair of vertices, link prediction\nseeks to predict the existence of an unobserved edge using the trained representations. We use the Cora and Zhihu datasets for\nlink prediction. We randomly select a\nportion of edges (%e) for training in an unsupervised way and use the remaining edges for\ntesting.\nTables 1 and 2 show the AUC scores of different models for %e from 15% to 95% on\nCora and Zhihu. The best performance is\nhighlighted in bold. As can be seen from\nboth tables, our proposed method performs\nbetter than all other baseline methods. The\nAUC gains of the DMTE model over the state-of-the-art\nCANE model can be as much as\n4.5 and 6.8 on Cora and Zhihu respectively.\nThese results demonstrate the effectiveness\nof the learned embeddings using the proposed\nmethod on the link prediction task. We observe that baselines incorporating both structure and text\ninformation perform better than those that only utilize structure information, which indicates that text\nassociated with each vertex helps to achieve more informative embeddings. The proposed approach\nshows flexibility and robustness in various training ratios. As the portion of training edges gets larger,\nthe performance of our DMTE model steadily increases while other approaches suffer under either\nlow training ratio (such as CENE) or high training ratio (such as TADW).\n\nFigure 4: Performance over H. [Four panels: AUC versus the number of hops H (H = 1 to 6) at training ratios 15%, 35%, 55%, and 75%.]\n\nTable 3: Top-5 similar vertex search based on embeddings learned by DMTE.\n\nQuery: The K-D-B-Tree: A Search Structure For Large Multidimensional Dynamic Indexes.\n1. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.\n2. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries.\n3. Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data.\n4. Generalized Search Trees for Database Systems.\n5. High Performance Clustering Based on the Similarity Join.\n\nComparing the four versions of DMTE, DMTE with word embedding average as the text inputs has\nthe best performance on Cora at all training ratios and on Zhihu at low training ratios, while DMTE\nwith bidirectional LSTM as the text inputs has the best performance on Zhihu at high training ratios.\nThis is because when the training data is limited, the model with fewer parameters can successfully\navoid over-fitting and thus achieve better results. For larger networks like Zhihu with high training\ndata ratios, deep models (such as Bi-LSTM) with more parameters can be a good choice to encode\ninput texts. The model with the diffusion convolutional operation applied on text inputs performs\nbetter than the model without the diffusion process, verifying our assumption that the diffusion process\ncan help include long-distance semantic relationships and thus achieve better embeddings. We also\nobserve that DMTE with text embeddings only performs better than some baseline methods but\nworse than the other three DMTE variations, demonstrating the effectiveness of text embeddings and\nthe necessity of adding structure embeddings. 
Furthermore, DMTE with only the word-embedding\naverage as the text representation performs comparably to the baselines, demonstrating the\neffectiveness of the redesigned objective function, which calculates the conditional probability of\ngenerating vi given the diffusion map of vj.\n\nParameter Sensitivity Figure 4 shows the link prediction results w.r.t. the number of hops H\nat different training ratios. The model we use here is DMTE(WAvg). Note that when H = 1 the\nmodel is equivalent to DMTE without the diffusion process. As H gets larger, the performance of DMTE\nincreases initially and then plateaus once H is large enough. This observation indicates that the\ndiffusion process can help exploit the relatedness of any two vertices in the graph; however, this\nrelatedness is negligible when the two vertices are too far apart.\n\n5.2 Multi-Label Classification\n\nMulti-label classification seeks to classify each vertex into a set of labels using the learned vertex\nrepresentation as features. We use the DBLP dataset for multi-label classification. Here DMTE refers\nto DMTE(WAvg). To minimize the impact of complicated learning approaches on the classification\nperformance, a linear SVM is employed instead of a sophisticated deep classifier. We randomly sample\na portion of labeled vertices with embeddings (%l = {10%, 30%, 50%, 70%}) to train the classifier\nand use the remaining vertices for testing.\nFigure 5 shows the F1-Macro scores of different models on\nDBLP. 
Compared to baselines, the proposed DMTE\nmodel consistently achieves performance improvement at all training ratios, demonstrating that DMTE\nlearns high-quality embeddings which can be used directly as features for multi-label vertex classification.\nThe F1-Macro score gains of DMTE over the baseline CANE indicate that the embeddings learned using\nglobal structure information are more informative than those that consider only local pairwise proximity.\nWe also observe that structure-based methods perform\nmuch worse than methods based on structure and text combined, which further shows the importance\nof integrating both structure and text information in textual network embeddings.\n\nFigure 5: F1-Macro scores for multi-label\nclassification on DBLP. (The plot shows F1-Macro score against label percentage, 10% to 70%, for DeepWalk, LINE, TADW, TriDNR, CANE and DMTE.)\n\n5.3 Case Study\n\nTo visualize the effectiveness of the learned embeddings, we retrieve the most similar vertices and\ntheir corresponding texts for a given query vertex. Closeness is measured by the cosine similarity\nof the vectorial representations learned by DMTE. Table 3 shows the texts of the top 5 closest\nvertex embeddings of a query paper in the DBLP dataset. In the graph, vertices 1, 2, 4, and 5 are all\nneighbors of the query while vertex 3 is not directly connected with the query vertex. As observed,\nthe direct neighbors, vertices 1 and 2, are not only structurally but also textually similar to the query vertex,\nwith multiple words aligned such as tree, index and multi-dimensional. Although vertex 3 does not\nshare an edge with the query vertex, its semantic relatedness places it closer than direct neighbors\nof the query such as vertices 4 and 5. 
This illustrates that the embeddings learned by DMTE\nsuccessfully incorporate both structure and text information, helping to explain the quality of the\naforementioned results.\n\n6 Conclusions\n\nWe have proposed a new DMTE model for textual network embedding. Unlike existing embedding\nmethods that neglect semantic relatedness between texts or only exploit local pairwise relationships,\nthe proposed method integrates global structural information of the graph to capture the level of\nconnectivity between any two texts, by applying a diffusion convolutional operation on the text\ninputs. Furthermore, we designed a new objective that preserves high-order proximity, by including a\ndiffusion map in the conditional probability. We conducted experiments on three real-world networks\nfor multi-label classification and link prediction, and the associated results demonstrate the superiority\nof the proposed DMTE model.\n\nAcknowledgments\nThe authors would like to thank the anonymous reviewers for their insightful comments. This research\nwas supported in part by DARPA, DOE, NIH, ONR and NSF.\n\nReferences\n[1] E. Abrahamson and L. Rosenkopf. Social network effects on the extent of innovation diffusion:\nA computer simulation. Organization Science, 1997.\n\n[2] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In NIPS, 2016.\n\n[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and\nclustering. In Advances in Neural Information Processing Systems, 2002.\n\n[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\nResearch, 2003.\n\n[5] S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural\ninformation. In Proceedings of the 24th ACM International on Conference on Information and\nKnowledge Management. ACM, 2015.\n\n[6] Z. Gan, Y. Pu, R. Henao, C. Li, X. He, and L. Carin. 
Learning generic sentence representations\nusing convolutional neural networks. In Proceedings of the 2017 Conference on Empirical\nMethods in Natural Language Processing, 2017.\n\n[7] A. Graves, N. Jaitly, and A.-r. Mohamed. Hybrid speech recognition with deep bidirectional\nLSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on.\nIEEE, 2013.\n\n[8] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings\nof the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.\nACM, 2016.\n\n[9] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daum\u00e9 III. Deep unordered composition\nrivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of\nthe Association for Computational Linguistics and the 7th International Joint Conference on\nNatural Language Processing (Volume 1: Long Papers), volume 1, 2015.\n\n[10] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling\nsentences. arXiv preprint arXiv:1404.2188, 2014.\n\n[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[12] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler.\nSkip-thought vectors. In Advances in Neural Information Processing Systems, 2015.\n\n[13] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International\nConference on Machine Learning, 2014.\n\n[14] L. L\u00fc and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical\nMechanics and its Applications, 2011.\n\n[15] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet\nportals with machine learning. Information Retrieval, 2000.\n\n[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in\nvector space. 
arXiv preprint arXiv:1301.3781, 2013.\n\n[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of\nwords and phrases and their compositionality. In Advances in Neural Information Processing\nSystems, 2013.\n\n[18] J. Mitchell and M. Lapata. Composition in distributional models of semantics. Cognitive\nScience, 2010.\n\n[19] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang. Tri-party deep network representation. Network,\n2016.\n\n[20] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In\nProceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and\nData Mining. ACM, 2014.\n\n[21] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding.\nScience, 2000.\n\n[22] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective\nclassification in network data. AI Magazine, 2008.\n\n[23] D. Shen, G. Wang, W. Wang, M. Renqiang Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin.\nBaseline needs more love: On simple word-embedding-based models and associated pooling\nmechanisms. In ACL, 2018.\n\n[24] D. Shen, X. Zhang, R. Henao, and L. Carin. Improved semantic-aware network embedding\nwith fine-grained word alignment. arXiv preprint arXiv:1808.09633, 2018.\n\n[25] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep\nmodels for semantic compositionality over a sentiment treebank. In Proceedings of the 2013\nConference on Empirical Methods in Natural Language Processing, 2013.\n\n[26] X. Sun, J. Guo, X. Ding, and T. Liu. A general framework for content-enhanced network\nrepresentation learning. arXiv preprint arXiv:1610.02906, 2016.\n\n[27] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information\nnetwork embedding. 
In Proceedings of the 24th International Conference on World Wide Web.\nInternational World Wide Web Conferences Steering Committee, 2015.\n\n[28] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of\nacademic social networks. In Proceedings of the 14th ACM SIGKDD International Conference\non Knowledge Discovery and Data Mining. ACM, 2008.\n\n[29] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear\ndimensionality reduction. Science, 2000.\n\n[30] C. Tu, H. Liu, Z. Liu, and M. Sun. CANE: Context-aware network embedding for relation\nmodeling. In Proceedings of the 55th Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers), volume 1, 2017.\n\n[31] C. Tu, Z. Liu, and M. Sun. Inferring correspondences from multiple sources for microblog user\ntags. In Chinese National Conference on Social Media Processing. Springer, 2014.\n\n[32] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd\nACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.\n\n[33] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. Network representation learning with rich\ntext information. In IJCAI, 2015.\n\n[34] X. Zhang, R. Henao, Z. Gan, Y. Li, and L. Carin. Multi-label learning from medical plain text\nwith convolutional residual models. arXiv preprint arXiv:1801.05062, 2018.", "award": [], "sourceid": 3765, "authors": [{"given_name": "Xinyuan", "family_name": "Zhang", "institution": "Duke University"}, {"given_name": "Yitong", "family_name": "Li", "institution": "Duke University"}, {"given_name": "Dinghan", "family_name": "Shen", "institution": "Duke University"}, {"given_name": "Lawrence", "family_name": "Carin", "institution": "Duke University"}]}