{"title": "Poincar\u00e9 Embeddings for Learning Hierarchical Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 6338, "page_last": 6347, "abstract": "Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, state-of-the-art embedding methods typically do not account for latent hierarchical structures which are characteristic for many complex symbolic datasets. In this work, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space -- or more precisely into an n-dimensional Poincar\u00e9 ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We present an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincar\u00e9 embeddings can outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.", "full_text": "Poincar\u00e9 Embeddings for\n\nLearning Hierarchical Representations\n\nMaximilian Nickel\nFacebook AI Research\n\nmaxn@fb.com\n\nAbstract\n\nDouwe Kiela\n\nFacebook AI Research\n\ndkiela@fb.com\n\nRepresentation learning has become an invaluable approach for learning from sym-\nbolic data such as text and graphs. However, state-of-the-art embedding methods\ntypically do not account for latent hierarchical structures which are characteristic\nfor many complex symbolic datasets. In this work, we introduce a new approach\nfor learning hierarchical representations of symbolic data by embedding them into\nhyperbolic space \u2013 or more precisely into an n-dimensional Poincar\u00e9 ball. 
Due to\nthe underlying hyperbolic geometry, this allows us to learn parsimonious repre-\nsentations of symbolic data by simultaneously capturing hierarchy and similarity.\nWe present an ef\ufb01cient algorithm to learn the embeddings based on Riemannian\noptimization and show experimentally that Poincar\u00e9 embeddings can outperform\nEuclidean embeddings signi\ufb01cantly on data with latent hierarchies, both in terms\nof representation capacity and in terms of generalization ability.\n\n1\n\nIntroduction\n\nLearning representations of symbolic data such as text, graphs and multi-relational data has become\na central paradigm in machine learning and arti\ufb01cial intelligence. For instance, word embeddings\nsuch as WORD2VEC [20], GLOVE [27] and FASTTEXT [5, 16] are widely used for tasks ranging\nfrom machine translation to sentiment analysis. Similarly, embeddings of graphs such as latent space\nembeddings [15], NODE2VEC [13], and DEEPWALK [28] have found important applications for\ncommunity detection and link prediction in social networks. Furthermore, embeddings of multi-\nrelational data such as RESCAL [22], TRANSE [7], and Universal Schema [31] are being used for\nknowledge graph completion and information extraction.\nTypically, the objective of an embedding method is to organize symbolic objects (e.g., words, entities,\nconcepts) in a way such that their similarity or distance in the embedding space re\ufb02ects their semantic\nsimilarity. For instance, Mikolov et al. [20] embed words in Rd such that their inner product is\nmaximized when words co-occur within similar contexts in text corpora. This is motivated by the\ndistributional hypothesis [14, 11], i.e., that the meaning of words can be derived from the contexts in\nwhich they appear. Similarly, Hoff et al. [15] embed social networks such that the distance between\nsocial actors is minimized if they are connected in the network. 
This re\ufb02ects the homophily property\nthat is characteristic for many networks, i.e. that similar actors tend to associate with each other.\nAlthough embedding methods have proven successful in numerous applications, they suffer from\na fundamental limitation: their ability to model complex patterns is inherently bounded by the\ndimensionality of the embedding space. For instance, Nickel et al. [23] showed that linear embeddings\nof graphs can require a prohibitively large dimensionality to model certain types of relations. Although\nnon-linear embeddings can mitigate this problem [8], complex graph patterns can still require a\ncomputationally infeasible embedding dimension. As a consequence, no method yet exists that is\nable to compute embeddings of large graph-structured data \u2013 such as social networks, knowledge\ngraphs or taxonomies \u2013 without loss of information. Since the ability to express information is a\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fprecondition for learning and generalization, it is therefore important to increase the representation\ncapacity of embedding methods such that they can realistically be used to model complex patterns on\na large scale. In this work, we focus on mitigating this problem for a certain class of symbolic data,\ni.e., large datasets whose objects can be organized according to a latent hierarchy \u2013 a property that is\ninherent in many complex datasets. For instance, the existence of power-law distributions in datasets\ncan often be traced back to hierarchical structures [29]. Prominent examples of power-law distributed\ndata include natural language (Zipf\u2019s law [40]) and scale-free networks such as social and semantic\nnetworks [32]. Similarly, the empirical analysis of Adcock et al. 
[1] indicated that many real-world\nnetworks exhibit an underlying tree-like structure.\nTo exploit this structural property for learning more ef\ufb01cient representations, we propose to compute\nembeddings not in Euclidean but in hyperbolic space, i.e., space with constant negative curvature.\nInformally, hyperbolic space can be thought of as a continuous version of trees and as such it is\nnaturally equipped to model hierarchical structures. For instance, it has been shown that any \ufb01nite\ntree can be embedded into a \ufb01nite hyperbolic space such that distances are preserved approximately\n[12]. We base our approach on a particular model of hyperbolic space, i.e., the Poincar\u00e9 ball model,\nas it is well-suited for gradient-based optimization. This allows us to develop an ef\ufb01cient algorithm\nfor computing the embeddings based on Riemannian optimization, which is easily parallelizable\nand scales to large datasets. Experimentally, we show that our approach can provide high quality\nembeddings of large taxonomies \u2013 both with and without missing data. Moreover, we show that\nembeddings trained on WORDNET provide state-of-the-art performance for lexical entailment. On\ncollaboration networks, we also show that Poincar\u00e9 embeddings are successful in predicting links in\ngraphs where they outperform Euclidean embeddings, especially in low dimensions.\nThe remainder of this paper is organized as follows: In Section 2 we brie\ufb02y review hyperbolic\ngeometry and discuss related work. In Section 3 we introduce Poincar\u00e9 embeddings and present\na scalable algorithm to compute them. In Section 4 we evaluate our approach on tasks such as\ntaxonomy embedding, link prediction in networks and predicting lexical entailment.\n\n2 Embeddings and Hyperbolic Geometry\n\nHyperbolic geometry is a non-Euclidean geometry which studies spaces of constant negative curvature.\nIt is, for instance, related to Minkowski spacetime in special relativity. 
In network science, hyperbolic spaces have started to receive attention as they are well-suited to model hierarchical data. For instance, consider the task of embedding a tree into a metric space such that its structure is reflected in the embedding. A regular tree with branching factor b has (b + 1)b^(\u2113-1) nodes at level \u2113 and ((b + 1)b^\u2113 - 2)/(b - 1) nodes on levels less than or equal to \u2113. Hence, the number of children grows exponentially with their distance to the root of the tree. In hyperbolic geometry this kind of tree structure can be modeled easily in two dimensions: nodes that are exactly \u2113 levels below the root are placed on a sphere in hyperbolic space with radius r \u221d \u2113, and nodes that are less than \u2113 levels below the root are located within this sphere. This type of construction is possible as hyperbolic disc area and circle length grow exponentially with their radius (see footnote 1); see Figure 1b for an example. Intuitively, hyperbolic spaces can be thought of as continuous versions of trees or, vice versa, trees can be thought of as \"discrete hyperbolic spaces\" [19]. In R^2, a similar construction is not possible, as circle length (2\u03c0r) and disc area (\u03c0r^2) grow only linearly and quadratically with regard to r in Euclidean geometry. Instead, it is necessary to increase the dimensionality of the embedding to model increasingly complex hierarchies. As the number of parameters increases, this can lead to computational problems in terms of runtime and memory complexity as well as to overfitting.\nDue to these properties, hyperbolic space has recently been considered to model complex networks. For instance, Kleinberg [18] introduced hyperbolic geometry for greedy routing in geographic communication networks. Similarly, Bogu\u00f1\u00e1 et al. [4] proposed hyperbolic embeddings of the AS Internet topology to perform greedy shortest path routing in the embedding space. 
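The level counts for a regular tree follow from each non-root node having b children, and the closed form can be confirmed in a few lines. This is an illustrative sketch in plain Python (the function names are ours, not from the paper):

```python
def nodes_at_level(b, level):
    # Number of nodes exactly `level` levels below the root of a regular tree
    # with branching factor b (the root has b + 1 neighbors): (b + 1) * b**(level - 1).
    return 1 if level == 0 else (b + 1) * b ** (level - 1)

def nodes_up_to_level(b, level):
    # Closed form ((b + 1) * b**level - 2) / (b - 1), counting the root as well.
    return ((b + 1) * b ** level - 2) // (b - 1)
```

Summing the per-level counts reproduces the closed form, and the counts grow exponentially with the level, which is exactly the growth a two-dimensional hyperbolic disc can accommodate.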
Krioukov et al. [19] developed a geometric framework to model complex networks using hyperbolic space and showed how typical properties such as heterogeneous degree distributions and strong clustering can emerge by assuming an underlying hyperbolic geometry to networks. Furthermore, Adcock et al. [1] proposed a measure based on Gromov\u2019s \u03b4-hyperbolicity [12] to characterize the tree-likeness of graphs. Ontrup and Ritter [25] proposed hyperbolic self-organizing maps for data exploration. Asta and Shalizi [3] used hyperbolic embeddings to compare the global structure of networks. Sun et al. [33] proposed Space-Time embeddings to learn representations of non-metric data.\n\nFootnote 1: For instance, in a two-dimensional hyperbolic space with constant curvature K = -1, the length of a circle is given as 2\u03c0 sinh r while the area of a disc is given as 2\u03c0(cosh r - 1). Since sinh r = (e^r - e^{-r})/2 and cosh r = (e^r + e^{-r})/2, both disc area and circle length grow exponentially with r.\n\nFigure 1: (a) Geodesics of the Poincar\u00e9 disk. Due to the negative curvature of B, the distance of points increases exponentially (relative to their Euclidean distance) the closer they are to the boundary. (b) Embedding of a regular tree in B^2 such that all connected nodes are spaced equally far apart (i.e., all black line segments have identical hyperbolic length). (c) Growth of the Poincar\u00e9 distance d(u, v) relative to the Euclidean distance and the norm of v (for fixed ||u|| = 0.9).\n\nEuclidean embeddings, on the other hand, have become a popular approach to represent symbolic data in machine learning and artificial intelligence. 
For instance, in addition to the methods discussed\nin Section 1, Paccanaro and Hinton [26] proposed one of the \ufb01rst embedding methods to learn from\nrelational data. More recently, Holographic [24] and Complex Embeddings [34] have shown state-\nof-the-art performance in Knowledge Graph completion. In relation to hierarchical representations,\nVilnis and McCallum [36] proposed to learn density-based word representations, i.e., Gaussian\nembeddings, to capture uncertainty and asymmetry. Given ordered input pairs, Vendrov et al. [35]\nproposed Order Embeddings to model visual-semantic hierarchies over words, sentences, and images.\nDemeester et al. [10] showed that including prior information about hypernymy relations in form of\nlogical rules can improve the quality of word embeddings.\n\n3 Poincar\u00e9 Embeddings\n\nIn the following, we are interested in \ufb01nding embeddings of symbolic data such that their distance in\nthe embedding space re\ufb02ects their semantic similarity. We assume that there exists a latent hierarchy\nin which the symbols can be organized. In addition to the similarity of objects, we intend to also\nre\ufb02ect this hierarchy in the embedding space to improve over existing methods in two ways:\n\n1. By inducing an appropriate structural bias on the embedding space we aim at improving\n\ngeneralization performance as well as runtime and memory complexity.\n\n2. By capturing the hierarchy explicitly in the embedding space, we aim at gaining additional\ninsights about the relationships between symbols and the importance of individual symbols.\n\nAlthough we assume that there exists a latent hierarchy, we do not assume that we have direct access\nto information about this hierarchy, e.g., via ordered input pairs. Instead, we consider the task of\ninferring the hierarchical relationships fully unsupervised, as is, for instance, necessary for text and\nnetwork data. 
For these reasons \u2013 and motivated by the discussion in Section 2 \u2013 we embed symbolic data into hyperbolic space H. In contrast to Euclidean space R, there exist multiple, equivalent models of H such as the Beltrami-Klein model, the hyperboloid model, and the Poincar\u00e9 half-plane model. In the following, we will base our approach on the Poincar\u00e9 ball model, as it is well-suited for gradient-based optimization. In particular, let B^d = {x \u2208 R^d | ||x|| < 1} be the open d-dimensional unit ball, where || \u00b7 || denotes the Euclidean norm. The Poincar\u00e9 ball model of hyperbolic space then corresponds to the Riemannian manifold (B^d, g_x), i.e., the open unit ball equipped with the Riemannian metric tensor\n\ng_x = ( 2 / (1 - ||x||^2) )^2 g^E,\n\nwhere x \u2208 B^d and g^E denotes the Euclidean metric tensor. Furthermore, the distance between points u, v \u2208 B^d is given as\n\nd(u, v) = arcosh( 1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)) ).   (1)\n\nThe boundary of the ball is denoted by \u2202B. It corresponds to the sphere S^(d-1) and is not part of the manifold, but represents infinitely distant points. Geodesics in B^d are then circles that are orthogonal to \u2202B (as well as all diameters). See Figure 1a for an illustration.\nIt can be seen from Equation (1) that the distance within the Poincar\u00e9 ball changes smoothly with respect to the location of u and v. This locality property of the Poincar\u00e9 distance is key for finding continuous embeddings of hierarchies. For instance, by placing the root node of a tree at the origin of B^d, it would have a relatively small distance to all other nodes, as its Euclidean norm is zero. 
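Equation (1) is simple to evaluate directly. The following minimal sketch (plain Python; the function name is ours, not from the paper) makes the locality property concrete: near the origin the Poincar\u00e9 distance is close to twice the Euclidean distance, while near the boundary it blows up:

```python
import math

def poincare_distance(u, v):
    # Equation (1): d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    norm_u2 = sum(x * x for x in u)
    norm_v2 = sum(x * x for x in v)
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * diff2 / ((1 - norm_u2) * (1 - norm_v2)))
```

For u = (0.9, 0) and the origin, the Euclidean distance is 0.9 but the hyperbolic distance is roughly 2.94; the same Euclidean offset placed deeper inside the ball is much cheaper, which is what lets leaf nodes sit near the boundary.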
On the other hand, leaf nodes can be placed close to the boundary of the Poincar\u00e9 ball, as the distance grows very fast between points with a norm close to one. Furthermore, please note that Equation (1) is symmetric and that the hierarchical organization of the space is solely determined by the distance of points to the origin. Due to this self-organizing property, Equation (1) is applicable in an unsupervised setting where the hierarchical order of objects is not specified in advance, such as for text and networks. Remarkably, Equation (1) therefore allows us to learn embeddings that simultaneously capture the hierarchy of objects (through their norm) as well as their similarity (through their distance).\nSince a single hierarchical structure can be well represented in two dimensions, the Poincar\u00e9 disk (B^2) is a common way to model hyperbolic geometry. In our method, we instead use the Poincar\u00e9 ball (B^d), for two main reasons: First, in many datasets such as text corpora, multiple latent hierarchies can co-exist, which cannot always be modeled in two dimensions. Second, a larger embedding dimension can decrease the difficulty for an optimization method to find a good embedding (also for single hierarchies), as it allows for more degrees of freedom during the optimization process.\nTo compute Poincar\u00e9 embeddings for a set of symbols S = {x_i}_{i=1}^n, we are then interested in finding embeddings \u0398 = {\u03b8_i}_{i=1}^n, where \u03b8_i \u2208 B^d. We assume we are given a problem-specific loss function L(\u0398) which encourages semantically similar objects to be close in the embedding space according to their Poincar\u00e9 distance. To estimate \u0398, we then solve the optimization problem\n\n\u0398\u2032 \u2190 arg min_\u0398 L(\u0398)   s.t. 
\u2200 \u03b8_i \u2208 \u0398 : ||\u03b8_i|| < 1.   (2)\n\nWe will discuss specific loss functions in Section 4.\n\n3.1 Optimization\n\nSince the Poincar\u00e9 ball has a Riemannian manifold structure, we can optimize Equation (2) via stochastic Riemannian optimization methods such as RSGD [6] or RSVRG [39]. In particular, let T_\u03b8 B denote the tangent space of a point \u03b8 \u2208 B^d. Furthermore, let \u2207_R \u2208 T_\u03b8 B denote the Riemannian gradient of L(\u03b8) and let \u2207_E denote the Euclidean gradient of L(\u03b8). Using RSGD, parameter updates to minimize Equation (2) are then of the form\n\n\u03b8_{t+1} = R_{\u03b8_t}(-\u03b7_t \u2207_R L(\u03b8_t)),\n\nwhere R_{\u03b8_t} denotes the retraction onto B at \u03b8 and \u03b7_t denotes the learning rate at time t. Hence, for the minimization of Equation (2), we require the Riemannian gradient and a suitable retraction. Since the Poincar\u00e9 ball is a conformal model of hyperbolic space, the angles between adjacent vectors are identical to their angles in Euclidean space. The length of vectors, however, might differ. To derive the Riemannian gradient from the Euclidean gradient, it is sufficient to rescale \u2207_E with the inverse of the Poincar\u00e9 ball metric tensor, i.e., g_\u03b8^{-1}. Since g_\u03b8 is a scalar matrix, the inverse is trivial to compute. Furthermore, since Equation (1) is fully differentiable, the Euclidean gradient can easily be derived using standard calculus. In particular, the Euclidean gradient \u2207_E = \u2202L(\u03b8)/\u2202d(\u03b8, x) \u00b7 \u2202d(\u03b8, x)/\u2202\u03b8 depends on the gradient of L, which we assume is known, and the partial derivatives of the Poincar\u00e9 distance, which can be computed as follows: Let \u03b1 = 1 - ||\u03b8||^2, \u03b2 = 1 - ||x||^2 and let \u03b3 = 1 + 2/(\u03b1\u03b2) ||\u03b8 - x||^2. The partial derivative of the Poincar\u00e9 distance with respect to \u03b8 is then given as\n\n\u2202d(\u03b8, x)/\u2202\u03b8 = 4 / (\u03b2 sqrt(\u03b3^2 - 1)) \u00b7 ( (||x||^2 - 2\u27e8\u03b8, x\u27e9 + 1)/\u03b1^2 \u00b7 \u03b8 - x/\u03b1 ).   (3)\n\nSince d(\u00b7, \u00b7) is symmetric, the partial derivative \u2202d(x, \u03b8)/\u2202\u03b8 can be derived analogously. As retraction operation we use R_\u03b8(v) = \u03b8 + v. In combination with the Riemannian gradient, this then corresponds to the well-known natural gradient method [2]. Furthermore, we constrain the embeddings to remain within the Poincar\u00e9 ball via the projection\n\nproj(\u03b8) = \u03b8/||\u03b8|| - \u03b5   if ||\u03b8|| \u2265 1,   and   \u03b8   otherwise,\n\nwhere \u03b5 is a small constant to ensure numerical stability. In all experiments we used \u03b5 = 10^{-5}. In summary, the full update for a single embedding is then of the form\n\n\u03b8_{t+1} \u2190 proj( \u03b8_t - \u03b7_t (1 - ||\u03b8_t||^2)^2 / 4 \u00b7 \u2207_E ).   (4)\n\nIt can be seen from Equations (3) and (4) that this algorithm scales well to large datasets, as the computational and memory complexity of an update depends linearly on the embedding dimension. Moreover, the algorithm is straightforward to parallelize via methods such as Hogwild [30], as the updates are sparse (only a small number of embeddings are modified in an update) and collisions are very unlikely on large-scale data.\n\n3.2 Training Details\n\nIn addition to this optimization procedure, we found that the following training details were helpful for obtaining good representations: First, we initialize all embeddings randomly from the uniform distribution U(-0.001, 0.001). 
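Putting Equations (3) and (4) together, a single RSGD step only needs the Euclidean gradient, the metric rescaling, and the projection. A minimal plain-Python sketch (the names are ours; the projection renormalizes to norm 1 - epsilon, which is one common reading of the paper's proj):

```python
import math

EPS = 1e-5  # the paper uses eps = 1e-5 for numerical stability

def project(theta):
    # Pull a point that left the open unit ball back to norm 1 - EPS.
    norm = math.sqrt(sum(x * x for x in theta))
    if norm >= 1.0:
        return [x / norm * (1.0 - EPS) for x in theta]
    return theta

def rsgd_step(theta, eucl_grad, lr):
    # Equation (4): rescale the Euclidean gradient by the inverse metric,
    # i.e. (1 - ||theta||^2)^2 / 4, take a retraction step, then project.
    norm2 = sum(x * x for x in theta)
    scale = (1.0 - norm2) ** 2 / 4.0
    stepped = [t - lr * scale * g for t, g in zip(theta, eucl_grad)]
    return project(stepped)
```

Because the rescaling factor vanishes as ||theta|| approaches 1, points near the boundary move very little under the same Euclidean gradient, which matches the geometry of the ball.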
This causes embeddings to be initialized close to the origin of B^d. Second, we found that a good initial angular layout can be helpful to find good embeddings. For this reason, we train during an initial \"burn-in\" phase with a reduced learning rate \u03b7/c. In combination with initializing close to the origin, this can improve the angular layout without moving too far towards the boundary. In our experiments, we set c = 10 and the duration of the burn-in to 10 epochs.\n\n4 Evaluation\n\nIn this section, we evaluate the quality of Poincar\u00e9 embeddings for a variety of tasks, i.e., for the embedding of taxonomies, for link prediction in networks, and for modeling lexical entailment. In all tasks, we train on data where the hierarchy of objects is not explicitly encoded. This allows us to evaluate the ability of the embeddings to infer hierarchical relationships without supervision. Moreover, since we are mostly interested in the properties of the metric space, we focus on embeddings based purely on the Poincar\u00e9 distance and on models with comparable expressivity. In particular, we compare the Poincar\u00e9 distance as defined in Equation (1) to the following two distance functions:\n\nEuclidean: In all cases, we include the Euclidean distance d(u, v) = ||u - v||^2. As the Euclidean distance is flat and symmetric, we expect that it requires a large dimensionality to model the hierarchical structure of the data.\n\nTranslational: For asymmetric data, we also include the score function d(u, v) = ||u - v + r||^2, as proposed by Bordes et al. [7] for modeling large-scale graph-structured data. For this score function, we also learn the global translation vector r during training.\n\nNote that the translational score function has, due to its asymmetry, more information about the nature of an embedding problem than a symmetric distance when the order of (u, v) indicates the hierarchy of elements. 
This is, for instance, the case for is-a(u, v) relations in taxonomies. For the Poincar\u00e9 distance and the Euclidean distance we could randomly permute the order of (u, v) and obtain the identical embedding, while this is not the case for the translational score function. As such, it is not fully unsupervised and only applicable where this hierarchical information is available.\n\n4.1 Embedding Taxonomies\n\nIn the first set of experiments, we are interested in evaluating the ability of Poincar\u00e9 embeddings to embed data that exhibits a clear latent hierarchical structure. For this purpose, we conduct experiments on the transitive closure of the WORDNET noun hierarchy [21] in two settings:\n\nTable 1: Experimental results on the transitive closure of the WORDNET noun hierarchy. Highlighted cells indicate the best Euclidean embeddings as well as the Poincar\u00e9 embeddings which achieve equal or better results. Bold numbers indicate absolute best results.\n\nDimensionality                        5       10      20      50     100     200\nWORDNET Reconstruction\n  Euclidean      Rank            3542.3   2286.9  1685.9  1281.7  1187.3  1157.3\n                 MAP              0.024    0.059   0.087   0.140   0.162   0.168\n  Translational  Rank             205.9    179.4    95.3    92.8    92.7    91.0\n                 MAP              0.517    0.503   0.563   0.566   0.562   0.565\n  Poincar\u00e9       Rank               4.9     4.02    3.84    3.98     3.9    3.83\n                 MAP              0.823    0.851   0.855    0.86   0.857    0.87\nWORDNET Link Prediction\n  Euclidean      Rank            3311.1   2199.5   952.3   351.4   190.7    81.5\n                 MAP              0.024    0.059   0.176   0.286   0.428   0.490\n  Translational  Rank              65.7     56.6    52.1    47.2    43.2    40.4\n                 MAP              0.545    0.554   0.554    0.56   0.562   0.559\n  Poincar\u00e9       Rank               5.7      4.3     4.9     4.6     4.6     4.6\n                 MAP              0.825    0.852   0.861   0.863   0.856   0.855\n\nReconstruction: To evaluate representation capacity, we embed fully observed data and reconstruct it from the embedding. 
The reconstruction error in relation to the embedding dimension is then a measure for the capacity of the model.\n\nLink Prediction: To test generalization performance, we split the data into a train, validation and test set by randomly holding out observed links. The validation and test set do not include links involving root or leaf nodes, as these links would either be trivial or impossible to predict reliably.\n\nSince we are embedding the transitive closure, the hierarchical structure is not directly visible from the raw data but has to be inferred. For Poincar\u00e9 and Euclidean embeddings we additionally remove the directionality of the edges and embed undirected graphs. The transitive closure of the WORDNET noun hierarchy consists of 82,115 nouns and 743,241 hypernymy relations.\nOn this data, we learn embeddings in both settings as follows: Let D = {(u, v)} be the set of observed hypernymy relations between noun pairs. We then learn embeddings of all symbols in D such that related objects are close in the embedding space. In particular, we minimize the loss function\n\nL(\u0398) = \u03a3_{(u,v) \u2208 D} log ( e^{-d(u,v)} / \u03a3_{v\u2032 \u2208 N(u)} e^{-d(u,v\u2032)} ),   (5)\n\nwhere N(u) = {v\u2032 | (u, v\u2032) \u2209 D} \u222a {v} is the set of negative examples for u (including v). For training, we randomly sample 10 negative examples per positive example. Equation (5) is similar to the loss used in Linear Relational Embeddings [26] (with additional negative sampling) and encourages related objects to be closer to each other than objects for which we didn\u2019t observe a relationship. This choice of loss function is motivated by the observation that we don\u2019t want to push symbols that belong to distinct subtrees arbitrarily far apart, as their subtrees might still be close. Instead, we only want them to be farther apart than symbols with an observed relation.\nWe evaluate the quality of the embeddings as commonly done for graph embeddings [7, 24]: For each observed relationship (u, v), we rank its distance d(u, v) among the ground-truth negative examples for u, i.e., among the set {d(u, v\u2032) | (u, v\u2032) \u2209 D}. In the Reconstruction setting, we evaluate the ranking on all nouns in the dataset. We then record the mean rank of v as well as the mean average precision (MAP) of the ranking. The results of these experiments are shown in Table 1. It can be seen that Poincar\u00e9 embeddings are very successful in the embedding of large taxonomies \u2013 both with regard to their representation capacity and their generalization performance. Even compared to Translational embeddings, which have more information about the structure of the task, Poincar\u00e9 embeddings show a greatly improved performance while using an embedding that is smaller by an order of magnitude. Furthermore, the results of Poincar\u00e9 embeddings in the link prediction task are robust with regard to the embedding dimension. We attribute this result to the structural bias of the embedding space, which could lead to reduced overfitting on data with a clear latent hierarchy. Additionally, Figure 2 shows a visualization of a two-dimensional Poincar\u00e9 embedding. For the purpose of clarity, this embedding has been trained only on the mammals subtree of WORDNET.\n\nFigure 2: Two-dimensional Poincar\u00e9 embeddings of the transitive closure of the WORDNET mammals subtree, (a) as an intermediate embedding after 20 epochs and (b) after convergence. Ground-truth is-a relations of the original WORDNET tree are indicated via blue edges. A Poincar\u00e9 embedding with d = 5 achieves mean rank 1.26 and MAP 0.927 on this subtree.\n\n4.2 Network Embeddings\n\nNext, we evaluated the performance of Poincar\u00e9 embeddings for modeling complex networks. Since edges in such networks can often be explained via latent hierarchies over their nodes [9], we are interested in the benefits of Poincar\u00e9 embeddings in terms of representation size and generalization performance. We performed our experiments on four commonly used social networks, i.e., ASTROPH, CONDMAT, GRQC, and HEPPH. These networks represent scientific collaborations such that there exists an undirected edge between two persons if they co-authored a paper. For these networks, we model the probability of an edge as proposed by Krioukov et al. [19] via the Fermi-Dirac distribution\n\nP((u, v) = 1 | \u0398) = 1 / (e^{(d(u,v) - r)/t} + 1),   (6)\n\nwhere r, t > 0 are hyperparameters. Here, r corresponds to the radius around each point u such that points within this radius are likely to have an edge with u. The parameter t specifies the steepness of the logistic function and influences both average clustering as well as the degree distribution [19]. We use the cross-entropy loss to learn the embeddings and sample negatives as in Section 4.1.\nFor evaluation, we split each dataset randomly into train, validation, and test set. The hyperparameters r and t were tuned for each method on the validation set. Table 2 lists the MAP score of Poincar\u00e9 and Euclidean embeddings on the test set for the hyperparameters with the best validation score. Additionally, we also list the reconstruction performance without missing data. 
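The Fermi-Dirac model in Equation (6) is a logistic function of distance, shifted by the radius r and scaled by the temperature t. A minimal sketch (plain Python; the function name is ours, not from the paper):

```python
import math

def edge_probability(dist, r, t):
    # Equation (6): P((u, v) = 1) = 1 / (exp((d(u, v) - r) / t) + 1)
    return 1.0 / (math.exp((dist - r) / t) + 1.0)
```

At dist = r the probability is exactly 0.5; smaller t makes the transition between likely and unlikely edges sharper.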
Translational embeddings are not applicable to these datasets as they consist of undirected edges. It can be seen that Poincar\u00e9 embeddings again perform very well on these datasets and \u2013 especially in the low-dimensional regime \u2013 outperform Euclidean embeddings.\n\nTable 2: Mean average precision for Reconstruction and Link Prediction on network data.\n\n                                     Reconstruction                 Link Prediction\nDimensionality                 10      20      50     100      10      20      50     100\nASTROPH (N=18,772; E=198,110)\n  Euclidean                 0.376   0.788   0.969   0.989   0.508   0.815   0.946   0.960\n  Poincar\u00e9                  0.703   0.897   0.982   0.990   0.671   0.860   0.977   0.988\nCONDMAT (N=23,133; E=93,497)\n  Euclidean                 0.356   0.860   0.991   0.998   0.308   0.617   0.725   0.736\n  Poincar\u00e9                  0.799   0.963   0.996   0.998   0.539   0.718   0.756   0.758\nGRQC (N=5,242; E=14,496)\n  Euclidean                 0.522   0.931   0.994   0.998   0.438   0.584   0.673   0.683\n  Poincar\u00e9                  0.990   0.999   0.999   0.999   0.660   0.691   0.695   0.697\nHEPPH (N=12,008; E=118,521)\n  Euclidean                 0.434   0.742   0.937   0.966   0.642   0.749   0.779   0.783\n  Poincar\u00e9                  0.811   0.960   0.994   0.997   0.683   0.743   0.770   0.774\n\n4.3 Lexical Entailment\n\nAn interesting aspect of Poincar\u00e9 embeddings is that they allow us to make graded assertions about hierarchical relationships, as hierarchies are represented in a continuous space. We test this property on HYPERLEX [37], which is a gold standard resource for evaluating how well semantic models capture graded lexical entailment by quantifying to what degree X is a type of Y via ratings on a scale of [0, 10]. Using the noun part of HYPERLEX, which consists of 2163 rated noun pairs, we then evaluated how well Poincar\u00e9 embeddings reflect these graded assertions. For this purpose, we used the Poincar\u00e9 embeddings that were obtained in Section 4.1 by embedding WORDNET with a dimensionality d = 5. Note that these embeddings were not specifically trained for this task. To determine to what extent is-a(u, v) is true, we used the score function\n\nscore(is-a(u, v)) = -(1 + \u03b1(||v|| - ||u||)) d(u, v).   (7)\n\nHere, the term \u03b1(||v|| - ||u||) acts as a penalty when v is lower in the embedding hierarchy, i.e., when v has a higher norm than u. The hyperparameter \u03b1 determines the severity of the penalty. In our experiments we set \u03b1 = 10^3.\nUsing Equation (7), we scored all noun pairs in HYPERLEX and recorded Spearman\u2019s rank correlation with the ground-truth ranking. The results of this experiment are shown in Table 3. It can be seen that the ranking based on Poincar\u00e9 embeddings clearly outperforms all state-of-the-art methods evaluated in [37]. Methods in Table 3 that are prefixed with WN also use WORDNET as a basis and therefore are most comparable. The same embeddings also achieved a state-of-the-art accuracy of 0.86 on WBLESS [38, 17], which evaluates non-graded lexical entailment.\n\nTable 3: Spearman\u2019s \u03c1 for Lexical Entailment on HYPERLEX.\n\n       FR  SLQS-Sim  WN-Basic  WN-WuP  WN-LCh  Vis-ID  Euclidean  Poincar\u00e9\n\u03c1   0.283     0.229     0.240   0.214   0.214   0.253      0.389     0.512\n\n5 Discussion and Future Work\n\nIn this paper, we introduced Poincar\u00e9 embeddings for learning representations of symbolic data and showed how they can simultaneously learn the similarity and the hierarchy of objects. Furthermore, we proposed an efficient algorithm to compute the embeddings and showed experimentally that Poincar\u00e9 embeddings provide important advantages over Euclidean embeddings on hierarchical data: First, Poincar\u00e9 embeddings enable parsimonious representations that allow us to learn high-quality embeddings of large-scale taxonomies. 
Second, excellent link prediction results indicate that hyperbolic geometry can introduce an important structural bias for the embedding of complex symbolic data. Third, state-of-the-art results for predicting lexical entailment suggest that the hierarchy in the embedding space corresponds well to the underlying semantics of the data.

The focus of this work was to evaluate general properties of hyperbolic geometry for the embedding of symbolic data. In future work, we intend to expand the applications of Poincaré embeddings – for instance to multi-relational data – and to derive models that are tailored to specific tasks such as word embeddings. Furthermore, we have shown that natural-gradient-based optimization already produces very good embeddings and scales to large datasets. We expect that a full Riemannian optimization approach can further increase the quality of the embeddings and lead to faster convergence.

An important aspect of future work also regards the applicability of hyperbolic embeddings in downstream tasks: models that operate on embeddings often make an implicit Euclidean assumption and likely require some adaptation to be compatible with hyperbolic spaces.

References

[1] Aaron B. Adcock, Blair D. Sullivan, and Michael W. Mahoney. Tree-like structure in large social and information networks. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 1–10. IEEE, 2013.

[2] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[3] Dena Marie Asta and Cosma Rohilla Shalizi. Geometric network comparisons. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pages 102–110. AUAI Press, 2015.

[4] M. Boguñá, F. Papadopoulos, and D. Krioukov.
Sustaining the internet with hyperbolic mapping. Nature Communications, 1:62, 2010.

[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[6] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217–2229, 2013.

[7] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, pages 2787–2795, 2013.

[8] Guillaume Bouchard, Sameer Singh, and Theo Trouillon. On approximate reasoning capabilities of low-rank vector spaces. AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches, 2015.

[9] Aaron Clauset, Cristopher Moore, and Mark E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.

[10] Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. Lifted rule injection for relation embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1389–1399. The Association for Computational Linguistics, 2016.

[11] John Rupert Firth. A synopsis of linguistic theory, 1930–1955. Studies in Linguistic Analysis, 1957.

[12] Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pages 75–263. Springer, 1987.

[13] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[14] Zellig S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

[15] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis.
Journal of the American Statistical Association, 97(460):1090–1098, 2002.

[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

[17] Douwe Kiela, Laura Rimell, Ivan Vulic, and Stephen Clark. Exploiting image generality for lexical entailment detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), pages 119–124. ACL, 2015.

[18] Robert Kleinberg. Geographic routing using hyperbolic space. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications, pages 1902–1909. IEEE, 2007.

[19] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.

[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.

[21] George Miller and Christiane Fellbaum. WordNet: An electronic lexical database, 1998.

[22] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 809–816. Omnipress, 2011.

[23] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In Advances in Neural Information Processing Systems 27, pages 1179–1187, 2014.

[24] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1955–1961. AAAI Press, 2016.

[25] Jörg Ontrup and Helge Ritter.
Large-scale data exploration with the hierarchically growing hyperbolic SOM. Neural Networks, 19(6):751–761, 2006.

[26] Alberto Paccanaro and Geoffrey E. Hinton. Learning distributed representations of concepts using linear relational embedding. IEEE Trans. Knowl. Data Eng., 13(2):232–244, 2001.

[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.

[28] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[29] Erzsébet Ravasz and Albert-László Barabási. Hierarchical organization in complex networks. Physical Review E, 67(2):026112, 2003.

[30] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pages 693–701, 2011.

[31] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. Relation extraction with matrix factorization and universal schemas. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, pages 74–84. The Association for Computational Linguistics, 2013.

[32] Mark Steyvers and Joshua B. Tenenbaum. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):41–78, 2005.

[33] Ke Sun, Jun Wang, Alexandros Kalousis, and Stéphane Marchand-Maillet. Space-time local embeddings.
In Advances in Neural Information Processing Systems 28, pages 100–108, 2015.

[34] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 2071–2080. JMLR.org, 2016.

[35] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.

[36] Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In International Conference on Learning Representations (ICLR), 2015.

[37] Ivan Vulic, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. HyperLex: A large-scale evaluation of graded lexical entailment. arXiv preprint arXiv:1608.02117, 2016.

[38] Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259. Dublin City University and Association for Computational Linguistics, 2014.

[39] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems 29, pages 4592–4600, 2016.

[40] George Kingsley Zipf. Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, 1949.