{"title": "SimplE Embedding for Link Prediction in Knowledge Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 4284, "page_last": 4295, "abstract": "Knowledge graphs contain knowledge about the world and provide a structured representation of this knowledge. Current knowledge graphs contain only a small subset of what is true in the world. Link prediction approaches aim at predicting new links for a knowledge graph given the existing links among the entities. Tensor factorization approaches have proved promising for such link prediction problems. Proposed in 1927, Canonical Polyadic (CP) decomposition is among the first tensor factorization approaches. CP generally performs poorly for link prediction as it learns two independent embedding vectors for each entity, whereas they are really tied. We present a simple enhancement of CP (which we call SimplE) to allow the two embeddings of each entity to be learned dependently. The complexity of SimplE grows linearly with the size of embeddings. The embeddings learned through SimplE are interpretable, and certain types of background knowledge can be incorporated into these embeddings through weight tying. \nWe prove SimplE is fully expressive and derive a bound on the size of its embeddings for full expressivity. \nWe show empirically that, despite its simplicity, SimplE outperforms several state-of-the-art tensor factorization techniques.\nSimplE's code is available on GitHub at https://github.com/Mehran-k/SimplE.", "full_text": "SimplE Embedding for Link Prediction in Knowledge\n\nGraphs\n\nSeyed Mehran Kazemi\n\nUniversity of British Columbia\n\nVancouver, BC, Canada\nsmkazemi@cs.ubc.ca\n\nDavid Poole\n\nUniversity of British Columbia\n\nVancouver, BC, Canada\n\npoole@cs.ubc.ca\n\nAbstract\n\nKnowledge graphs contain knowledge about the world and provide a structured\nrepresentation of this knowledge. Current knowledge graphs contain only a small\nsubset of what is true in the world. 
Link prediction approaches aim at predicting new links for a knowledge graph given the existing links among the entities. Tensor factorization approaches have proved promising for such link prediction problems. Proposed in 1927, Canonical Polyadic (CP) decomposition is among the first tensor factorization approaches. CP generally performs poorly for link prediction as it learns two independent embedding vectors for each entity, whereas they are really tied. We present a simple enhancement of CP (which we call SimplE) to allow the two embeddings of each entity to be learned dependently. The complexity of SimplE grows linearly with the size of embeddings. The embeddings learned through SimplE are interpretable, and certain types of background knowledge can be incorporated into these embeddings through weight tying. We prove SimplE is fully expressive and derive a bound on the size of its embeddings for full expressivity. We show empirically that, despite its simplicity, SimplE outperforms several state-of-the-art tensor factorization techniques. SimplE's code is available on GitHub at https://github.com/Mehran-k/SimplE.\n\n1 Introduction\n\nDuring the past two decades, several knowledge graphs (KGs) containing (perhaps probabilistic) facts about the world have been constructed. These KGs have applications in several fields including search, question answering, natural language processing, recommendation systems, etc. Due to the enormous number of facts that could be asserted about our world and the difficulty in accessing and storing all these facts, KGs are incomplete. However, it is possible to predict new links in a KG based on the existing ones. Link prediction and several other related problems aiming at reasoning with entities and relationships are studied under the umbrella of statistical relational learning (SRL) [12, 31, 7]. The problem of link prediction for KGs is also known as knowledge graph completion. 
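As a concrete (hypothetical) illustration of this setup, a knowledge graph can be held as a set of triples, with completion amounting to scoring the triples that are absent; the entity and relation names below are toy examples, not from any benchmark:

```python
# A toy knowledge graph as a set of (head, relation, tail) triples.
# Entity and relation names here are illustrative only.
kg = {
    ("paris", "capitalOf", "france"),
    ("berlin", "capitalOf", "germany"),
    ("paris", "locatedIn", "france"),
}

entities = {h for h, _, _ in kg} | {t for _, _, t in kg}

def candidate_triples(relation):
    """All (head, relation, tail) triples over known entities that are
    absent from the KG; a link predictor would score each of these."""
    return [(h, relation, t)
            for h in sorted(entities) for t in sorted(entities)
            if h != t and (h, relation, t) not in kg]

print(len(candidate_triples("locatedIn")))  # 11 of the 12 ordered pairs
```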
A KG can be represented as a set of (head, relation, tail) triples¹. The problem of KG completion can be viewed as predicting new triples based on the existing ones.\n\nTensor factorization approaches have proved to be an effective SRL approach for KG completion [29, 4, 39, 26]. These approaches consider embeddings for each entity and each relation. To predict whether a triple holds, they use a function which takes the embeddings for the head and tail entities and the relation as input and outputs a number indicating the predicted probability. Details and discussions of these approaches can be found in several recent surveys [27, 43].\n\n¹Triples are complete for relations. They are sometimes written as (subject, verb, object) or (individual, property, value).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOne of the first tensor factorization approaches is the Canonical Polyadic (CP) decomposition [15]. This approach learns one embedding vector for each relation and two embedding vectors for each entity, one to be used when the entity is the head and one to be used when the entity is the tail. The head embedding of an entity is learned independently of (and is unrelated to) its tail embedding. This independence has caused CP to perform poorly for KG completion [40]. In this paper, we develop a tensor factorization approach based on CP that addresses the independence among the two embedding vectors of the entities. Due to the simplicity of our model, we call it SimplE (Simple Embedding). We show that SimplE: 1- can be considered a bilinear model, 2- is fully expressive, 3- is capable of encoding background knowledge into its embeddings through parameter sharing (aka weight tying), and 4- performs very well empirically despite (or maybe because of) its simplicity. We also discuss several disadvantages of other existing approaches. 
We prove that many existing translational approaches (see e.g., [4, 17, 41, 26]) are not fully expressive and we identify severe restrictions on what they can represent. We also show that the function used in ComplEx [39, 40], a state-of-the-art approach for link prediction, involves redundant computations.\n\n2 Background and Notation\n\nWe represent vectors with lowercase letters and matrices with uppercase letters. Let v, w, x ∈ R^d be vectors of length d. We define ⟨v, w, x⟩ := Σ_{j=1}^{d} v[j] * w[j] * x[j], where v[j], w[j], and x[j] represent the jth element of v, w and x respectively. That is, ⟨v, w, x⟩ := (v ⊙ w) · x, where ⊙ represents element-wise (Hadamard) multiplication and · represents dot product. I^d represents an identity matrix of size d. [v1; v2; ...; vn] represents the concatenation of n vectors v1, v2, ..., vn.\n\nLet E and R represent the set of entities and relations respectively. A triple is represented as (h, r, t), where h ∈ E is the head, r ∈ R is the relation, and t ∈ E is the tail of the triple. Let ζ represent the set of all triples that are true in a world (e.g., (paris, capitalOf, france)), and ζ′ represent the ones that are false (e.g., (paris, capitalOf, italy)). A knowledge graph KG is a subset of ζ. A relation r is reflexive on a set E of entities if (e, r, e) ∈ ζ for all entities e ∈ E. A relation r is symmetric on a set E of entities if (e1, r, e2) ∈ ζ ⇔ (e2, r, e1) ∈ ζ for all pairs of entities e1, e2 ∈ E, and is anti-symmetric if (e1, r, e2) ∈ ζ ⇔ (e2, r, e1) ∈ ζ′. A relation r is transitive on a set E of entities if (e1, r, e2) ∈ ζ ∧ (e2, r, e3) ∈ ζ ⇒ (e1, r, e3) ∈ ζ for all e1, e2, e3 ∈ E. The inverse of a relation r, denoted as r⁻¹, is a relation such that for any two entities ei and ej, (ei, r, ej) ∈ ζ ⇔ (ej, r⁻¹, ei) ∈ ζ.\n\nAn embedding is a function from an entity or a relation to one or more vectors or matrices of numbers. A tensor factorization model defines two things: 1- the embedding functions for entities and relations, 2- a function f taking the embeddings for h, r and t as input and generating a prediction of whether (h, r, t) is in ζ or not. The values of the embeddings are learned using the triples in a KG. A tensor factorization model is fully expressive if given any ground truth (full assignment of truth values to all triples), there exists an assignment of values to the embeddings of the entities and relations that accurately separates the correct triples from incorrect ones.\n\n3 Related Work\n\nTranslational Approaches define additive functions over embeddings. In many translational approaches, the embedding for each entity e is a single vector v_e ∈ R^d and the embedding for each relation r is a vector v_r ∈ R^d′ and two matrices P_r ∈ R^{d′×d} and Q_r ∈ R^{d′×d}. The dissimilarity function for a triple (h, r, t) is defined as ||P_r v_h + v_r − Q_r v_t||_i (i.e. encouraging P_r v_h + v_r ≈ Q_r v_t), where ||v||_i represents norm i of vector v. Translational approaches having this dissimilarity function usually differ on the restrictions they impose on P_r and Q_r. In TransE [4], d = d′ and P_r = Q_r = I^d. In TransR [22], P_r = Q_r. In STransE [26], no restrictions are imposed on the matrices. 
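The shared dissimilarity function above can be sketched as follows; the vectors and matrices are toy values, and the TransE special case (P_r = Q_r = I^d) is recovered by passing identity matrices:

```python
import numpy as np

def dissimilarity(P_r, v_r, Q_r, v_h, v_t, norm=2):
    """Generic translational dissimilarity ||P_r v_h + v_r - Q_r v_t||_i.
    TransE: P_r = Q_r = I; TransR: P_r = Q_r; STransE: unrestricted."""
    return np.linalg.norm(P_r @ v_h + v_r - Q_r @ v_t, ord=norm)

d = 3
I = np.eye(d)
v_h = np.array([0.1, 0.2, 0.3])
v_r = np.array([0.2, 0.1, 0.0])
v_t = v_h + v_r  # a triple that fits the TransE assumption exactly

# With P_r = Q_r = I (the TransE case), the dissimilarity here is 0.
print(dissimilarity(I, v_r, I, v_h, v_t))
```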
FTransE [11] slightly changes the dissimilarity function, defining it as ||P_r v_h + v_r − α Q_r v_t||_i for a value of α that minimizes the norm for each triple. In the rest of the paper, we let FSTransE represent the FTransE model where no restrictions are imposed over P_r and Q_r.\n\nMultiplicative Approaches define product-based functions over embeddings. DistMult [46], one of the simplest multiplicative approaches, considers the embeddings for each entity and each relation to be v_e ∈ R^d and v_r ∈ R^d respectively and defines its similarity function for a triple (h, r, t) as ⟨v_h, v_r, v_t⟩. Since DistMult does not distinguish between head and tail entities, it can only model symmetric relations. ComplEx [39] extends DistMult by considering complex-valued instead of real-valued vectors for entities and relations. For each entity e, let re_e ∈ R^d and im_e ∈ R^d represent the real and imaginary parts of the embedding for e. For each relation r, let re_r ∈ R^d and im_r ∈ R^d represent the real and imaginary parts of the embedding for r. Then the similarity function of ComplEx for a triple (h, r, t) is defined as Real(Σ_{j=1}^{d} (re_h[j] + im_h[j]i) * (re_r[j] + im_r[j]i) * (re_t[j] − im_t[j]i)), where Real(α + βi) = α and i² = −1. One can easily verify that the function used by ComplEx can be expanded and written as ⟨re_h, re_r, re_t⟩ + ⟨re_h, im_r, im_t⟩ + ⟨im_h, re_r, im_t⟩ − ⟨im_h, im_r, re_t⟩. In RESCAL [28], the embedding vector for each entity e is v_e ∈ R^d and for each relation r is v_r ∈ R^{d×d}, and the similarity function for a triple (h, r, t) is v_r · vec(v_h ⊗ v_t), where ⊗ represents the outer product of two vectors and vec(.) vectorizes the input matrix. HolE [32] is a multiplicative model that is isomorphic to ComplEx [14].\n\nDeep Learning Approaches generally use a neural network that learns how the head, relation, and tail embeddings interact. E-MLP [37] considers the embeddings for each entity e to be a vector v_e ∈ R^d, and for each relation r to be a matrix M_r ∈ R^{2k×m} and a vector v_r ∈ R^m. To make a prediction about a triple (h, r, t), E-MLP feeds [v_h; v_t] ∈ R^{2d} into a two-layer neural network whose weights for the first layer are the matrix M_r and for the second layer are v_r. ER-MLP [10] considers the embeddings for both entities and relations to be single vectors and feeds [v_h; v_r; v_t] ∈ R^{3d} into a two-layer neural network. In [35], once the entity vectors are provided by the convolutional neural network and the relation vector is provided by the long short-term memory network, for each triple the vectors are concatenated similar to ER-MLP and are fed into a four-layer neural network. Neural tensor network (NTN) [37] combines E-MLP with several bilinear parts (see Subsection 5.4 for a definition of bilinear models).\n\n4 SimplE: A Simple Yet Fully Expressive Model\n\nIn Canonical Polyadic (CP) decomposition [15], the embedding for each entity e has two vectors h_e, t_e ∈ R^d, and for each relation r has a single vector v_r ∈ R^d. h_e captures e's behaviour as the head of a relation and t_e captures e's behaviour as the tail of a relation. The similarity function for a triple (e1, r, e2) is ⟨h_e1, v_r, t_e2⟩. In CP, the two embedding vectors for entities are learned independently of each other: observing (e1, r, e2) ∈ ζ only updates h_e1 and t_e2, not t_e1 and h_e2.\n\nExample 1. Let likes(p, m) represent if a person p likes a movie m and acted(m, a) represent who acted in which movie. Which actors play in a movie is expected to affect who likes the movie. 
In CP, observations about likes only update the t vector of movies and observations about acted only update the h vector. Therefore, what is being learned about movies through observations about acted does not affect the predictions about likes, and vice versa.\n\nSimplE takes advantage of the inverse of relations to address the independence of the two vectors for each entity in CP. While inverses of relations have been used for other purposes (see e.g., [20, 21, 6]), using them to address the independence of the entity vectors in CP is a novel contribution.\n\nModel Definition: SimplE considers two vectors h_e, t_e ∈ R^d as the embedding of each entity e (similar to CP), and two vectors v_r, v_r⁻¹ ∈ R^d for each relation r. The similarity function of SimplE for a triple (ei, r, ej) is defined as (1/2)(⟨h_ei, v_r, t_ej⟩ + ⟨h_ej, v_r⁻¹, t_ei⟩), i.e. the average of the CP scores for (ei, r, ej) and (ej, r⁻¹, ei). In our experiments, we also consider a different variant, which we call SimplE-ignr. During training, for each correct (incorrect) triple (ei, r, ej), SimplE-ignr updates the embeddings such that each of the two scores ⟨h_ei, v_r, t_ej⟩ and ⟨h_ej, v_r⁻¹, t_ei⟩ becomes larger (smaller). During testing, SimplE-ignr ignores the r⁻¹s and defines the similarity function to be ⟨h_ei, v_r, t_ej⟩.\n\nLearning SimplE Models: To learn a SimplE model, we use stochastic gradient descent with mini-batches. In each learning iteration, we iteratively take in a batch of positive triples from the KG, then for each positive triple in the batch we generate n negative triples by corrupting the positive triple. We use Bordes et al. [4]'s procedure to corrupt positive triples. The procedure is as follows. For a positive triple (h, r, t), we randomly decide to corrupt the head or tail. 
If the head is selected, we replace h in the triple with an entity h′ randomly selected from E − {h} and generate the corrupted triple (h′, r, t). If the tail is selected, we replace t in the triple with an entity t′ randomly selected from E − {t} and generate the corrupted triple (h, r, t′). We generate a labelled batch LB by labelling positive triples as +1 and negatives as −1. Once we have a labelled batch, following [39] we optimize the L2 regularized negative log-likelihood of the batch: min_θ Σ_{((h,r,t),l) ∈ LB} softplus(−l · φ(h, r, t)) + λ||θ||²₂, where θ represents the parameters of the model (the parameters in the embeddings), l represents the label of a triple, φ(h, r, t) represents the similarity score for triple (h, r, t), λ is the regularization hyper-parameter, and softplus(x) = log(1 + exp(x)).\n\nFigure 1: h_e's and v_r's in the proof of Proposition 1. [The figure shows the block-structured 0/1 embedding vectors h(e_0), ..., h(e_|E|−1) and v(r_0), ..., v(r_|R|−1) constructed in the proof.]\n\nWhile several previous works (e.g., TransE, TransR, STransE, etc.) 
consider a margin-based loss function, Trouillon and Nickel [38] show that the margin-based loss function is more prone to overfitting compared to log-likelihood.\n\n5 Theoretical Analyses\n\nIn this section, we provide some theoretical analyses of SimplE and other existing approaches.\n\n5.1 Full Expressiveness\n\nThe following proposition establishes the full expressivity of SimplE.\n\nProposition 1. For any ground truth over entities E and relations R containing γ true facts, there exists a SimplE model with embedding vectors of size min(|E| · |R|, γ + 1) that represents that ground truth.\n\nProof. First, we prove the |E| · |R| bound. With embedding vectors of size |E| * |R|, for each entity ei we let the n-th element of h_ei be 1 if (n mod |E|) = i and 0 otherwise, and for each relation rj we let the n-th element of v_rj be 1 if (n div |E|) = j and 0 otherwise (see Fig 1). Then for each ei and rj, the product of h_ei and v_rj is 0 everywhere except for the (j * |E| + i)-th element. So for each entity ek, we set the (j * |E| + i)-th element of t_ek to be 1 if (ei, rj, ek) holds and −1 otherwise.\n\nNow we prove the γ + 1 bound. Let γ be zero (base of the induction). We can have embedding vectors of size 1 for each entity and relation, setting the value for entities to 1 and for relations to −1. Then ⟨h_ei, v_rj, t_ek⟩ is negative for all entities ei and ek and relations rj. So there exist embedding vectors of size γ + 1 that represent this ground truth. Let us assume for any ground truth where γ = n − 1 (1 ≤ n ≤ |R||E|²), there exists an assignment of values to embedding vectors of size n that represents that ground truth (assumption of the induction). We must prove for any ground truth where γ = n, there exists an assignment of values to embedding vectors of size n + 1 that represents this ground truth. Let (ei, rj, ek) be one of the n true facts. Consider a modified ground truth which is identical to the ground truth with n true facts, except that (ei, rj, ek) is assigned false. The modified ground truth has n − 1 true facts and, based on the assumption of the induction, we can represent it using some embedding vectors of size n. Let q = ⟨h_ei, v_rj, t_ek⟩, where h_ei, v_rj and t_ek are the embedding vectors that represent the modified ground truth. We add an element to the end of all embedding vectors and set it to 0. This increases the vector sizes to n + 1 but does not change any scores. Then we set the last element of h_ei to 1, of v_rj to 1, and of t_ek to −q + 1. This ensures that ⟨h_ei, v_rj, t_ek⟩ = q + (−q + 1) = 1 > 0 for the new vectors, and no other score is affected (a score changes only if all three of its vectors have a non-zero last element).\n\nDistMult is not fully expressive as it forces relations to be symmetric. It has been shown in [40] that ComplEx is fully expressive with embeddings of length at most |E| · |R|. According to the universal approximation theorem [5, 16], under certain conditions, neural networks are universal approximators of continuous functions over compact sets. Therefore, we would expect there to be a representation based on neural networks that can approximate any ground truth, but the number of hidden units might have to grow with the number of triples. Wang et al. [44] prove that TransE is not fully expressive. Proposition 2 proves that not only TransE but also many other translational approaches are not fully expressive. The proposition also identifies severe restrictions on what relations these approaches can represent.\n\nProposition 2. FSTransE is not fully expressive and has the following restrictions. 
R1: If a relation r is reflexive on Δ ⊂ E, r must also be symmetric on Δ. R2: If r is reflexive on Δ ⊂ E, r must also be transitive on Δ. R3: If entity e1 has relation r with every entity in Δ ⊂ E and entity e2 has relation r with one of the entities in Δ, then e2 must have the relation r with every entity in Δ.\n\nProof. For any entity e and relation r, let pr_e = P_r v_e and qr_e = Q_r v_e. For a triple (h, r, t) to hold, we should ideally have pr_h + v_r = α qr_t for some α. We assume s1, s2, s3 and s4 are entities in Δ.\n\nR1: A relation r being reflexive on Δ implies pr_s1 + v_r = α1 qr_s1 and pr_s2 + v_r = α2 qr_s2. Suppose (s1, r, s2) holds as well. Then we know pr_s1 + v_r = α3 qr_s2. Therefore, pr_s2 + v_r = α2 qr_s2 = (α2/α3)(pr_s1 + v_r) = (α2/α3) α1 qr_s1 = α4 qr_s1, where α4 = α2 α1 / α3. Therefore, (s2, r, s1) must hold.\n\nR2: A relation r being reflexive implies pr_s1 + v_r = α1 qr_s1, pr_s2 + v_r = α2 qr_s2, and pr_s3 + v_r = α3 qr_s3. Suppose (s1, r, s2) and (s2, r, s3) hold. Then we know pr_s1 + v_r = α4 qr_s2 and pr_s2 + v_r = α5 qr_s3. We can conclude pr_s1 + v_r = α4 qr_s2 = (α4/α2)(pr_s2 + v_r) = (α4/α2) α5 qr_s3 = α6 qr_s3, where α6 = α4 α5 / α2. The above equality proves (s1, r, s3) must hold.\n\nR3: Let e2 have relation r with s1. We know pr_e1 + v_r = α1 qr_s1, pr_e1 + v_r = α2 qr_s2, and pr_e2 + v_r = α3 qr_s1. We can conclude pr_e2 + v_r = α3 qr_s1 = (α3/α1)(pr_e1 + v_r) = (α3/α1) α2 qr_s2 = α4 qr_s2, where α4 = α3 α2 / α1. Therefore, (e2, r, s2) must hold.\n\nCorollary 1. Other variants of translational approaches such as TransE, FTransE, STransE, TransH [41], and TransR [22] also have the restrictions mentioned in Proposition 2.\n\n5.2 Incorporating Background Knowledge into the Embeddings\n\nIn SimplE, each element of the embedding vector of an entity can be considered as a feature of the entity, and the corresponding element of a relation can be considered as a measure of how important that feature is to the relation. Such interpretability allows the embeddings learned through SimplE for an entity (or relation) to be potentially transferred to other domains. It also allows for incorporating observed features of entities into the embeddings by fixing one of the elements of the embedding vector to the observed value. Nickel et al. [30] show that incorporating such features helps reduce the size of the embeddings.\n\nRecently, incorporating background knowledge into tensor factorization approaches has been the focus of several studies. Towards this goal, many existing approaches rely on post-processing steps or add additional terms to the loss function to penalize predictions that violate the background knowledge [34, 42, 45, 13, 9]. Minervini et al. [25] show how background knowledge in terms of equivalence and inversion can be incorporated into several tensor factorization models through parameter tying². Incorporating background knowledge by parameter tying has the advantage of guaranteeing that the predictions follow the background knowledge for all embeddings. In this section, we show how three types of background knowledge, namely symmetry, anti-symmetry, and inversion, can be incorporated into the embeddings of SimplE by tying the parameters³ (we ignore the equivalence between two relations as it is trivial).\n\nProposition 3. Let r be a relation such that for any two entities ei and ej we have (ei, r, ej) ∈ ζ ⇔ (ej, r, ei) ∈ ζ (i.e. r is symmetric). This property of r can be encoded into SimplE by tying the parameters v_r⁻¹ to v_r.\n\nProof. 
If (ei, r, ej) ∈ ζ, then a SimplE model makes ⟨h_ei, v_r, t_ej⟩ and ⟨h_ej, v_r⁻¹, t_ei⟩ positive. By tying the parameters v_r⁻¹ to v_r, we can conclude that ⟨h_ej, v_r, t_ei⟩ and ⟨h_ei, v_r⁻¹, t_ej⟩ also become positive. Therefore, the SimplE model predicts (ej, r, ei) ∈ ζ.\n\n²Although their incorporation of inversion into DistMult is not correct as it has side effects.\n³Note that such background knowledge can be exerted on some relations selectively and not on the others. This is different than, e.g., DistMult, which enforces symmetry on all relations.\n\nProposition 4. Let r be a relation such that for any two entities ei and ej we have (ei, r, ej) ∈ ζ ⇔ (ej, r, ei) ∈ ζ′ (i.e. r is anti-symmetric). This property of r can be encoded into SimplE by tying the parameters v_r⁻¹ to the negative of v_r.\n\nProof. If (ei, r, ej) ∈ ζ, then a SimplE model makes ⟨h_ei, v_r, t_ej⟩ and ⟨h_ej, v_r⁻¹, t_ei⟩ positive. By tying the parameters v_r⁻¹ to the negative of v_r, we can conclude that ⟨h_ej, v_r, t_ei⟩ and ⟨h_ei, v_r⁻¹, t_ej⟩ become negative. Therefore, the SimplE model predicts (ej, r, ei) ∈ ζ′.\n\nProposition 5. Let r1 and r2 be two relations such that for any two entities ei and ej we have (ei, r1, ej) ∈ ζ ⇔ (ej, r2, ei) ∈ ζ (i.e. r2 is the inverse of r1). This property of r1 and r2 can be encoded into SimplE by tying the parameters v_r1⁻¹ to v_r2 and v_r2⁻¹ to v_r1.\n\nProof. If (ei, r1, ej) ∈ ζ, then a SimplE model makes ⟨h_ei, v_r1, t_ej⟩ and ⟨h_ej, v_r1⁻¹, t_ei⟩ positive. By tying the parameters v_r1⁻¹ to v_r2 and v_r2⁻¹ to v_r1, we can conclude that ⟨h_ei, v_r2⁻¹, t_ej⟩ and ⟨h_ej, v_r2, t_ei⟩ also become positive. Therefore, the SimplE model predicts (ej, r2, ei) ∈ ζ.\n\n5.3 Time Complexity and Parameter Growth\n\nAs described in [3], to scale to the size of the current KGs and keep up with their growth, a relational model must have a linear time and memory complexity. Furthermore, one of the important challenges in designing tensor factorization models is the trade-off between expressivity and model complexity. Models with many parameters usually overfit and give poor performance. While the time complexity of TransE is O(d), where d is the size of the embedding vectors, adding the projections as in STransE (through the two relation matrices) increases the time complexity to O(d²). Besides time complexity, the number of parameters to be learned from data grows quadratically with d. A quadratic time complexity and parameter growth may raise two issues: 1- scalability problems, 2- overfitting. The same issues exist for models such as RESCAL and NTNs that have quadratic or higher time complexities and parameter growths. DistMult and ComplEx have linear time complexities, and the number of their parameters grows linearly with d.\n\nThe time complexity of both SimplE-ignr and SimplE is O(d), i.e. linear in the size of the vector embeddings. SimplE-ignr requires one multiplication between three vectors for each triple. This number is 2 for SimplE and 4 for ComplEx. 
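These operation counts can be illustrated with a small numpy sketch (random toy embeddings; the expanded four-term ComplEx form is the one given in Section 3):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
h_ei, t_ei, h_ej, t_ej = (rng.standard_normal(d) for _ in range(4))
v_r, v_r_inv = rng.standard_normal(d), rng.standard_normal(d)

def tprod(a, b, c):
    """<a, b, c> = sum_j a[j]*b[j]*c[j]: one three-way vector product."""
    return float(np.sum(a * b * c))

# SimplE-ignr: 1 product per triple; SimplE: 2 (relation and its inverse).
score_ignr = tprod(h_ei, v_r, t_ej)
score_simple = 0.5 * (tprod(h_ei, v_r, t_ej) + tprod(h_ej, v_r_inv, t_ei))

# ComplEx: 4 products for the same parameter budget (expanded form).
re_h, im_h, re_r, im_r, re_t, im_t = (rng.standard_normal(d) for _ in range(6))
score_complex = (tprod(re_h, re_r, re_t) + tprod(re_h, im_r, im_t)
                 + tprod(im_h, re_r, im_t) - tprod(im_h, im_r, re_t))

# Cross-check against the direct complex-valued definition.
z = (re_h + 1j * im_h) * (re_r + 1j * im_r) * (re_t - 1j * im_t)
assert np.isclose(score_complex, z.sum().real)
```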
Thus, with the same number of parameters, SimplE-ignr and SimplE reduce the computations by a factor of 4 and 2 respectively compared to ComplEx.\n\n5.4 Family of Bilinear Models\n\nBilinear models correspond to the family of models where the embedding for each entity e is v_e ∈ R^d, for each relation r is M_r ∈ R^{d×d} (with certain restrictions), and the similarity function for a triple (h, r, t) is defined as v_h^T M_r v_t. These models have shown remarkable performance for link prediction in knowledge graphs [31]. DistMult, ComplEx, and RESCAL are known to belong to the family of bilinear models. We show that SimplE (and CP) also belong to this family.\n\nDistMult can be considered a bilinear model which restricts the M_r matrices to be diagonal as in Fig. 2(a). For ComplEx, if we consider the embedding for each entity e to be a single vector [re_e; im_e] ∈ R^{2d}, then it can be considered a bilinear model with its M_r matrices constrained according to Fig. 2(b). RESCAL can be considered a bilinear model which imposes no constraints on the M_r matrices. Considering the embedding for each entity e to be a single vector [h_e; t_e] ∈ R^{2d}, CP can be viewed as a bilinear model with its M_r matrices constrained as in Fig. 2(c). For a triple (e1, r, e2), multiplying [h_e1; t_e1] by M_r results in a vector v_e1r whose first half is zero and whose second half corresponds to an element-wise product of h_e1 with the parameters in M_r. Multiplying v_e1r by [h_e2; t_e2] corresponds to ignoring h_e2 (since the first half of v_e1r is zeros) and taking the dot-product of the second half of v_e1r with t_e2. SimplE can be viewed as a bilinear model similar to CP except that the M_r matrices are constrained as in Fig. 2(d). The extra parameters added to the matrix compared to CP correspond to the parameters in the inverse of the relations.\n\nFigure 2: The constraints over M_r matrices for bilinear models (a) DistMult, (b) ComplEx, (c) CP, and (d) SimplE. The lines represent where the parameters are; other elements of the matrices are constrained to be zero. In ComplEx, the parameters represented by the dashed line are tied to the parameters represented by the solid line, and the parameters represented by the dotted line are tied to the negative of the dotted-and-dashed line.\n\nThe constraint over M_r matrices in SimplE is very similar to the constraint in DistMult. v_h^T M_r in both SimplE and DistMult can be considered as an element-wise product of the parameters, except that the M_r's in SimplE swap the first and second halves of the resulting vector. Compared to ComplEx, SimplE removes the parameters on the main diagonal of the M_r's. Note that several other restrictions on the M_r matrices are equivalent to SimplE. Viewing SimplE as a single-vector-per-entity model makes it easily integrable (or compatible) with other embedding models (in knowledge graph completion, computer vision and natural language processing) such as [35, 47, 36].\n\n5.5 Redundancy in ComplEx\n\nAs argued earlier, with the same number of parameters, the number of computations in ComplEx is 4x and 2x more than in SimplE-ignr and SimplE. Here we show that a portion of the computations performed by ComplEx to make predictions is redundant. Consider a ComplEx model with embedding vectors of size 1 (for ease of exposition). Suppose the embedding vectors for h, r and t are [α1 + β1 i], [α2 + β2 i], and [α3 + β3 i] respectively. Then the probability of (h, r, t) being correct according to ComplEx is proportional to the sum of the following four terms: 1) α1 α2 α3, 2) α1 β2 β3, 3) β1 α2 β3, and 4) −β1 β2 α3. It can be verified that for any assignment of (non-zero) values to the αi's and βi's, at least one of the above terms is negative. 
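This claim can be checked numerically. The sketch below (illustrative values only) enumerates the sign patterns of unit-magnitude size-1 embeddings and confirms that at least one of the four terms is always negative:

```python
import itertools

def complex_terms(a1, b1, a2, b2, a3, b3):
    """Four terms of the size-1 ComplEx score Real((a1+b1*i)(a2+b2*i)(a3-b3*i))."""
    return (a1 * a2 * a3, a1 * b2 * b3, b1 * a2 * b3, -b1 * b2 * a3)

# The product of the four terms equals -(a1*a2*a3*b1*b2*b3)**2 < 0 for
# non-zero inputs, so an odd number of them -- hence at least one -- is
# negative. Check every sign pattern:
for signs in itertools.product([1.0, -1.0], repeat=6):
    assert min(complex_terms(*signs)) < 0
```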
This means that for a correct triple, ComplEx uses three terms to overestimate its score and then uses one term to cancel the overestimation.

The following example shows how this redundancy in ComplEx may affect its interpretability:

Example 2. Consider a ComplEx model with embeddings of size 1. Consider entities e1, e2 and e3 with embedding vectors [1 + 4i], [1 + 6i], and [3 + 2i] respectively, and a relation r with embedding vector [1 + i]. According to ComplEx, the score for the triple (e1, r, e3) is positive, suggesting e1 probably has relation r with e3. However, the score for the triple (e2, r, e3) is negative, suggesting e2 probably does not have relation r with e3. Since the only difference between e1 and e2 is that the imaginary part changes from 4 to 6, it is difficult to associate a meaning to these numbers.

6 Experiments and Results

Datasets: We conducted experiments on two standard benchmarks: WN18, a subset of WordNet [24], and FB15k, a subset of Freebase [2]. We used the same train/valid/test sets as in [4]. WN18 contains 40,943 entities, 18 relations, 141,442 train, 5,000 validation and 5,000 test triples. FB15k contains 14,951 entities, 1,345 relations, 483,142 train, 50,000 validation, and 59,071 test triples.

Baselines: We compare SimplE with several existing tensor factorization approaches. Our baselines include Canonical Polyadic (CP) decomposition, TransE, TransR, DistMult, NTN, STransE, ER-MLP, and ComplEx. Given that we use the same data splits and objective function as ComplEx, we report the results of CP, TransE, DistMult, and ComplEx from [39].
We report the results of TransR and NTN from [27], and ER-MLP from [32] for further comparison.

Evaluation Metrics: To measure and compare the performances of different models, for each test triple (h, r, t) we compute the score of (h', r, t) triples for all h' ∈ E and calculate the ranking rank_h of the triple having h, and we compute the score of (h, r, t') triples for all t' ∈ E and calculate the ranking rank_t of the triple having t. Then we compute the mean reciprocal rank (MRR) of these rankings as the mean of the inverse of the rankings:

MRR = (1 / (2|tt|)) · Σ_{(h,r,t) ∈ tt} (1/rank_h + 1/rank_t)

where tt represents the test triples. MRR is a more robust measure than mean rank, since a single bad ranking can largely influence mean rank.

Table 1: Results on WN18 and FB15k. Best results are in bold.

                           WN18                                    FB15k
Model        MRR(Filt) MRR(Raw) Hit@1  Hit@3  Hit@10   MRR(Filt) MRR(Raw) Hit@1  Hit@3  Hit@10
CP             0.075     0.058  0.049  0.080   0.125     0.326     0.152  0.219  0.376   0.532
TransE         0.454     0.335  0.089  0.823   0.934     0.380     0.221  0.231  0.472   0.641
TransR         0.605     0.427  0.335  0.876   0.940     0.346     0.198  0.218  0.404   0.582
DistMult       0.822     0.532  0.728  0.914   0.936     0.654     0.242  0.546  0.733   0.824
NTN            0.530     -      -      -       0.661     0.250     -      -      -       0.414
STransE        0.657     0.469  -      -       0.934     0.543     0.252  -      -       0.797
ER-MLP         0.712     0.528  0.626  0.775   0.863     0.288     0.155  0.173  0.317   0.501
ComplEx        0.941     0.587  0.936  0.945   0.947     0.692     0.242  0.599  0.759   0.840
SimplE-ignr    0.939     0.576  0.938  0.940   0.941     0.700     0.237  0.625  0.754   0.821
SimplE         0.942     0.588  0.939  0.944   0.947     0.727     0.239  0.660  0.773   0.838

Bordes et al. [4] identified an issue with the above procedure for calculating the MRR (hereafter referred to as raw MRR).
For a test triple (h, r, t), since there can be several entities h' ∈ E for which (h', r, t) holds, measuring the quality of a model based on its ranking for (h, r, t) may be flawed. That is because two models may rank the test triple (h, r, t) to be second, when the first model ranks a correct triple (e.g., from the train or validation set) (h', r, t) to be first and the second model ranks an incorrect triple (h'', r, t) to be first. Both these models will get the same score for this test triple, when the first model should get a higher score. To address this issue, [4] proposed a modification to raw MRR. For each test triple (h, r, t), instead of finding the rank of this triple among triples (h', r, t) for all h' ∈ E (or (h, r, t') for all t' ∈ E), they proposed to calculate the rank among triples (h', r, t) only for h' ∈ E such that (h', r, t) ∉ train ∪ valid ∪ test. Following [4], we call this measure filtered MRR. We also report hit@k measures. The hit@k for a model is computed as the percentage of test triples whose ranking (computed as described earlier) is less than or equal to k.

Implementation: We implemented SimplE in TensorFlow [1]. We tuned our hyper-parameters over the validation set. We used the same search grid on embedding size and λ as [39] to make our results directly comparable to theirs. We fixed the maximum number of iterations to 1000 and the number of batches to 100. We set the learning rate for WN18 to 0.1 and for FB15k to 0.05 and used AdaGrad to update the learning rate after each batch.
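The raw vs. filtered ranking procedure described under Evaluation Metrics can be sketched as follows (a toy illustration with made-up scores, not the paper's code):

```python
def rank_of(target_score, other_scores):
    """1-based rank of the target among candidates (higher score = better)."""
    return 1 + sum(s > target_score for s in other_scores)

# Hypothetical scores for (h', r, t) over candidate heads h'
scores = {"h": 0.9, "h1": 0.95, "h2": 0.4}  # "h1" outranks the test head "h"
known_true = {"h1"}  # (h1, r, t) appears in train/valid/test

raw_rank = rank_of(scores["h"], [s for e, s in scores.items() if e != "h"])
filtered_rank = rank_of(scores["h"], [s for e, s in scores.items()
                                      if e != "h" and e not in known_true])
# Raw ranking penalizes the model for ranking a known-true triple above the
# test triple; filtering removes that competitor before ranking.
assert raw_rank == 2 and filtered_rank == 1
```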
Following [39], we generated one negative example per positive example for WN18 and 10 negative examples per positive example for FB15k. We computed the filtered MRR of our model over the validation set every 50 iterations for WN18 and every 100 iterations for FB15k and selected the iteration that resulted in the best validation filtered MRR. The best embedding size and λ values on WN18 for SimplE-ignr were 200 and 0.001 respectively, and for SimplE were 200 and 0.03. The best embedding size and λ values on FB15k for SimplE-ignr were 200 and 0.03 respectively, and for SimplE were 200 and 0.1.

6.1 Entity Prediction Results

Table 1 shows the results of our experiments. Both SimplE-ignr and SimplE perform well compared to the existing baselines on both datasets. On WN18, SimplE-ignr and SimplE perform as well as ComplEx, a state-of-the-art tensor factorization model. On FB15k, SimplE outperforms the existing baselines and gives state-of-the-art results among tensor factorization approaches. SimplE (and SimplE-ignr) work especially well on this dataset in terms of filtered MRR and hit@1, so SimplE tends to get its first prediction correct.

The table shows that models with many parameters (e.g., NTN and STransE) do not perform well on these datasets, as they probably overfit. Translational approaches generally have an inferior performance compared to other approaches, partly due to their representation restrictions mentioned in Proposition 2. As an example, for the friendship relation in FB15k, if an entity e1 is friends with 20 other entities and another entity e2 is friends with only one of those 20, then according to Proposition 2 translational approaches force e2 to be friends with the other 19 entities as well (the same goes for, e.g., netflix genre in FB15k and has part in WN18).
The table also shows that bilinear approaches tend to have better performances compared to translational and deep learning approaches. Even DistMult, the simplest bilinear approach, outperforms many translational and deep learning approaches despite not being fully expressive. We believe the simplicity of the embeddings and of the scoring function is a key property for the success of SimplE.

Table 2: Background Knowledge Used in Section 6.2.

Rule 1: (ei, hyponym, ej) ∈ ζ ⇔ (ej, hypernym, ei) ∈ ζ
Rule 2: (ei, memberMeronym, ej) ∈ ζ ⇔ (ej, memberHolonym, ei) ∈ ζ
Rule 3: (ei, instanceHyponym, ej) ∈ ζ ⇔ (ej, instanceHypernym, ei) ∈ ζ
Rule 4: (ei, hasPart, ej) ∈ ζ ⇔ (ej, partOf, ei) ∈ ζ
Rule 5: (ei, memberOfDomainTopic, ej) ∈ ζ ⇔ (ej, synsetDomainTopicOf, ei) ∈ ζ
Rule 6: (ei, memberOfDomainUsage, ej) ∈ ζ ⇔ (ej, synsetDomainUsageOf, ei) ∈ ζ
Rule 7: (ei, memberOfDomainRegion, ej) ∈ ζ ⇔ (ej, synsetDomainRegionOf, ei) ∈ ζ
Rule 8: (ei, similarTo, ej) ∈ ζ ⇔ (ej, similarTo, ei) ∈ ζ

6.2 Incorporating background knowledge

When background knowledge is available, we might expect a knowledge graph not to include information that is implied by that background knowledge; methods that do not incorporate the background knowledge can then never learn such implied links. In Section 5.2, we showed how background knowledge that can be formulated in terms of three types of rules can be incorporated into SimplE embeddings. To test this empirically, we conducted an experiment on WN18 in which we incorporated several such rules into the embeddings as outlined in Propositions 3, 4, and 5. The rules can be found in Table 2.
As can be seen in Table 2, most of the rules are of the form ∀ei, ej ∈ E : (ei, r1, ej) ∈ ζ ⇔ (ej, r2, ei) ∈ ζ. For (possibly identical) relations r1 and r2 participating in such a rule, if both (ei, r1, ej) and (ej, r2, ei) are in the training set, one of them is redundant because one can be inferred from the other. We removed redundant triples from the training set by randomly removing one of the two triples in the training set that could be inferred from the other one based on the background rules. Removing redundant triples reduced the number of triples in the training set from (approximately) 141K to (approximately) 90K, almost a 36% reduction in size. Note that this experiment provides an upper bound on how much background knowledge can improve the performance of a SimplE model.

We trained SimplE-ignr and SimplE (with tied parameters according to the rules) on this new training dataset with the best hyper-parameters found in the previous experiment. We refer to these two models as SimplE-ignr-bk and SimplE-bk. We also trained other SimplE-ignr and SimplE models on this dataset, but without incorporating the rules into the embeddings. As a sanity check, we also trained a ComplEx model over this new dataset. We found that the filtered MRRs for SimplE-ignr, SimplE, and ComplEx were respectively 0.221, 0.384, and 0.275. For SimplE-ignr-bk and SimplE-bk, the filtered MRRs were 0.772 and 0.776 respectively, substantially higher than the case without background knowledge. In terms of hit@k measures, SimplE-ignr gave 0.219, 0.220, and 0.224 for hit@1, hit@3 and hit@10 respectively. These numbers were 0.334, 0.404, and 0.482 for SimplE, and 0.254, 0.280 and 0.313 for ComplEx. For SimplE-ignr-bk, these numbers were 0.715, 0.809 and 0.877, and for SimplE-bk they were 0.715, 0.818 and 0.883, also substantially higher than the models without background knowledge.
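The weight tying for rules of the form (ei, r1, ej) ∈ ζ ⇔ (ej, r2, ei) ∈ ζ can be sketched as follows. This is our own minimal illustration of the idea (the precise statements are in Propositions 3-5 of the paper), using Rule 1 of Table 2:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)

# Each entity e has two vectors (h_e, t_e); each relation r has (v_r, v_r_inv)
ent = {e: (rng.normal(size=d), rng.normal(size=d)) for e in ("e1", "e2")}
rel = {"hypernym": (rng.normal(size=d), rng.normal(size=d))}

# Rule 1: (ei, hyponym, ej) <=> (ej, hypernym, ei). Instead of learning
# separate parameters for hyponym, tie them to hypernym's, swapped.
v, v_inv = rel["hypernym"]
rel["hyponym"] = (v_inv, v)

def simple_score(e1, r, e2):
    """SimplE score: average of the forward and inverse CP terms."""
    (h1, t1), (h2, t2) = ent[e1], ent[e2]
    vr, vr_inv = rel[r]
    return 0.5 * (np.sum(h1 * vr * t2) + np.sum(h2 * vr_inv * t1))

# With this tying, the rule holds for every entity pair by construction:
assert np.isclose(simple_score("e1", "hyponym", "e2"),
                  simple_score("e2", "hypernym", "e1"))
```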
The obtained results validate that background knowledge can be effectively incorporated into SimplE embeddings to improve its performance.

7 Conclusion

We proposed a simple, interpretable, fully expressive bilinear model for knowledge graph completion. We showed that our model, called SimplE, performs very well empirically and has several interesting properties. For instance, three types of background knowledge can be incorporated into SimplE by tying the embeddings. In the future, SimplE could be improved, or could help improve relational learning, in several ways including: 1- building ensembles of SimplE models as [18] do for DistMult, 2- adding SimplE to the relation-level ensembles of [44], 3- explicitly modelling the analogical structures of relations as in [23], 4- using [8]'s 1-N scoring approach to generate many negative triples for a positive triple (Trouillon et al. [39] show that generating more negative triples improves accuracy), 5- combining SimplE with symbolic approaches (e.g., with [19]) to improve property prediction, 6- combining SimplE with (or using SimplE as a sub-component in) techniques from other categories of relational learning as [33] do with ComplEx, 7- incorporating other types of background knowledge (e.g., entailment) into SimplE embeddings.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247-1250. ACM, 2008.

[3] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Irreflexive and hierarchical relations as translations.
arXiv preprint arXiv:1304.7158, 2013.

[4] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787-2795, 2013.

[5] George Cybenko. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:183-192, 1989.

[6] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. NIPS Workshop on AKBC, 2017.

[7] Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1-189, 2016.

[8] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

[9] Boyang Ding, Quan Wang, Bin Wang, and Li Guo. Improving knowledge graph embedding using simple constraints. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

[10] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In ACM SIGKDD, pages 601-610. ACM, 2014.

[11] Jun Feng, Minlie Huang, Mingdong Wang, Mantong Zhou, Yu Hao, and Xiaoyan Zhu. Knowledge graph embedding by flexible translation. In KR, pages 557-560, 2016.

[12] Lise Getoor and Ben Taskar. Introduction to statistical relational learning. MIT Press, 2007.

[13] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, and Li Guo. Jointly embedding knowledge graphs and logical rules.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 192-202, 2016.

[14] Katsuhiko Hayashi and Masashi Shimbo. On the equivalence of holographic and complex embeddings for link prediction. arXiv preprint arXiv:1702.05563, 2017.

[15] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics, 6(1-4):164-189, 1927.

[16] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.

[17] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL (1), pages 687-696, 2015.

[18] Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.

[19] Seyed Mehran Kazemi and David Poole. RelNN: A deep neural model for relational learning. In AAAI, 2018.

[20] Ni Lao and William W Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53-67, 2010.

[21] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. EMNLP, 2015.

[22] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181-2187, 2015.

[23] Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embeddings. AAAI, 2018.

[24] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[25] Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Regularizing knowledge graph embeddings via equivalence and inversion axioms.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 668-683. Springer, 2017.

[26] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. STransE: a novel embedding model of entities and relationships in knowledge bases. In NAACL-HLT, 2016.

[27] Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.

[28] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809-816, 2011.

[29] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning for linked data. In World Wide Web, pages 271-280. ACM, 2012.

[30] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In NIPS, pages 1179-1187, 2014.

[31] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11-33, 2016.

[32] Maximilian Nickel, Lorenzo Rosasco, Tomaso A Poggio, et al. Holographic embeddings of knowledge graphs. In AAAI, pages 1955-1961, 2016.

[33] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In NIPS, pages 3791-3803, 2017.

[34] Tim Rocktäschel, Matko Bošnjak, Sameer Singh, and Sebastian Riedel. Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 45-49, 2014.

[35] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.

[36] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling.
Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593-607. Springer, 2018.

[37] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.

[38] Théo Trouillon and Maximilian Nickel. Complex and holographic embeddings of knowledge graphs: a comparison. arXiv preprint arXiv:1707.01475, 2017.

[39] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071-2080, 2016.

[40] Théo Trouillon, Christopher R Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. arXiv preprint arXiv:1702.06879, 2017.

[41] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112-1119, 2014.

[42] Quan Wang, Bin Wang, Li Guo, et al. Knowledge base completion using embeddings and rules. In International Joint Conference on Artificial Intelligence, pages 1859-1866, 2015.

[43] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724-2743, 2017.

[44] Yanjie Wang, Rainer Gemulla, and Hui Li. On multi-relational link prediction with bilinear models. AAAI, 2018.

[45] Zhuoyu Wei, Jun Zhao, Kang Liu, Zhenyu Qi, Zhengya Sun, and Guanhua Tian. Large-scale knowledge base completion: Inferring via grounding network sampling over selected instances. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1331-1340. ACM, 2015.

[46] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng.
Embedding entities and relations for learning and inference in knowledge bases. ICLR, 2015.

[47] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In CVPR, volume 1, page 5, 2017.