{"title": "Modelling Relational Data using Bayesian Clustered Tensor Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1821, "page_last": 1828, "abstract": "We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us \u201cunderstand\u201d a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.", "full_text": "Modelling Relational Data using Bayesian Clustered Tensor Factorization\n\nIlya Sutskever\nUniversity of Toronto\nilya@cs.utoronto.ca\n\nRuslan Salakhutdinov\nMIT\nrsalakhu@mit.edu\n\nJoshua B. Tenenbaum\nMIT\njbt@mit.edu\n\nAbstract\n\nWe consider the problem of learning probabilistic models for complex relational\nstructures between various types of objects. A model can help us \u201cunderstand\u201d a\ndataset of relational facts in at least two ways, by finding interpretable structure\nin the data, and by supporting predictions, or inferences about whether particular\nunobserved relations are likely to be true. 
Often there is a tradeoff between these\ntwo aims: cluster-based models yield more easily interpretable representations,\nwhile factorization-based approaches have given better predictive performance on\nlarge data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF)\nmodel, which embeds a factorized representation of relations in a nonparametric\nBayesian clustering framework. Inference is fully Bayesian but scales well to\nlarge data sets. The model simultaneously discovers interpretable clusters and\nyields predictive performance that matches or beats previous probabilistic models\nfor relational data.\n\n1 Introduction\n\nLearning with relational data, or sets of propositions of the form (object, relation, object), has been\nimportant in a number of areas of AI and statistical data analysis. AI researchers have proposed that\nby storing enough everyday relational facts and generalizing appropriately to unobserved\npropositions, we might capture the essence of human common sense. For instance, given propositions such\nas (cup, used-for, drinking), (cup, can-contain, juice), (cup, can-contain, water), (cup, can-contain,\ncoffee), (glass, can-contain, juice), (glass, can-contain, water), (glass, can-contain, wine), and so\non, we might also infer the propositions (glass, used-for, drinking), (glass, can-contain, coffee), and\n(cup, can-contain, wine). Modelling relational data is also important for more immediate\napplications, including problems arising in social networks [2], bioinformatics [16], and collaborative\nfiltering [18].\n\nWe approach these problems using probabilistic models that define a joint distribution over the truth\nvalues of all conceivable relations. Such a model defines a joint distribution over the binary variables\nT(a, r, b) \u2208 {0, 1}, where a and b are objects, r is a relation, and the variable T(a, r, b) determines\nwhether the relation (a, r, b) is true. 
Given a set of true relations S = {(a, r, b)}, the model predicts\nthat a new relation (a, r, b) is true with probability P(T(a, r, b) = 1|S).\nIn addition to making predictions on new relations, we also want to understand the data, that is, to\nfind a small set of interpretable laws that explains a large fraction of the observations. By introducing\nhidden variables over simple hypotheses, the posterior distribution over the hidden variables will\nconcentrate on the laws the data is likely to obey, while the nature of the laws depends on the model.\nFor example, the Infinite Relational Model (IRM) [8] represents simple laws consisting of partitions\nof objects and partitions of relations. To decide whether the relation (a, r, b) is valid, the IRM simply\nchecks that the clusters to which a, r, and b belong are compatible. The main advantage of the IRM\nis its ability to extract meaningful partitions of objects and relations from the observational data,\nwhich greatly facilitates exploratory data analysis. More elaborate proposals consider models over\nmore powerful laws (e.g., first-order formulas with noise models or multiple clusterings), which are\ncurrently less practical due to the computational difficulty of their inference problems [7, 6, 9].\n\nModels based on matrix or tensor factorization [18, 19, 3] have the potential of making better\npredictions than interpretable models of similar complexity, as we demonstrate in our experimental results\nsection. Factorization models learn a distributed representation for each object and each relation,\nand make predictions by taking appropriate inner products. 
Their strength lies in the relative ease of\ntheir continuous (rather than discrete) optimization, and in their excellent predictive performance.\nHowever, it is often hard to understand and analyze the learned latent structure.\n\nThe tension between interpretability and predictive power is unfortunate: it is clearly better to have\na model that has both strong predictive power and interpretability. We address this problem by\nintroducing the Bayesian Clustered Tensor Factorization (BCTF) model, which combines good\ninterpretability with excellent predictive power. Specifically, similarly to the IRM, the BCTF model\nlearns a partition of the objects and a partition of the relations, so that the truth-value of a relation\n(a, r, b) depends primarily on the compatibility of the clusters to which a, r, and b belong. At the\nsame time, every entity has a distributed representation: each object a is assigned the two vectors\n$a_L$, $a_R$ (one for a being a left argument in a relation and one for it being a right argument), and\na relation r is assigned the matrix R. Given the distributed representations, the truth of a relation\n(a, r, b) is determined by the value of $a_L^\\top R b_R$, while the object partition encourages the objects\nwithin a cluster to have similar distributed representations (and similarly for relations).\n\nThe experiments show that the BCTF model achieves better predictive performance than a number\nof related probabilistic relational models, including the IRM, on several datasets. The model is\nscalable, and we apply it to the Movielens [15] and the Conceptnet [10] datasets. We also examine\nthe structure found in BCTF\u2019s clusters and learned vectors. 
Finally, our results provide an\nexample where a Bayesian model substantially outperforms the corresponding MAP\nestimate for large sparse datasets with minimal manual hyperparameter selection.\n\n2 The Bayesian Clustered Tensor Factorization (BCTF)\n\nWe begin with a simple tensor factorization model. Suppose that we have a fixed finite set of objects\n$\\mathcal{O}$ and a fixed finite set of relations $\\mathcal{R}$. For each object $a \\in \\mathcal{O}$ the model maintains two vectors\n$a_L, a_R \\in \\mathbb{R}^d$ (the left and the right arguments of the relation), and for each relation $r \\in \\mathcal{R}$ it\nmaintains a matrix $R \\in \\mathbb{R}^{d \\times d}$, where d is the dimensionality of the model. Given a setting of\nthese parameters (collectively denoted by $\\theta$), the model independently chooses the truth-value of\neach relation (a, r, b) from the distribution $P(T(a, r, b) = 1 \\mid \\theta) = 1/(1 + \\exp(-a_L^\\top R b_R))$. In\nparticular, given a set of known relations S, we can learn the parameters by maximizing a penalized\nlog likelihood $\\log P(S \\mid \\theta) - \\mathrm{Reg}(\\theta)$. The necessity of having a pair of parameters $a_L$, $a_R$, instead\nof a single distributed representation a, will become clear later.\nNext, we define a prior over the vectors $\\{a_L\\}$, $\\{a_R\\}$, and $\\{R\\}$. Specifically, the model defines a\nprior distribution over partitions of objects and partitions of relations using the Chinese Restaurant\nProcess. Once the partitions are chosen, each cluster C samples its own prior mean and prior\ndiagonal covariance, which are then used to independently sample vectors $\\{a_L, a_R : a \\in C\\}$ that\nbelong to cluster C (and similarly for the relations, where we treat R as a $d^2$-dimensional vector).\nAs a result, objects within a cluster have similar distributed representations. When the clusters are\nsufficiently tight, the value of $a_L^\\top R b_R$ is mainly determined by the clusters to which a, r, and b\nbelong. 
At the same time, the distributed representations help generalization, because they can\nrepresent graded similarities between clusters and fine differences between objects in the same cluster.\nThus, given a set of relations, we expect the model to find both meaningful clusters of objects and\nrelations, as well as predictive distributed representations.\nMore formally, assume that $\\mathcal{O} = \\{a_1, \\ldots, a_N\\}$ and $\\mathcal{R} = \\{r_1, \\ldots, r_M\\}$. The model is defined as\nfollows:\n\n$$P(\\mathrm{obs}, \\theta, c, \\alpha, \\alpha_{DP}) = P(\\mathrm{obs} \\mid \\theta, \\sigma^2)\\, P(\\theta \\mid c, \\alpha)\\, P(c \\mid \\alpha_{DP})\\, P(\\alpha_{DP}, \\alpha, \\sigma^2) \\tag{1}$$\n\nwhere the observed data obs is a set of triples and their truth values $\\{(a, r, b), t\\}$; the variable\n$c = \\{c_{obj}, c_{rel}\\}$ contains the cluster assignments (partitions) of the objects and the relations; the variable\n$\\theta = \\{a_L, a_R, R\\}$ consists of the distributed representations of the objects and the relations, and\n$\\{\\sigma^2, \\alpha, \\alpha_{DP}\\}$ are the model hyperparameters. Two of the above terms are given by\n\n$$P(\\mathrm{obs} \\mid \\theta) = \\prod_{\\{(a,r,b),t\\} \\in \\mathrm{obs}} \\mathcal{N}(t \\mid a_L^\\top R b_R, \\sigma^2) \\tag{2}$$\n\n$$P(c \\mid \\alpha_{DP}) = \\mathrm{CRP}(c_{obj} \\mid \\alpha_{DP})\\, \\mathrm{CRP}(c_{rel} \\mid \\alpha_{DP}) \\tag{3}$$\n\nwhere $\\mathcal{N}(t \\mid \\mu, \\sigma^2)$ denotes the Gaussian distribution with mean $\\mu$ and variance $\\sigma^2$, and $\\mathrm{CRP}(c \\mid \\alpha)$\ndenotes the probability of the partition induced by c under the Chinese Restaurant Process with\nconcentration parameter $\\alpha$. The Gaussian likelihood in Eq. 2 is far from ideal for modelling binary\ndata, but, similarly to [19, 18], we use it instead of the logistic function because it makes the model\nconjugate and Gibbs sampling easier.\n\nFigure 1: A schematic diagram of the model, where the arcs represent the object clusters and the\nvectors within each cluster are similar. The model predicts T(a, r, b) with $a_L^\\top R b_R$.\n\nDefining $P(\\theta \\mid c, \\alpha)$ takes a little more work. 
Given the partitions, the sets of parameters $\\{a_L\\}$, $\\{a_R\\}$,\nand $\\{R\\}$ become independent, so\n\n$$P(\\theta \\mid c, \\alpha) = P(\\{a_L\\} \\mid c_{obj}, \\alpha_{obj})\\, P(\\{a_R\\} \\mid c_{obj}, \\alpha_{obj})\\, P(\\{R\\} \\mid c_{rel}, \\alpha_{rel}) \\tag{4}$$\n\nThe distribution over the relation-vectors is given by\n\n$$P(\\{R\\} \\mid c_{rel}, \\alpha_{rel}) = \\prod_{k=1}^{|c_{rel}|} \\int_{\\mu, \\Sigma} \\prod_{i : c_{rel,i} = k} \\mathcal{N}(R_i \\mid \\mu, \\Sigma)\\, dP(\\mu, \\Sigma \\mid \\alpha_{rel}) \\tag{5}$$\n\nwhere $|c_{rel}|$ is the number of clusters in the partition $c_{rel}$. This is precisely a Dirichlet process\nmixture model [13]. We further place a Gaussian-Inverse-Gamma prior over $(\\mu, \\Sigma)$:\n\n$$P(\\mu, \\Sigma \\mid \\alpha_{rel}) = P(\\mu \\mid \\Sigma)\\, P(\\Sigma \\mid \\alpha_{rel}) = \\mathcal{N}(\\mu \\mid 0, \\Sigma) \\prod_{d'} \\mathrm{IG}(\\sigma^2_{d'} \\mid \\alpha_{rel}, 1) \\tag{6}$$\n\n$$\\propto \\exp\\left(-\\sum_{d'} \\frac{\\mu^2_{d'}/2 + 1}{\\sigma^2_{d'}}\\right) \\prod_{d'} \\left(\\sigma^2_{d'}\\right)^{-0.5 - \\alpha_{rel} - 1} \\tag{7}$$\n\nwhere $\\Sigma$ is a diagonal matrix whose entries are $\\sigma^2_{d'}$, the variable $d'$ ranges over the dimensions of\n$R_i$ (so $1 \\le d' \\le d^2$), and $\\mathrm{IG}(x \\mid \\alpha, \\beta)$ denotes the inverse-Gamma distribution with shape parameter\n$\\alpha$ and scale parameter $\\beta$. This prior makes many useful expectations analytically computable. The\nterms $P(\\{a_L\\} \\mid c_{obj}, \\alpha_{obj})$ and $P(\\{a_R\\} \\mid c_{obj}, \\alpha_{obj})$ are defined analogously to Eq. 5.\n\nFinally, we place an improper $P(x) \\propto x^{-1}$ scale-uniform prior over each hyperparameter\nindependently.\n\nInference\n\nWe now briefly describe the MCMC algorithm used for inference. Before starting the Markov chain,\nwe find a MAP estimate of the model parameters using the method of conjugate gradient (but we\ndo not optimize over the partitions). The MAP estimate is then used to initialize the Markov chain.\nEach step of the Markov chain consists of a number of internal steps. 
First, given the parameters\n$\\theta$, the chain updates $c = (c_{rel}, c_{obj})$ using a collapsed Gibbs sampling sweep and a step of the\nsplit-and-merge algorithm (where the launch state was obtained with two sweeps of Gibbs sampling\nstarting from a uniformly random cluster assignment) [5]. Next, it samples the mean\nand covariance of each cluster from their posterior, which is the distribution proportional to the term being integrated in\nEq. 5.\nNext, the Markov chain samples the parameters $\\{a_L\\}$ given $\\{a_R\\}$, $\\{R\\}$, and the cluster posterior\nmeans and covariances. This step is tractable since the conditional distribution over the object\nvectors $\\{a_L\\}$ is Gaussian and factorizes into the product of conditional distributions over the individual\nobject vectors. This conditional independence is important, since it tends to make the Markov chain\nmix faster, and is a direct consequence of each object a having two vectors, $a_L$ and $a_R$. If each\nobject a were only associated with a single vector a (and not $a_L$, $a_R$), the conditional distribution\nover {a} would not factorize, which in turn would require the use of a slower sequential Gibbs\nsampler. In the current setting, we can further speed up the inference by sampling from conditional\ndistributions in parallel. The speedup could be substantial, particularly when the number of objects\nis large. The disadvantage of using two vectors for each object is that the model cannot as easily\ncapture the \u201cposition-independent\u201d properties of the object, especially in the sparse regime.\nSampling $\\{a_L\\}$ from the Gaussian takes time proportional to $d^3 \\cdot N$, where N is the number of\nobjects. While we do the same for $\\{a_R\\}$, we run standard hybrid Monte Carlo to update the\nmatrices $\\{R\\}$ using 10 leapfrog steps of size $10^{-5}$ [12]. Each matrix, which we treat as a vector,\nhas $d^2$ dimensions, so direct sampling from the Gaussian distribution scales as $d^6 \\cdot M$, which is slow\neven for small values of d (e.g. 20). 
Finally, we make a small symmetric multiplicative change to\neach hyperparameter and accept or reject its new value according to the Metropolis-Hastings rule.\n\n3 Evaluation\n\nIn this section, we show that the BCTF model has excellent predictive power and that it finds\ninterpretable clusters by applying it to five datasets and comparing its performance to the IRM [8] and\nthe Multiple Relational Clustering (MRC) model [9]. We also compare BCTF to its simpler\ncounterpart: a Bayesian Tensor Factorization (BTF) model, where all the objects and the relations belong to\na single cluster. The Bayesian Tensor Factorization model is a generalization of Bayesian\nprobabilistic matrix factorization [17], and is closely related to many other existing tensor-factorization\nmethods [3, 14, 1]. In what follows, we describe the datasets, report the predictive performance\nof our algorithm and of the competing algorithms, and examine the structure discovered by BCTF.\n\n3.1 Description of the Datasets\nWe use three of the four datasets used by [8] and [9], namely, the Animals, the UML, and the Kinship\ndatasets, as well as the Movielens [15] and the Conceptnet [10] datasets.\n\n1. The animals dataset consists of 50 animals and 85 binary attributes. The dataset is a fully\nobserved matrix, so there is only one relation.\n\n2. The kinship dataset consists of kinship relationships among the members of the Alyawarra\ntribe [4]. The dataset contains 104 people and 26 relations. This dataset is dense and has\n104\u00b726\u00b7104 = 281216 observations, most of which are 0.\n\n3. The UML dataset [11] consists of 135 medical terms and 49 relations. The dataset is also\nfully observed and has 135\u00b749\u00b7135 = 893025 (mostly 0) observations.\n\n4. The Movielens [15] dataset consists of 1000209 observed integer ratings of 6041 movies\non a scale from 1 to 5, which are rated by 3953 users. The dataset is 95.8% sparse.\n\n5. 
The Conceptnet dataset [10] is a collection of common-sense assertions collected from the\nweb. It consists of about 112135 \u201ccommon-sense\u201d assertions such as (hockey, is-a, sport).\nThere are 19 relations and 17571 objects. To make our experiments faster, we used only\nthe 7000 most frequent objects, which resulted in 82062 true facts. For the negative data,\nwe sampled twice as many random object-relation-object triples and used them as the false\nfacts. As a result, there were 246186 binary observations in this dataset. The dataset is\n99.9% sparse.\n\n3.2 Experimental Protocol\n\nTo facilitate comparison with [9], we conducted our experiments the following way. First, we\nnormalized each dataset so the mean of its observations was 0. Next, we created 10 random train/test\nsplits, where 10% of the data was used for testing. For the Conceptnet and the Movielens datasets,\nwe used only two train/test splits and at most 30 clusters, which made our experiments faster. We\nreport test root mean squared error (RMSE) and the area under the precision recall curve (AUC) [9].\nFor the IRM\u00b9 we make predictions as follows. The IRM partitions the data into blocks; we compute\nthe smoothed mean of the observed entries of each block and use it to predict the test entries in the\nsame block.\n\n            animals         kinship         UML             movielens       conceptnet\nalgorithm   RMSE    AUC     RMSE    AUC     RMSE    AUC     RMSE    AUC     RMSE    AUC\nMAP20       0.467   0.78    0.122   0.82    0.033   0.96    0.899   \u2013       0.536   0.57\nMAP40       0.528   0.68    0.110   0.90    0.024   0.98    0.933   \u2013       0.614   0.48\nBTF20       0.337   0.85    0.122   0.82    0.033   0.96    0.835   \u2013       0.275   0.93\nBCTF20      0.331   0.86    0.122   0.82    0.033   0.96    0.836   \u2013       0.278   0.93\nBTF40       0.338   0.86    0.108   0.90    0.024   0.98    0.834   \u2013       0.267   0.94\nBCTF40      0.336   0.86    0.108   0.90    0.024   0.98    0.836   \u2013       0.260   0.94\nIRM [8]     0.382   0.75    0.140   0.66    0.054   0.70    \u2013       \u2013       \u2013       \u2013\nMRC [9]     \u2013       0.81    \u2013       0.85    \u2013       0.98    \u2013       \u2013       \u2013       \u2013\n\nTable 1: A quantitative evaluation of the algorithms using 20 and 40 dimensional vectors. We report the\nperformance of the following algorithms: the MAP-based Tensor Factorization, the Bayesian Tensor\nFactorization (BTF) with MCMC (where all objects belong to a single cluster), the full Bayesian Clustered Tensor\nFactorization (BCTF), the IRM [8] and the MRC [9].\n\nAnimal clusters:\nO1  killer whale, blue whale, humpback whale, seal, walrus, dolphin\nO2  antelope, dalmatian, horse, giraffe, zebra, deer\nO3  mole, hamster, rabbit, mouse\nO4  hippopotamus, elephant, rhinoceros\nO5  spider monkey, gorilla, chimpanzee\nO6  moose, ox, sheep, buffalo, pig, cow\nO7  beaver, squirrel, otter\nO8  Persian cat, skunk, chihuahua, collie\nO9  grizzly bear, polar bear\n\nAttribute clusters:\nF1  flippers, strainteeth, swims, fish, arctic, coastal, ocean, water\nF2  hooves, vegetation, grazer, plains, fields\nF3  paws, claws, solitary\nF4  bulbous, slow, inactive\nF5  jungle, tree\nF6  big, strong, group\nF7  walks, quadrapedal, ground\nF8  small, weak, nocturnal, hibernate, nestspot\nF9  tail, newworld, oldworld, timid\n\nFigure 2: Results on the Animals dataset. Left: The discovered clusters. Middle: The biclustering of the\nfeatures. Right: The covariance of the distributed representations of the animals (bottom) and their attributes\n(top).\n\n3.3 Results\n\nWe first applied BCTF to the Animals, Kinship, and the UML datasets using 20 and 40-dimensional\nvectors. 
Table 1 shows that BCTF substantially outperforms IRM and MRC in terms of both RMSE\nand AUC. In fact, for the Kinship and the UML datasets, the simple tensor factorization model\ntrained by MAP performs as well as BTF and BCTF. This happens because for these datasets the\nnumber of observations is much larger than the number of parameters, so there is little uncertainty\nabout the true parameter values. However, the Animals dataset is considerably smaller, so BTF\nperforms better than MAP, and BCTF performs even better than the BTF model.\n\nWe then applied BCTF to the Movielens and the Conceptnet datasets. We found that the MAP\nestimates suffered from significant overfitting, and that the fully Bayesian models performed much\nbetter. This is important because both datasets are sparse, which makes overfitting difficult to\ncombat. For the extremely sparse Conceptnet dataset, the BCTF model further improved upon the simpler\nBTF model. We do not report results for the IRM, because the existing off-the-shelf implementation\ncould not handle these large datasets.\n\n\u00b9 The code is available at http://www.psy.cmu.edu/~ckemp/code/irm.html\n\nFigure 3 (panels a to g): Results on the Kinship dataset. Left: The covariance of the distributed representations $\\{a_L\\}$ learned\nfor each person. Right: The biclustering of a subset of the relations.\n\n2  Amino Acid, Peptide, or Protein, Biomedical or Dental Material, Carbohydrate, . . .\n3  Amphibian, Animal, Archaeon, Bird, Fish, Human, . . .\n4  Antibiotic, Biologically Active Substance, Enzyme, Hazardous or Poisonous Substance, Hormone, . . .\n5  Biologic Function, Cell Function, Genetic Function, Mental Process, . . .\n6  Classification, Drug Delivery Device, Intellectual Product, Manufactured Object, . . .\n7  Body Part, Organ, Cell, Cell Component, . . .\n8  Alga, Bacterium, Fungus, Plant, Rickettsia or Chlamydia, Virus\n9  Age Group, Family Group, Group, Patient or Disabled Group, . . .\n10  Cell / Molecular Dysfunction, Disease or Syndrome, Model of Disease, Mental Dysfunction, . . .\n11  Daily or Recreational Activity, Educational Activity, Governmental Activity, . . .\n12  Environmental Effect of Humans, Human-caused Phenomenon or Process, . . .\n13  Acquired Abnormality, Anatomical Abnormality, Congenital Abnormality, Injury or Poisoning\n14  Health Care Related Organization, Organization, Professional Society, . . .\n\nRelation panels: Affects; interacts with; causes\n\nFigure 4: Results on the medical UML dataset. Left: The covariance of the distributed representations $\\{a_L\\}$\nlearned for each object. Right: The inferred clusters, along with the biclustering of a subset of the relations.\n\nWe now examine the latent structure discovered by the BCTF model by inspecting a sample\nproduced by the Markov chain. Figure 2 shows some of the clusters learned by the model on the\nAnimals dataset. It also shows the biclustering, as well as the covariance of the distributed\nrepresentations of the animals and their attributes, sorted by their clusters. By inspecting the covariance,\nwe can determine the clusters that are tight and the affinities between the clusters. Indeed, the\ncluster structure is reflected in the block-diagonal structure of the covariance matrix. For example, the\ncovariance of the attributes (see Fig. 2, top-right panel) shows that cluster F1, containing {flippers,\nstrainteeth, swims}, is similar to cluster F4, containing {bulbous, slow, inactive}, but is very dissimilar\nto F2, containing {hooves, vegetation, grazer}.\nFigure 3 displays the learned representation for the Kinship dataset. The kinship dataset has 104\npeople with complex relationships between them: each person belongs to one of four sections,\nwhich strongly constrains the other relations. For example, a person in section 1 has a father in\nsection 3 and a mother in section 4 (see [8, 4] for more details). 
After learning, each cluster was\nalmost completely localized in gender, section, and age. For clarity of presentation, we sort the\nclusters first by their section, then by their gender, and finally by their age, as done in [8]. Figure 3\n(panels (b-g)) displays some of the relations according to this clustering, and panel (a) shows the\ncovariance between the vectors $\\{a_L\\}$ learned for each person. The four sections are clearly visible\nin the covariance structure of the distributed representations.\nFigure 4 shows the inferred clusters for the medical UML dataset. For example, the model discovers\nthat {Amino Acid, Peptide, Protein} Affects {Biologic Function, Cell Function, Genetic Function},\nwhich is also similar, according to the covariance, to {Cell Dysfunction, Disease, Mental\nDysfunction}. Qualitatively, the clustering appears to be on par with that of the IRM on all the datasets, but\nthe BCTF model is able to predict held-out relations much better.\n\n1  Independence Day; Lost World: Jurassic Park The; Stargate; Twister; Air Force One; . . .\n2  Star Wars: Episode IV - A New Hope; Silence of the Lambs The; Raiders of the Lost Ark; . . .\n3  Shakespeare in Love; Shawshank Redemption The; Good Will Hunting; As Good As It Gets; . . .\n4  Fargo; Being John Malkovich; Annie Hall; Talented Mr. Ripley The; Taxi Driver; . . .\n5  E.T. the Extra-Terrestrial; Ghostbusters; Babe; Bug\u2019s Life A; Toy Story 2; . . .\n6  Jurassic Park; Saving Private Ryan; Matrix The; Back to the Future; Forrest Gump; . . .\n7  Dick Tracy; Space Jam; Teenage Mutant Ninja Turtles; Superman III; Last Action Hero; . . .\n8  Monty Python and the Holy Grail; Twelve Monkeys; Beetlejuice; Ferris Bueller\u2019s Day Off; . . .\n9  Lawnmower Man The; Event Horizon; Howard the Duck; Beach The; Rocky III; Bird on a Wire; . . .\n10  Terminator 2: Judgment Day; Terminator The; Alien; Total Recall; Aliens; Jaws; Predator; . . .\n11  Groundhog Day; Who Framed Roger Rabbit?; Usual Suspects The; Airplane!; Election; . . .\n12  Back to the Future Part III; Honey I Shrunk the Kids; Crocodile Dundee; Rocketeer The; . . .\n13  Sixth Sense The; Braveheart; Princess Bride The; Batman; Willy Wonka and the Chocolate Factory; . . .\n14  Men in Black; Galaxy Quest; Clueless; Chicken Run; Mask The; Pleasantville; Mars Attacks!; . . .\n15  Austin Powers: The Spy Who Shagged Me; There\u2019s Something About Mary; Austin Powers: . . .\n16  Breakfast Club The; American Pie; Blues Brothers The; Animal House; Rocky; Blazing Saddles; . . .\n17  American Beauty; Pulp Fiction; GoodFellas; Fight Club; South Park: Bigger Longer and Uncut; . . .\n18  Star Wars: Episode V - The Empire Strikes Back; Star Wars: Episode VI - Return of the Jedi; . . .\n19  Edward Scissorhands; Blair Witch Project The; Nightmare Before Christmas The; James and the Giant Peach; . . .\n20  Mighty Peking Man\n\nFigure 5: Results on the Movielens dataset. Left: The covariance between the movie vectors. Right: The\ninferred clusters.\n\n1  feel good; make money; make music; sweat; earn money; check your mind; pass time;\n2  weasel; Apple trees; Ferrets; heifer; beaver; ficus; anemone; blowfish; koala; triangle;\n3  boredom; anger; cry; buy ticket; laughter; fatigue; joy; panic; turn on tv; patience;\n4  enjoy; danger; hurt; bad; competition; cold; recreate; bored; health; excited;\n5  car; book; home; build; store; school; table; office; music; desk; cabinet; pleasure;\n6  library; New York; shelf; cupboard; living room; pocket; a countryside; utah; basement;\n7  city; bathroom; kitchen; restaurant; bed; park; refrigerate; closet; street; bedroom;\n8  think; sleep; sit; play games; examine; listen music; read books; buy; wait; play sport;\n9  Housework; attend class; go jogging; chat with friends; visit museums; ride bikes;\n10  fox; small dogs; wiener dog; bald eagle; crab; boy; bee; monkey; shark; sloth; marmot;\n11  fun; relax; entertain; learn; eat; exercise; sex; food; work; talk; play; party; travel;\n12  state; a large city; act; big city; Europe; maryland; colour; corner; need; pennsylvania;\n13  play music; go; look; drink water; cut; plan; rope; fair; chew; wear; body part; fail;\n14  green; lawyer; recycle; globe; Rat; sharp points; silver; empty; Bob Dylan; dead fish;\n15  potato; comfort; knowledge; move; inform; burn; men; vegetate; fear; accident; murder;\n16  garbage; thought; orange; handle; penis; diamond; wing; queen; nose; sidewalk; pad;\n17  sand; bacteria; robot; hall; basketball court; support; Milky Way; chef; sheet of paper;\n18  dessert; pub; extinguish fire; fuel; symbol; cleanliness; lock the door; shelter; sphere;\n\nFigure 6: Results on the Conceptnet dataset. Left: The covariance of the learned $\\{a_L\\}$ vectors for each object.\nRight: The inferred clusters.\n\nFigures 5 and 6 display the learned clusters for the Movielens and the Conceptnet datasets. For the\nMovielens dataset, we show the most frequently-rated movies in each cluster where the clusters are\nsorted by size. We also show the covariance between the movie vectors which are sorted by the\nclusters, where we display only the 100 most frequently-rated movies per cluster. The covariance\nmatrix is aligned with the table on the right, making it easy to see how the clusters relate to each\nother. For example, according to the covariance structure, clusters 7 and 9, containing Hollywood\naction/adventure movies, are similar to each other but are dissimilar to cluster 8, which consists of\ncomedy/horror movies.\nFor the Conceptnet dataset, Fig. 6 displays the 100 most frequent objects per category. From the\ncovariance matrix, we can infer that clusters 8, 9, and 11, containing concepts associated with humans\ntaking actions, are very similar to each other, and are very dissimilar to cluster 10, which contains\nanimals. 
Observe that some clusters (e.g., clusters 2-6) are not crisp, which is reflected in the smaller\ncovariances between vectors in each of these clusters.\n\n4 Discussions and Conclusions\nWe introduced a new method for modelling relational data which is able to both discover meaningful\nstructure and generalize well. In particular, our results illustrate the predictive power of distributed\nrepresentations when applied to modelling relational data, since even simple tensor factorization\nmodels can sometimes outperform the more complex models. Indeed, for the kinship and the UML\ndatasets, the performance of the MAP-based tensor factorization was as good as the performance\nof the BCTF model, which is due to the density of these datasets: the number of observations was\nmuch larger than the number of parameters. On the other hand, for large sparse datasets, the BCTF\nmodel significantly outperformed its MAP counterpart, and in particular, it noticeably outperformed\nBTF on the Conceptnet dataset.\n\nA surprising aspect of the Bayesian model is the ease with which it worked after automatic\nhyperparameter selection was implemented. Furthermore, the model performs well even when the\ninitial MAP estimate is very poor, as was the case for the 40-dimensional models on the Conceptnet\ndataset. This is particularly important for large sparse datasets, since finding a good MAP estimate\nrequires careful cross-validation to select the regularization hyperparameters. Careful\nhyperparameter selection can be very labour-intensive because it requires careful training of a large number of\nmodels.\n\nAcknowledgments\nThe authors acknowledge the financial support from NSERC, Shell, NTT Communication Sciences\nLaboratory, AFOSR FA9550-07-1-0075, and AFOSR MURI.\n\nReferences\n[1] Edoardo Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. In NIPS, pages 33\u201340. MIT Press, 2008.\n\n[2] P.J. 
Carrington, J. Scott, and S. Wasserman. Models and methods in social network analysis. Cambridge University Press, 2005.\n\n[3] W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, 2009.\n\n[4] W. Denham. The Detection of Patterns in Alyawarra Nonverbal Behavior. PhD thesis, Department of Anthropology, University of Washington, 1973.\n\n[5] S. Jain and R.M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158\u2013182, 2004.\n\n[6] Y. Katz, N.D. Goodman, K. Kersting, C. Kemp, and J.B. Tenenbaum. Modeling Semantic Cognition as Logical Dimensionality Reduction. In Proceedings of Thirtieth Annual Meeting of the Cognitive Science Society, 2008.\n\n[7] C. Kemp, N.D. Goodman, and J.B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of Thirtieth Annual Meeting of the Cognitive Science Society, 2008.\n\n[8] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 381. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006.\n\n[9] S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th international conference on Machine learning, pages 433\u2013440. ACM New York, NY, USA, 2007.\n\n[10] H. Liu and P. Singh. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211\u2013226, 2004.\n\n[11] A.T. McCray. An upper-level ontology for the biomedical domain. Comparative and Functional Genomics, 4(1):80\u201384, 2003.\n\n[12] R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods, 1993.\n\n[13] R.M. Neal. 
Markov chain sampling methods for Dirichlet process mixture models. Journal of computational and graphical statistics, pages 249\u2013265, 2000.\n\n[14] Ian Porteous, Evgeniy Bart, and Max Welling. Multi-HDP: A non parametric bayesian model for tensor factorization. In Dieter Fox and Carla P. Gomes, editors, AAAI, pages 1487\u20131490. AAAI Press, 2008.\n\n[15] J. Riedl, J. Konstan, S. Lam, and J. Herlocker. Movielens collaborative filtering data set, 2006.\n\n[16] J.F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G.F. Berriz, F.D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human protein\u2013protein interaction network. Nature, 437(7062):1173\u20131178, 2005.\n\n[17] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th international conference on Machine learning, pages 880\u2013887. ACM New York, NY, USA, 2008.\n\n[18] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. Advances in neural information processing systems, 20, 2008.\n\n[19] R. Speer, C. Havasi, and H. Lieberman. AnalogySpace: Reducing the dimensionality of common sense knowledge. In Proceedings of AAAI, 2008.", "award": [], "sourceid": 897, "authors": [{"given_name": "Ilya", "family_name": "Sutskever", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}