{"title": "On the Model Shrinkage Effect of Gamma Process Edge Partition Models", "book": "Advances in Neural Information Processing Systems", "page_first": 397, "page_last": 405, "abstract": "The edge partition model (EPM) is a fundamental Bayesian nonparametric model for extracting an overlapping structure from binary matrix. The EPM adopts a gamma process ($\\Gamma$P) prior to automatically shrink the number of active atoms. However, we empirically found that the model shrinkage of the EPM does not typically work appropriately and leads to an overfitted solution. An analysis of the expectation of the EPM's intensity function suggested that the gamma priors for the EPM hyperparameters disturb the model shrinkage effect of the internal $\\Gamma$P. In order to ensure that the model shrinkage effect of the EPM works in an appropriate manner, we proposed two novel generative constructions of the EPM: CEPM incorporating constrained gamma priors, and DEPM incorporating Dirichlet priors instead of the gamma priors. Furthermore, all DEPM's model parameters including the infinite atoms of the $\\Gamma$P prior could be marginalized out, and thus it was possible to derive a truly infinite DEPM (IDEPM) that can be efficiently inferred using a collapsed Gibbs sampler. We experimentally confirmed that the model shrinkage of the proposed models works well and that the IDEPM indicated state-of-the-art performance in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed.", "full_text": "On the Model Shrinkage E\ufb00ect of\n\nGamma Process Edge Partition Models\n\nIku Ohama\u22c6\u2021\n\u22c6Panasonic Corp., Japan \u2020The Univ. 
of Tokyo, Japan ‡Hokkaido Univ., Japan

Hiroki Arimura‡    Issei Sato†    Takuya Kida‡

ohama.iku@jp.panasonic.com    sato@k.u-tokyo.ac.jp    {kida,arim}@ist.hokudai.ac.jp

Abstract

The edge partition model (EPM) is a fundamental Bayesian nonparametric model for extracting an overlapping structure from a binary matrix. The EPM adopts a gamma process (ΓP) prior to automatically shrink the number of active atoms. However, we empirically found that the model shrinkage of the EPM does not typically work appropriately and leads to an overfitted solution. An analysis of the expectation of the EPM's intensity function suggested that the gamma priors for the EPM hyperparameters disturb the model shrinkage effect of the internal ΓP. In order to ensure that the model shrinkage effect of the EPM works in an appropriate manner, we proposed two novel generative constructions of the EPM: the CEPM, incorporating constrained gamma priors, and the DEPM, incorporating Dirichlet priors instead of the gamma priors. Furthermore, all of the DEPM's model parameters, including the infinite atoms of the ΓP prior, could be marginalized out, and thus it was possible to derive a truly infinite DEPM (IDEPM) that can be efficiently inferred using a collapsed Gibbs sampler. We experimentally confirmed that the model shrinkage of the proposed models works well and that the IDEPM indicated state-of-the-art performance in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed.

1 Introduction

Discovering low-dimensional structure from a binary matrix is an important problem in relational data analysis. Bayesian nonparametric priors, such as the Dirichlet process (DP) [1] and the hierarchical Dirichlet process (HDP) [2], have been widely applied to construct statistical models with an automatic model shrinkage effect [3, 4].
Recently, more advanced stochastic processes such as the Indian buffet process (IBP) [5] have enabled the construction of statistical models for discovering overlapping structures [6, 7], wherein each individual in a data matrix can belong to multiple latent classes.

Among these models, the edge partition model (EPM) [8] is a fundamental Bayesian nonparametric model for extracting an overlapping latent structure underlying a given binary matrix. The EPM considers latent positive random counts only for the non-zero entries of a given binary matrix and factorizes the count matrix into two non-negative matrices and a non-negative diagonal matrix. The link probability of the EPM for an entry is defined by transforming the product of the non-negative matrices into a probability; thus, the EPM can capture overlapping structures in a noisy-OR manner [6]. By incorporating a gamma process (ΓP) as a prior for the diagonal matrix, the number of active atoms of the EPM shrinks automatically according to the given data. Furthermore, by truncating the infinite atoms of the ΓP at a finite number, all parameters and hyperparameters of the EPM can be inferred using a closed-form Gibbs sampler.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: (Best viewed in color.) A synthetic example: (a) synthetic 90 × 90 data (white corresponds to one, and black to zero); (b) EPM solution, which extracted many unexpected latent classes (98 active classes); and (c) the proposed IDEPM solution, which successfully found the expected 5 overlapping latent classes. In (b) and (c), non-zero entries are colored to indicate their most probable assignment to the latent classes.

Although the EPM is well designed to capture an overlapping structure and has an attractive affinity with closed-form posterior inference, it involves a critical drawback in its model shrinkage mechanism. As we experimentally show in Sec. 5, we found that the model shrinkage effect of the EPM does not typically work in an appropriate manner. Figure 1 shows a synthetic example. As shown in Fig. 1a, there are five overlapping latent classes (white blocks). However, as shown in Fig. 1b, the EPM overestimates the number of active atoms (classes) and overfits the data.

In this paper, we analyze the undesired property of the EPM's model shrinkage mechanism and propose novel generative constructions for the EPM to overcome the aforementioned disadvantage. As shown in Fig. 1c, the IDEPM proposed in this paper successfully shrinks unnecessary atoms. More specifically, we make three major contributions in this paper.

(1) We analyze the generative construction of the EPM and find a property that disturbs its model shrinkage effect (Sec. 3). We derive the expectation of the EPM's intensity function (Theorem 1), which is the total sum of the infinite atoms for an entry. From the derived expectation, we obtain a new finding that the gamma priors for the EPM's hyperparameters disturb the model shrinkage effect of the internal ΓP (Theorem 2). That is, the derived expectation is expressed as a product of terms related to the ΓP and the other gamma priors. Thus, there is no guarantee that the expected number of active atoms is finite.

(2) Based on the analysis of the EPM's intensity function, we propose two novel constructions of the EPM: the CEPM, incorporating constrained gamma priors (Sec. 4.1), and the DEPM, incorporating Dirichlet priors instead of the gamma priors (Sec. 4.2).
The model shrinkage effect of the CEPM and DEPM works appropriately because the expectation of their intensity functions depends only on the ΓP prior (Sec. 4.1 and Theorem 3 in Sec. 4.2).

(3) Furthermore, for the DEPM, all model parameters, including the infinite atoms of the ΓP prior, can be marginalized out (Theorem 4). Therefore, we can derive a truly infinite DEPM (IDEPM), which has a closed-form marginal likelihood without truncating the infinite atoms, and which can be efficiently inferred using a collapsed Gibbs sampler [9] (Sec. 4.3).

2 The Edge Partition Model (EPM)

In this section, we review the EPM [8] as a baseline model. Let x be an I × J binary matrix, where the entry in the i-th row and j-th column is represented by $x_{i,j} \in \{0, 1\}$. In order to extract an overlapping structure underlying x, the EPM [8] considers a non-negative matrix factorization problem on latent Poisson counts as follows:

    $x_{i,j} = \mathbb{I}(m_{i,j,\cdot} \geq 1), \quad m_{i,j,\cdot} \mid U, V, \lambda \sim \text{Poisson}\left(\sum_{k=1}^{K} U_{i,k} V_{j,k} \lambda_k\right)$,   (1)

where U and V are I × K and J × K non-negative matrices, respectively, and λ is a K × K non-negative diagonal matrix. Note that $\mathbb{I}(\cdot)$ is 1 if the predicate holds and is zero otherwise. The latent counts m take positive values only for edges (non-zero entries) of the given binary matrix, and the generative model for each positive count is equivalently expressed as a sum of K Poisson random variables as $m_{i,j,\cdot} = \sum_k m_{i,j,k}$, $m_{i,j,k} \sim \text{Poisson}(U_{i,k} V_{j,k} \lambda_k)$. This is the reason why the above model is called the edge partition model. Marginalizing m out from Eq. (1), the generative model of the EPM can be equivalently rewritten as $x_{i,j} \mid U, V, \lambda \sim \text{Bernoulli}(1 - \prod_k e^{-U_{i,k} V_{j,k} \lambda_k})$.
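As a quick numerical sanity check, the following sketch (our own code, not the authors'; all variable names are illustrative) verifies for a single entry $(i, j)$ that the Poisson-count construction of Eq. (1) marginalizes to the Bernoulli noisy-OR link probability:

```python
import numpy as np

# Minimal sketch (not the authors' code): check numerically that the Poisson
# count construction in Eq. (1) marginalizes to the Bernoulli noisy-OR link
# probability 1 - prod_k exp(-U_ik * V_jk * lambda_k) for one entry (i, j).
rng = np.random.default_rng(0)
K = 3
U_i = rng.gamma(1.0, 1.0, size=K)    # row factors U_{i,1..K}
V_j = rng.gamma(1.0, 1.0, size=K)    # column factors V_{j,1..K}
lam = rng.gamma(1.0, 1.0, size=K)    # diagonal atom weights lambda_1..K

rate = float(np.sum(U_i * V_j * lam))                      # Poisson rate of m_{i,j,.}
analytic = 1.0 - float(np.prod(np.exp(-U_i * V_j * lam)))  # noisy-OR link probability

m = rng.poisson(rate, size=200_000)  # latent edge counts m_{i,j,.}
empirical = float(np.mean(m >= 1))   # x_{i,j} = I(m_{i,j,.} >= 1)
```

The empirical frequency of $m_{i,j,\cdot} \geq 1$ matches the noisy-OR probability up to Monte Carlo error, since $1 - \prod_k e^{-r_k} = 1 - e^{-\sum_k r_k}$.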
As $e^{-U_{i,k} V_{j,k} \lambda_k} \in [0, 1]$ denotes the probability that a Poisson random variable with mean $U_{i,k} V_{j,k} \lambda_k$ equals zero, the EPM can capture an overlapping structure in a noisy-OR manner [6].

In order to complete the Bayesian hierarchical model of the EPM, gamma priors are adopted as $U_{i,k} \sim \text{Gamma}(a_1, b_1)$ and $V_{j,k} \sim \text{Gamma}(a_2, b_2)$, where $a_1, a_2$ are shape parameters and $b_1, b_2$ are rate parameters of the gamma distribution, respectively. Furthermore, a gamma process (ΓP) is incorporated as a Bayesian nonparametric prior for λ to make the EPM automatically shrink its number of atoms K. Let $\text{Gamma}(\gamma_0 / T, c_0)$ denote a truncated ΓP with a concentration parameter $\gamma_0$ and a rate parameter $c_0$, where T denotes a truncation level that should be set large enough to ensure a good approximation to the true ΓP. Then, the diagonal elements of λ are drawn as $\lambda_k \sim \text{Gamma}(\gamma_0 / T, c_0)$ for $k \in \{1, \ldots, T\}$.

The posterior inference for all parameters and hyperparameters of the EPM can be performed using a Gibbs sampler (detailed in Appendix A). Thanks to the conjugacy between the gamma and Poisson distributions, given $m_{i,\cdot,k} = \sum_j m_{i,j,k}$ and $m_{\cdot,j,k} = \sum_i m_{i,j,k}$, posterior sampling for $U_{i,k}$ and $V_{j,k}$ is straightforward. As the ΓP prior is approximated by a gamma distribution, posterior sampling for $\lambda_k$ can also be performed straightforwardly. Given U, V, and λ, a posterior sample for $m_{i,j,\cdot}$ can be simulated using the zero-truncated Poisson (ZTP) distribution [10]. Finally, we can obtain the sufficient statistics $m_{i,j,k}$ by partitioning $m_{i,j,\cdot}$ into T atoms using a multinomial distribution. Furthermore, all hyperparameters of the EPM (i.e., $\gamma_0$, $c_0$, $a_1$, $a_2$, $b_1$, and $b_2$) can also be sampled by assuming a gamma hyperprior $\text{Gamma}(e_0, f_0)$. Thanks to the conjugacy between gamma distributions, posterior sampling for $c_0$, $b_1$, and $b_2$ is straightforward.
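The gamma-Poisson conjugacy invoked above yields standard closed-form updates. As the paper's Appendix A is not reproduced here, the following is a hedged sketch of one such full conditional, for $U_{i,k}$ (the function name and array shapes are our own):

```python
import numpy as np

# Hedged sketch of the standard gamma-Poisson conjugate update the text
# relies on (the paper's Appendix A is not reproduced here). Under
# m_ijk ~ Poisson(U_ik * V_jk * lambda_k) with U_ik ~ Gamma(a1, rate b1),
# the full conditional of U_ik is Gamma(a1 + m_{i,.,k}, b1 + sum_j V_jk * lambda_k).
def sample_U(rng, m, V, lam, a1, b1):
    """m: (I, J, K) latent counts; V: (J, K); lam: (K,). Returns (I, K)."""
    shape = a1 + m.sum(axis=1)            # a1 + m_{i,.,k}, shape (I, K)
    rate = b1 + (V * lam).sum(axis=0)     # b1 + sum_j V_jk * lambda_k, shape (K,)
    return rng.gamma(shape, 1.0 / rate)   # numpy parameterizes gamma by scale

rng = np.random.default_rng(1)
I, J, K = 4, 5, 3
m = rng.poisson(1.0, size=(I, J, K))
V = rng.gamma(1.0, 1.0, size=(J, K))
lam = rng.gamma(1.0, 1.0, size=K)
U_new = sample_U(rng, m, V, lam, a1=0.5, b1=0.5)
```

The update for $V_{j,k}$ is symmetric, with the roles of rows and columns exchanged.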
For the remaining hyperparameters, we can construct closed-form Gibbs samplers using data augmentation techniques [11, 12, 2].

3 Analysis for Model Shrinkage Mechanism

The EPM is well designed to capture an overlapping structure with a simple Gibbs inference. However, the EPM involves a critical drawback in its model shrinkage mechanism.

For the EPM, a ΓP prior is incorporated as a prior for the non-negative diagonal matrix as $\lambda_k \sim \text{Gamma}(\gamma_0 / T, c_0)$. From the form of the truncated ΓP, thanks to the additive property of independent gamma random variables, the total sum of $\lambda_k$ over the countably infinite atoms follows a gamma distribution as $\sum_{k=1}^{\infty} \lambda_k \sim \text{Gamma}(\gamma_0, c_0)$, wherein the intensity function of the ΓP has a finite expectation as $E[\sum_{k=1}^{\infty} \lambda_k] = \frac{\gamma_0}{c_0}$. Therefore, the ΓP has a regularization mechanism that automatically shrinks the number of atoms according to the given observations.

However, as experimentally shown in Sec. 5, the model shrinkage mechanism of the EPM does not work appropriately. More specifically, the EPM often overestimates the number of active atoms and overfits the data. Thus, we analyze the intensity function of the EPM to reveal the reason why the model shrinkage mechanism does not work appropriately.

Theorem 1. The expectation of the EPM's intensity function $\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k$ for an entry $(i, j)$ is finite and can be expressed as follows:

    $E\left[\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k\right] = \frac{a_1}{b_1} \times \frac{a_2}{b_2} \times \frac{\gamma_0}{c_0}$.   (2)

Proof. As U, V, and λ are independent of each other, the expected value operator is multiplicative for the EPM's intensity function. Using the multiplicativity and the law of total expectation, the proof is completed as $E[\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k] = \sum_{k=1}^{\infty} E[U_{i,k}] E[V_{j,k}] E[\lambda_k] = \frac{a_1}{b_1} \times \frac{a_2}{b_2} \times E[\sum_{k=1}^{\infty} \lambda_k] = \frac{a_1}{b_1} \times \frac{a_2}{b_2} \times \frac{\gamma_0}{c_0}$.

As Eq. (2) in Theorem 1 shows, the expectation of the EPM's intensity function is expressed as a product of the individual expectations of a ΓP and two gamma distributions. This causes an undesirable property for the model shrinkage effect of the EPM. From Theorem 1, another important theorem about the EPM's model shrinkage effect is obtained as follows:

Theorem 2. Given an arbitrary non-negative constant C, even if the expectation of the EPM's intensity function in Eq. (2) is fixed as $E[\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k] = C$, there exist cases in which the model shrinkage effect of the ΓP prior disappears.

Proof. Substituting $E[\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k] = C$ into Eq. (2), we obtain $C = \frac{a_1}{b_1} \times \frac{a_2}{b_2} \times \frac{\gamma_0}{c_0}$. Since $a_1$, $a_2$, $b_1$, and $b_2$ are gamma random variables, even if the expectation of the EPM's intensity function, C, is fixed, $\frac{\gamma_0}{c_0}$ can take an arbitrary value such that the equation $C = \frac{a_1}{b_1} \times \frac{a_2}{b_2} \times \frac{\gamma_0}{c_0}$ holds. Hence, $\gamma_0$ can take an arbitrarily large value such that $\gamma_0 = T \times \hat{\gamma}_0$. This implies that the ΓP prior for the EPM degrades to a gamma distribution without a model shrinkage effect, as $\lambda_k \sim \text{Gamma}(\gamma_0 / T, c_0) = \text{Gamma}(\hat{\gamma}_0, c_0)$.

Theorem 2 indicates that the EPM might overestimate the number of active atoms and lead to overfitted solutions.

4 Proposed Generative Constructions

We describe our novel generative constructions for the EPM with an appropriate model shrinkage effect. According to the analysis described in Sec.
3, the model shrinkage mechanism of the EPM does not work because the expectation of the EPM's intensity function has an undesirable redundancy. This finding motivates the proposal of new generative constructions, in which the expectation of the intensity function depends only on the ΓP prior.

First, we propose a naive extension of the original EPM using constrained gamma priors (termed the CEPM). Next, we propose another generative construction for the EPM by incorporating Dirichlet priors instead of gamma priors (termed the DEPM). Furthermore, for the DEPM, we derive a truly infinite DEPM (termed the IDEPM) by marginalizing out all model parameters, including the infinite atoms of the ΓP prior.

4.1 CEPM

In order to ensure that the EPM's intensity function depends solely on the ΓP prior, a naive way is to introduce constraints on the hyperparameters of the gamma priors. In the CEPM, the rate parameters of the gamma priors are constrained as $b_1 = C_1 \times a_1$ and $b_2 = C_2 \times a_2$, respectively, where $C_1 > 0$ and $C_2 > 0$ are arbitrary constants. Based on the aforementioned constraints and Theorem 1, the expectation of the intensity function for the CEPM depends only on the ΓP prior as $E[\sum_{k=1}^{\infty} U_{i,k} V_{j,k} \lambda_k] = \frac{\gamma_0}{C_1 C_2 c_0}$.

The posterior inference for the CEPM can be performed using a Gibbs sampler in a manner similar to that for the EPM. However, we cannot derive closed-form samplers only for $a_1$ and $a_2$ because of the constraints. Thus, in this paper, posterior sampling for $a_1$ and $a_2$ is performed using grid Gibbs sampling [13] (see Appendix B for details).

4.2 DEPM

We have another strategy to construct the EPM with an efficient model shrinkage effect by re-parametrizing the factorization problem. Let us denote the transpose of a matrix A by $A^\top$. According to the generative model of the EPM in Eq.
(1), the original generative process for the counts m can be viewed as a matrix factorization $m \approx U \lambda V^\top$. It is clear that the optimal solution of the factorization problem is not unique. Let $\Lambda_1$ and $\Lambda_2$ be arbitrary K × K non-negative diagonal matrices. If a solution $m \approx U \lambda V^\top$ is globally optimal, then another solution $m \approx (U \Lambda_1)(\Lambda_1^{-1} \lambda \Lambda_2)(V \Lambda_2^{-1})^\top$ is also optimal. In order to ensure that the EPM has only one optimal solution, we re-parametrize the original factorization problem to an equivalent constrained factorization problem as follows:

    $m \approx \phi \lambda \psi^\top$,   (3)

where φ denotes an I × K non-negative matrix with $l_1$-constraints as $\sum_i \phi_{i,k} = 1, \forall k$. Similarly, ψ denotes a J × K non-negative matrix with $l_1$-constraints as $\sum_j \psi_{j,k} = 1, \forall k$. This parameterization ensures the uniqueness of the optimal solution for a given m because each column of φ and ψ is constrained such that it is defined on a simplex.

According to the factorization in Eq. (3), by incorporating Dirichlet priors instead of gamma priors, the generative construction for m of the DEPM is as follows:

    $m_{i,j,\cdot} \mid \phi, \psi, \lambda \sim \text{Poisson}\left(\sum_{k=1}^{T} \phi_{i,k} \psi_{j,k} \lambda_k\right)$,
    $\{\phi_{i,k}\}_{i=1}^{I} \mid \alpha_1 \sim \text{Dirichlet}(\overbrace{\alpha_1, \ldots, \alpha_1}^{I})$,  $\{\psi_{j,k}\}_{j=1}^{J} \mid \alpha_2 \sim \text{Dirichlet}(\overbrace{\alpha_2, \ldots, \alpha_2}^{J})$,  $\lambda_k \mid \gamma_0, c_0 \sim \text{Gamma}(\gamma_0 / T, c_0)$.   (4)

Theorem 3. The expectation of the DEPM's intensity function $\sum_{k=1}^{\infty} \phi_{i,k} \psi_{j,k} \lambda_k$ depends solely on the ΓP prior and can be expressed as $E[\sum_{k=1}^{\infty} \phi_{i,k} \psi_{j,k} \lambda_k] = \frac{\gamma_0}{I J c_0}$.

Proof. The expectations of the Dirichlet random variables $\phi_{i,k}$ and $\psi_{j,k}$ are $\frac{1}{I}$ and $\frac{1}{J}$, respectively. Similar to the proof of Theorem 1, using the multiplicativity of independent random variables and the law of total expectation, the proof is completed as $E[\sum_{k=1}^{\infty} \phi_{i,k} \psi_{j,k} \lambda_k] = \sum_{k=1}^{\infty} E[\phi_{i,k}] E[\psi_{j,k}] E[\lambda_k] = \frac{1}{I} \times \frac{1}{J} \times E[\sum_{k=1}^{\infty} \lambda_k] = \frac{\gamma_0}{I J c_0}$.

Note that, if we set the constants $C_1 = I$ and $C_2 = J$ for the CEPM in Sec. 4.1, then the expectation of the intensity function for the CEPM is equivalent to that for the DEPM in Theorem 3. Thus, in order to ensure the fairness of comparisons, we set $C_1 = I$ and $C_2 = J$ for the CEPM in the experiments.

As the Gibbs sampler for φ and ψ can be derived straightforwardly, the posterior inference for all parameters and hyperparameters of the DEPM can also be performed via a closed-form Gibbs sampler (detailed in Appendix C). Differently from the CEPM, the $l_1$-constraints in the DEPM ensure the uniqueness of its optimal solution. Thus, the inference for the DEPM is considered to be more efficient than that for the CEPM.

4.3 Truly Infinite DEPM (IDEPM)

One remarkable property of the DEPM is that we can derive a fully marginalized likelihood function. Similar to the beta-negative binomial topic model [13], we consider a joint distribution for the $m_{i,j,\cdot}$ Poisson customers and their assignments $z_{i,j} = \{z_{i,j,s}\}_{s=1}^{m_{i,j,\cdot}} \in \{1, \cdots, T\}^{m_{i,j,\cdot}}$ to T tables as $P(m_{i,j,\cdot}, z_{i,j} \mid \phi, \psi, \lambda) = P(m_{i,j,\cdot} \mid \phi, \psi, \lambda) \prod_{s=1}^{m_{i,j,\cdot}} P(z_{i,j,s} \mid m_{i,j,\cdot}, \phi, \psi, \lambda)$. Thanks to the $l_1$-constraints we introduced in Eq. (3), the joint distribution $P(m, z \mid \phi, \psi, \lambda)$ has a fully factorized form (see Lemma 1 in Appendix D). Therefore, marginalizing φ, ψ, and λ out according to the prior construction in Eq.
(4), we obtain an analytical marginal likelihood $P(m, z)$ for the truncated DEPM (see Appendix D for a detailed derivation).

Furthermore, by taking $T \to \infty$, we can derive a closed-form marginal likelihood for the truly infinite version of the DEPM (termed the IDEPM). In a manner similar to that in [14], we consider the likelihood function for the partition $[z]$ instead of the assignments z. Assume we have $K_+$ of T atoms for which $m_{\cdot,\cdot,k} = \sum_i \sum_j m_{i,j,k} > 0$, and a partition of $M (= \sum_i \sum_j m_{i,j,\cdot})$ customers into $K_+$ subsets. Then, the joint marginal likelihood of the IDEPM for $[z]$ and m is given by the following theorem, with the proof provided in Appendix D:

Theorem 4. The marginal likelihood function of the IDEPM is defined as $P(m, [z])_\infty = \lim_{T \to \infty} P(m, [z]) = \lim_{T \to \infty} \frac{T!}{(T - K_+)!} P(m, z)$, and can be derived as follows:

    $P(m, [z])_\infty = \prod_{i=1}^{I} \prod_{j=1}^{J} \frac{1}{m_{i,j,\cdot}!} \times \prod_{k=1}^{K_+} \frac{\Gamma(I \alpha_1)}{\Gamma(I \alpha_1 + m_{\cdot,\cdot,k})} \prod_{i=1}^{I} \frac{\Gamma(\alpha_1 + m_{i,\cdot,k})}{\Gamma(\alpha_1)} \times \prod_{k=1}^{K_+} \frac{\Gamma(J \alpha_2)}{\Gamma(J \alpha_2 + m_{\cdot,\cdot,k})} \prod_{j=1}^{J} \frac{\Gamma(\alpha_2 + m_{\cdot,j,k})}{\Gamma(\alpha_2)} \times \gamma_0^{K_+} \left(\frac{c_0}{c_0 + 1}\right)^{\gamma_0} \prod_{k=1}^{K_+} \frac{\Gamma(m_{\cdot,\cdot,k})}{(c_0 + 1)^{m_{\cdot,\cdot,k}}}$,   (5)

where $m_{i,\cdot,k} = \sum_j m_{i,j,k}$, $m_{\cdot,j,k} = \sum_i m_{i,j,k}$, and $m_{\cdot,\cdot,k} = \sum_i \sum_j m_{i,j,k}$. Note that $\Gamma(\cdot)$ denotes the gamma function.

From Eq. (5) in Theorem 4, we can derive a collapsed Gibbs sampler [9] to perform posterior inference for the IDEPM.
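In practice, Eq. (5) is evaluated in log space. The following sketch (our own code, not the authors') computes $\log P(m, [z])_\infty$ from a tensor of latent counts, assuming every retained atom k has $m_{\cdot,\cdot,k} > 0$ as Theorem 4 requires:

```python
from math import lgamma, log
import numpy as np

# Sketch (our own): evaluate the log of the IDEPM marginal likelihood in
# Eq. (5) from an (I, J, K+) tensor of latent counts m, assuming each of the
# K+ retained atoms is active, i.e. m_{.,.,k} > 0 (Gamma(0) is undefined).
def idepm_log_marginal(m, alpha1, alpha2, gamma0, c0):
    I, J, Kp = m.shape
    mi = m.sum(axis=1)                        # m_{i,.,k}, shape (I, K+)
    mj = m.sum(axis=0)                        # m_{.,j,k}, shape (J, K+)
    mk = m.sum(axis=(0, 1))                   # m_{.,.,k}, shape (K+,)
    ll = -sum(lgamma(int(m[i, j].sum()) + 1)  # prod_ij 1/m_{i,j,.}!
              for i in range(I) for j in range(J))
    for k in range(Kp):
        ll += lgamma(I * alpha1) - lgamma(I * alpha1 + mk[k])
        ll += sum(lgamma(alpha1 + mi[i, k]) - lgamma(alpha1) for i in range(I))
        ll += lgamma(J * alpha2) - lgamma(J * alpha2 + mk[k])
        ll += sum(lgamma(alpha2 + mj[j, k]) - lgamma(alpha2) for j in range(J))
        ll += lgamma(mk[k]) - mk[k] * log(c0 + 1.0)   # per-atom GammaP term
    ll += Kp * log(gamma0) + gamma0 * log(c0 / (c0 + 1.0))
    return float(ll)

rng = np.random.default_rng(3)
m = rng.poisson(1.0, size=(4, 5, 3))
m[0, 0, :] += 1                               # ensure every atom is active
ll = idepm_log_marginal(m, alpha1=0.5, alpha2=0.5, gamma0=1.0, c0=1.0)
```

A useful property test is that the value is invariant under a permutation of the atom indices, as expected for a likelihood of the partition $[z]$.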
Since φ, ψ, and λ have been marginalized out, the only latent variables we have to update are m and z.

Sampling z: Given m, similar to the Chinese restaurant process (CRP) [15], the posterior probability that $z_{i,j,s}$ is assigned to $k^*$ is given as follows:

    $P(z_{i,j,s} = k^* \mid z^{\backslash (ijs)}, m) \propto \begin{cases} m_{\cdot,\cdot,k^*}^{\backslash (ijs)} \times \dfrac{\alpha_1 + m_{i,\cdot,k^*}^{\backslash (ijs)}}{I \alpha_1 + m_{\cdot,\cdot,k^*}^{\backslash (ijs)}} \times \dfrac{\alpha_2 + m_{\cdot,j,k^*}^{\backslash (ijs)}}{J \alpha_2 + m_{\cdot,\cdot,k^*}^{\backslash (ijs)}} & \text{if } m_{\cdot,\cdot,k^*}^{\backslash (ijs)} > 0, \\ \gamma_0 \times \dfrac{1}{I} \times \dfrac{1}{J} & \text{if } m_{\cdot,\cdot,k^*}^{\backslash (ijs)} = 0, \end{cases}$   (6)

where the superscript $\backslash (ijs)$ denotes that the corresponding statistics are computed excluding the s-th customer of entry (i, j).

Sampling m: Given z, the posteriors for φ and ψ are simulated as $\{\phi_{i,k}\}_{i=1}^{I} \mid - \sim \text{Dirichlet}(\{\alpha_1 + m_{i,\cdot,k}\}_{i=1}^{I})$ and $\{\psi_{j,k}\}_{j=1}^{J} \mid - \sim \text{Dirichlet}(\{\alpha_2 + m_{\cdot,j,k}\}_{j=1}^{J})$ for $k \in \{1, \ldots, K_+\}$. Furthermore, the posterior sampling of $\lambda_k$ for the $K_+$ active atoms can be performed as $\lambda_k \mid - \sim \text{Gamma}(m_{\cdot,\cdot,k}, c_0 + 1)$. Therefore, similar to the sampler for the EPM [8], we can update m as follows:

    $m_{i,j,\cdot} \mid \phi, \psi, \lambda \sim \begin{cases} \delta(0) & \text{if } x_{i,j} = 0, \\ \text{ZTP}\left(\sum_{k=1}^{K_+} \phi_{i,k} \psi_{j,k} \lambda_k\right) & \text{if } x_{i,j} = 1, \end{cases}$   (7)

    $\{m_{i,j,k}\}_{k=1}^{K_+} \mid m_{i,j,\cdot}, \phi, \psi, \lambda \sim \text{Multinomial}\left(m_{i,j,\cdot}; \left(\dfrac{\phi_{i,k} \psi_{j,k} \lambda_k}{\sum_{k'=1}^{K_+} \phi_{i,k'} \psi_{j,k'} \lambda_{k'}}\right)_{k=1}^{K_+}\right)$,   (8)

where $\delta(0)$ denotes a point mass at zero.

Sampling hyperparameters: We can construct a closed-form Gibbs sampler for all hyperparameters of the IDEPM assuming a gamma prior $\text{Gamma}(e_0, f_0)$.
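The collapsed assignment step of Eq. (6) can be sketched as follows (our own helper, not the authors' code; the sufficient statistics are assumed to already exclude the customer being re-seated):

```python
import numpy as np

# Sketch (our own helper) of the collapsed assignment step in Eq. (6):
# seating probabilities for one customer of entry (i, j). mi, mj, and mk hold
# m_{i,.,k}, m_{.,j,k}, and m_{.,.,k} over the currently active atoms,
# computed excluding the customer being re-seated.
def assignment_probs(mi, mj, mk, I, J, alpha1, alpha2, gamma0):
    mi, mj, mk = (np.asarray(a, dtype=float) for a in (mi, mj, mk))
    active = (mk
              * (alpha1 + mi) / (I * alpha1 + mk)
              * (alpha2 + mj) / (J * alpha2 + mk))  # weights of existing atoms
    new = gamma0 * (1.0 / I) * (1.0 / J)            # weight of a brand-new atom
    w = np.append(active, new)
    return w / w.sum()                              # normalized probabilities

probs = assignment_probs(mi=[3, 0], mj=[2, 1], mk=[10, 4],
                         I=6, J=5, alpha1=0.5, alpha2=0.5, gamma0=1.0)
```

The last entry of `probs` is the probability of spawning a new atom; sampling an index from `probs` with a categorical draw completes one re-seating move.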
Using the additive property of the ΓP, a posterior sample for the sum of $\lambda_k$ over the unused atoms is obtained as $\lambda_{\gamma_0} = \sum_{k'=K_+ + 1}^{\infty} \lambda_{k'} \mid - \sim \text{Gamma}(\gamma_0, c_0 + 1)$. Consequently, we obtain a closed-form posterior sampler for the rate parameter $c_0$ of the ΓP as $c_0 \mid - \sim \text{Gamma}(e_0 + \gamma_0, f_0 + \lambda_{\gamma_0} + \sum_{k=1}^{K_+} \lambda_k)$. For all remaining hyperparameters (i.e., $\alpha_1$, $\alpha_2$, and $\gamma_0$), we can derive posterior samplers from Eq. (5) using data augmentation techniques [12, 8, 2, 11] (detailed in Appendix E).

5 Experimental Results

In the previous sections, we theoretically analyzed the reason why the model shrinkage of the EPM does not work appropriately (Sec. 3) and proposed several novel constructions (i.e., CEPM, DEPM, and IDEPM) of the EPM with an efficient model shrinkage effect (Sec. 4).

The purpose of the experiments is to ascertain the following hypotheses:

(H1) The original EPM overestimates the number of active atoms and overfits the data. In contrast, the model shrinkage mechanisms of the CEPM and DEPM work appropriately. Consequently, the CEPM and DEPM outperform the EPM in generalization ability and link prediction accuracy.

(H2) Compared with the CEPM, the DEPM indicates better generalization ability and link prediction accuracy because of the uniqueness of the DEPM's optimal solution.

(H3) The IDEPM with the collapsed Gibbs sampler is superior to the DEPM in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed.

Datasets: The first dataset was the Enron [16] dataset, which comprises e-mails sent between 149 Enron employees. We extracted the e-mail transactions of September 2001 and constructed the Enron09 dataset. For this dataset, $x_{i,j} = 1 (0)$ was used to indicate whether an e-mail was, or was not, sent by the i-th employee to the j-th employee.
For larger datasets, we used the MovieLens [17] dataset, which comprises five-point-scale ratings of movies submitted by users. For this dataset, we set $x_{i,j} = 1$ when the rating was higher than three and $x_{i,j} = 0$ otherwise. We prepared two differently sized MovieLens datasets: MovieLens100K (943 users and 1,682 movies) and MovieLens1M (6,040 users and 3,706 movies). The densities of the Enron09, MovieLens100K, and MovieLens1M datasets were 0.016, 0.035, and 0.026, respectively.

[Figure 2 comprises nine panels, arranged per dataset (Enron09, MovieLens100K, MovieLens1M): the estimated number of active atoms K (panels a–c), the TDLL (panels d–f), and the TDAUC-PR (panels g–i), each plotted against the truncation level T for the IDEPM, DEPM-T, CEPM-T, and EPM-T.]

Figure 2: Calculated
measurements as functions of the truncation level T for each dataset.\nThe horizontal line in each \ufb01gure denotes the result obtained using the IDEPM.\n\nEvaluating Measures: We adopted three measurements to evaluate the performance of\nthe models. The \ufb01rst is the estimated number of active atoms K for evaluating the model\nshrinkage e\ufb00ect of each model. The second is the averaged Test Data Log Likelihood (TDLL)\nfor evaluating the generalization ability of each model. We calculated the averaged likelihood\nthat a test entry takes the actual value. For the third measurement, as many real-world\nbinary matrices are often sparse, we adopted the Test Data Area Under the Curve of the\nPrecision-Recall curve (TDAUC-PR) [18] to evaluate the link prediction ability. In order\nto calculate the TDLL and TDAUC-PR, we set all the selected test entries as zero during\nthe inference period, because binary observations for unobserved entries are not observed\nas missing values but are observed as zeros in many real-world situations.\n\nExperimental Settings: Posterior inference for the truncated models (i.e., EPM, CEPM,\nand DEPM) were performed using standard (non-collapsed) Gibbs sampler. Posterior infer-\nence for the IDEPM was performed using the collapsed Gibbs sampler derived in Sec. 4.3.\nFor all models, we also sampled all hyperparameters assuming the same gamma prior\n(Gamma(e0, f0)). For the purpose of fair comparison, we set hyper-hyperparameters as\ne0 = f0 = 0.01 throughout the experiments. We ran 600 Gibbs iterations for each model on\neach dataset and used the \ufb01nal 100 iterations to calculate the measurements. Furthermore,\nall reported measurements were averaged values obtained by 10-fold cross validation.\n\nResults: Hereafter, the truncated models are denoted as EPM-T , CEPM-T , and DEPM-T\nto specify the truncation level T . Figure 2 shows the calculated measurements.\n\n(H1) As shown in Figs. 
2a–c, the EPM overestimated the number of active atoms K for all datasets, especially for a large truncation level T. In contrast, the number of active atoms K for the CEPM-T and DEPM-T monotonically converges to a specific value. This result supports the analysis with respect to the relationship between the model shrinkage effect and the expectation of the EPM's intensity function, as discussed in Sec. 3. Consequently, as shown by the TDLL (Figs. 2d–f) and TDAUC-PR (Figs. 2g–i), the CEPM and DEPM outperformed the original EPM in both generalization ability and link prediction accuracy.

[Figure 3 comprises three panels, (a) Enron09, (b) MovieLens100K, and (c) MovieLens1M, plotting the TDLL over 600 Gibbs sampling iterations for the IDEPM and DEPM-128.]

Figure 3: (Best viewed in color.) The TDLL as a function of the Gibbs iterations.

[Figure 4 comprises three panels, (a) Enron09, (b) MovieLens100K, and (c) MovieLens1M, plotting the TDLL against the elapsed time in seconds for the IDEPM and the DEPM-T with T in {2, 4, 8, 16, 32, 64, 128}.]

Figure 4: (Best viewed in color.) The TDLL as a function of the elapsed time (in seconds).

(H2) As shown in Figs. 2a–c, the model shrinkage effect of the DEPM is stronger than that of the CEPM.
As a result, the DEPM significantly outperformed the CEPM in both generalization ability and link prediction accuracy (Figs. 2d–i). Although the CEPM slightly outperformed the EPM, the CEPM with a larger T tends to overfit the data. In contrast, the DEPM achieved its best performance with the largest truncation level (T = 128). Therefore, we confirmed that the uniqueness of the optimal solution in the DEPM is considerably important for achieving good generalization ability and link prediction accuracy.

(H3) As shown by the horizontal lines in Figs. 2d–i, the IDEPM achieved state-of-the-art scores for all datasets. Finally, the computational efficiency of the IDEPM was compared with that of the truncated DEPM. Figure 3 shows the TDLL as a function of the number of Gibbs iterations. In keeping with expectations, the IDEPM exhibited significantly better mixing than the DEPM for all datasets. Furthermore, Fig. 4 compares the convergence speed of the IDEPM and the DEPM with several truncation levels (T = {2, 4, 8, 16, 32, 64, 128}). As clearly shown in the figure, the IDEPM converged significantly faster than the DEPM at all truncation levels. Therefore, we confirmed that the IDEPM achieves state-of-the-art performance in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed.

6 Conclusions

In this paper, we analysed the model shrinkage effect of the EPM, a Bayesian nonparametric model for extracting an overlapping structure with an optimal dimension from binary matrices. We derived the expectation of the intensity function of the EPM, and showed that the redundancy of the EPM's intensity function disturbs its model shrinkage effect.
According to this finding, we proposed two novel generative constructions of the EPM (i.e., CEPM and DEPM) to ensure that its model shrinkage effect works appropriately. Furthermore, we derived a truly infinite version of the DEPM (i.e., IDEPM), which can be inferred using a collapsed Gibbs sampler without any approximation of the ΓP. We experimentally showed that the model shrinkage mechanism of the CEPM and DEPM works appropriately. Furthermore, we confirmed that the proposed IDEPM achieves state-of-the-art performance in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed. It is of interest to further investigate whether the truly infinite construction of the IDEPM can be applied to more complex and modern machine learning models, including deep belief networks [19] and tensor factorization models [20].

References

[1] Thomas S. Ferguson. “A Bayesian Analysis of Some Nonparametric Problems”. In: The Annals of Statistics 1.2 (1973), pp. 209–230.

[2] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. “Hierarchical Dirichlet Processes”. In: J. Am. Stat. Assoc. 101.476 (2006), pp. 1566–1581.

[3] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. “Learning Systems of Concepts with an Infinite Relational Model”. In: Proc. AAAI. Vol. 1. 2006, pp. 381–388.

[4] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. “Mixed Membership Stochastic Blockmodels”. In: J. Mach. Learn. Res. 9 (2008), pp. 1981–2014.

[5] Thomas L. Griffiths and Zoubin Ghahramani. “Infinite Latent Feature Models and the Indian Buffet Process”. In: Proc. NIPS. 2005, pp. 475–482.

[6] Morten Mørup, Mikkel N. Schmidt, and Lars Kai Hansen.
\u201cIn\ufb01nite Multiple Mem-\nbership Relational Modeling for Complex Networks\u201d. In: Proc. MLSP. 2011, pp. 1\u2013\n6.\n\n[7] Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. \u201cAn In\ufb01nite Latent\n\nAttribute Model for Network Data\u201d. In: Proc. ICML. 2012, pp. 1607\u20131614.\n\n[8] Mingyuan Zhou. \u201cIn\ufb01nite Edge Partition Models for Overlapping Community Detec-\n\ntion and Link Prediction\u201d. In: Proc. AISTATS. Vol. 38. 2015, pp. 1135\u20131143.\n\n[9] Jun S. Liu. \u201cThe Collapsed Gibbs Sampler in Bayesian Computations with Applica-\ntions to a Gene Regulation Problem\u201d. In: J. Am. Stat. Assoc. 89.427 (1994), pp. 958\u2013\n966.\n\n[10] Charles J. Geyer. Lower-Truncated Poisson and Negative Binomial Distributions.\nTech. rep. Working Paper Written for the Software R. University of Minnesota, MN\n(available: http://cran.r-project.org/web/packages/aster/vignettes/trunc.pdf), 2007.\n[11] David Newman, Arthur U. Asuncion, Padhraic Smyth, and Max Welling. \u201cDistributed\n\nAlgorithms for Topic Models\u201d. In: J. Mach. Learn. Res. 10 (2009), pp. 1801\u20131828.\n\n[12] Michael D. Escobar and Mike West. \u201cBayesian Density Estimation and Inference Using\n\nMixtures\u201d. In: J. Am. Stat. Assoc. 90 (1994), pp. 577\u2013588.\n\n[13] Mingyuan Zhou. \u201cBeta-Negative Binomial Process and Exchangeable Random Parti-\n\ntions for Mixed-Membership Modeling\u201d. In: Proc. NIPS. 2014, pp. 3455\u20133463.\n\n[14] Thomas L. Gri\ufb03ths and Zoubin Ghahramani. \u201cThe Indian Bu\ufb00et Process: An Intro-\n\nduction and Review\u201d. In: J. Mach. Learn. Res. 12 (2011), pp. 1185\u20131224.\n\n[15] David Blackwell and James B. MacQueen. \u201cFerguson distributions via Polya urn\n\nschemes\u201d. In: The Annals of Statistics 1 (1973), pp. 353\u2013355.\n\n[16] Bryan Klimat and Yiming Yang. \u201cThe Enron Corpus: A New Dataset for Email Clas-\n\nsi\ufb01cation Research\u201d. In: Proc. ECML. 2004, pp. 
217\u2013226.\n\n[17] MovieLens\n\ndataset,\n\nhttp://www.grouplens.org/.\n\nas\n\nof\n\n2003.\n\nurl:\n\nhttp://www.grouplens.org/.\n\n[18] Jesse Davis and Mark Goadrich. \u201cThe Relationship Between Precision-Recall and\n\nROC Curves\u201d. In: Proc. ICML. 2006, pp. 233\u2013240.\n\n[19] Mingyuan Zhou, Yulai Cong, and Bo Chen. \u201cThe Poisson Gamma Belief Network\u201d.\n\nIn: Proc. NIPS. 2015, pp. 3043\u20133051.\n\n[20] Changwei Hu, Piyush Rai, and Lawrence Carin. \u201cZero-Truncated Poisson Tensor Fac-\n\ntorization for Massive Binary Tensors\u201d. In: Proc. UAI. 2015, pp. 375\u2013384.\n\n9\n\n\f", "award": [], "sourceid": 299, "authors": [{"given_name": "Iku", "family_name": "Ohama", "institution": "Panasonic Corporation"}, {"given_name": "Issei", "family_name": "Sato", "institution": "The University of Tokyo/RIKEN"}, {"given_name": "Takuya", "family_name": "Kida", "institution": "Hokkaido University"}, {"given_name": "Hiroki", "family_name": "Arimura", "institution": "Hokkaido University"}]}