{"title": "Factorized Asymptotic Bayesian Inference for Latent Feature Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1214, "page_last": 1222, "abstract": "This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models~(LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hesqsian matrix of a complete log-likelihood, which is required to derive a factorized information criterion''~(FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency.\"", "full_text": "Factorized Asymptotic Bayesian Inference\n\nfor Latent Feature Models\n\nKohei Hayashi\u2217\u2020\n\n\u2217National Institute of Informatics\n\n\u2020JST, ERATO, Kawarabayashi Large Graph Project\n\nRyohei Fujimaki\n\nNEC Laboratories America\n\nrfujimaki@nec-labs.com\n\nkohei-h@nii.ac.jp\n\nAbstract\n\nThis paper extends factorized asymptotic Bayesian (FAB) inference for latent fea-\nture models (LFMs). FAB inference has not been applicable to models, includ-\ning LFMs, without a speci\ufb01c condition on the Hessian matrix of a complete log-\nlikelihood, which is required to derive a \u201cfactorized information criterion\u201d (FIC).\nOur asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs\nhas the same form as those of mixture models. 
FAB/LFMs have several desirable properties (e.g., automatic hidden state selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency.

1 Introduction

Factorized asymptotic Bayesian (FAB) inference is a recently-developed approximate Bayesian inference method for model selection of latent variable models [5, 6]. FAB inference maximizes a computationally tractable lower bound of a "factorized information criterion" (FIC), which converges to the marginal log-likelihood in the large sample limit. In applications to mixture models (MMs) and hidden Markov models, previous work has shown that FAB inference achieves model selection accuracy as good as, or better than, state-of-the-art non-parametric Bayesian (NPB) and variational Bayesian (VB) methods with less computational cost. One interesting characteristic of FAB inference is that it estimates both models (e.g., the number of mixture components for MMs) and parameter values without priors (i.e., it asymptotically ignores priors), and it has no hand-tunable hyper-parameters. With respect to the trade-off between controllability and automation, FAB inference places more importance on automation.

Although FAB inference is a promising model selection method, as yet it has only been applicable to models satisfying a specific condition: the Hessian matrix of the complete log-likelihood (i.e., of the log-likelihood over both observed and latent variables) must be block diagonal, with only a part of the observed samples contributing to each individual sub-block. Such models include basic latent variable models such as MMs [6]. The application of FAB inference to more advanced models that do not satisfy this condition remains to be accomplished.

This paper extends the FAB framework to latent feature models (LFMs) [9, 17]. Model selection for LFMs (i.e., determination of the dimensionality of latent features) has been addressed by NPB and VB methods [10, 3]. Although these have shown promising performance in applications such as link prediction [16], their high computational costs restrict their use on large-scale data.

Our asymptotic analysis of the Hessian matrix of the log-likelihood shows that FICs for LFMs have the same form as those for MMs, despite the fact that LFMs do not satisfy the condition explained above (see Lemma 1). Consequently, like FAB/MMs, FAB/LFMs offer several desirable properties, such as FIC convergence to the marginal log-likelihood, automatic hidden state selection, and a monotonic increase in the lower FIC bound through iterative optimization. Further, we conduct two analyses in Section 3: 1) we relate FAB E-steps to the convex concave procedure (CCCP) [29]; inspired by this analysis, we propose a shrinkage acceleration method which drastically reduces computational cost in practice; and 2) we show that FAB/LFMs have parameter identifiability. This analysis offers a natural guide to the merge post-processing of latent features. Rigorous proofs and assumptions with respect to the main results are given in the supplementary materials.

Notation In this paper, we denote the (i, j)-th element, the i-th row vector, and the j-th column vector of A by a_ij, a_i, and a_·j, respectively.

1.1 Related Work

FIC for MMs Suppose we have N × D observed data X and N × K latent variables Z.
FIC considers the following alternative representation of the marginal log-likelihood:

  log p(X|M) = max_q Σ_Z q(Z) log [ p(X, Z|M) / q(Z) ],   p(X, Z|M) = ∫ p(X, Z|P) p(P|M) dP,   (1)

where q(Z) is a variational distribution on Z; M and P are a model and its parameter, respectively. In the case of MMs, log p(X, Z|P) can be factorized into log p(Z) and log p(X|Z) = Σ_k log p_k(X|z_·k), where p_k is the k-th observation distribution (we omit parameters here for notational simplicity). We can then approximate p(X, Z|M) by individually applying Laplace's method [28] to log p(Z) and log p_k(X|z_·k):

  p(X, Z|M) ≈ p(X, Z|P̂) · (2π)^{D_Z/2} / ( N^{D_Z/2} det|F_Z|^{1/2} ) · Π_{k=1}^K (2π)^{D_k/2} / ( (Σ_n z_nk)^{D_k/2} det|F_k|^{1/2} ),   (2)

where P̂ is the maximum likelihood estimator (MLE) of p(X, Z|P).^1 D_Z and D_k are the parameter dimensionalities of p(Z) and p_k(X|z_·k), respectively. F_Z and F_k are −∇∇ log p(Z)|_P̂ / N and −∇∇ log p_k(X|z_·k)|_P̂ / (Σ_n z_nk), respectively. Under conditions for asymptotic ignoring of log det|F_Z| and log det|F_k|, substituting Eq. (2) into (1) gives the FIC for MMs as follows:

  FIC_MM ≡ max_q E_q[ log p(X, Z|P̂) − (D_Z/2) log N − Σ_k (D_k/2) log Σ_n z_nk ] + H(q),   (3)

where H(q) is the entropy of q(Z). The most important term in FIC_MM (3) is log(Σ_n z_nk), which offers theoretically desirable properties for FAB inference such as automatic shrinkage of irrelevant latent variables and parameter identifiability [6].

Direct optimization of FIC_MM is difficult because: (i) evaluation of E_q[log Σ_n z_nk] is computationally infeasible, and (ii) the MLE is not available in principle. Instead, FAB optimizes a tractable lower bound of the FIC [6].
For (i), since −log Σ_n z_nk is a convex function, its linear approximation at Nπ̃_k > 0 yields the lower bound:

  E_q[ −Σ_k (D_k/2) log Σ_n z_nk ] ≥ −Σ_k (D_k/2) ( log Nπ̃_k + ( Σ_n E_q[z_nk]/N − π̃_k ) / π̃_k ),   (4)

where 0 < π̃_k ≤ 1 is a linearization parameter. For (ii), since, by the definition of the MLE, the inequality log p(X, Z|P̂) ≥ log p(X, Z|P) holds for any P, we optimize P along with q. Alternating maximization of the lower bound with respect to q, P, and π̃ guarantees a monotonic increase in the FIC lower bound [6].

Infinite LFMs and Indian Buffet Process The IBP [10, 11] is a nonparametric prior over infinite LFMs. It enables us to express an infinite number of latent features, making it possible to adjust model complexity on the basis of observations. Infinite IBPs are still actively studied in terms of both applications (e.g., link prediction [16]) and model representations (e.g., latent attribute models [19]). Since naive Gibbs sampling requires unrealistic computational cost, acceleration algorithms such as accelerated sampling [2] and VB [3] have been developed. Reed and Ghahramani [22] have recently proposed an efficient MAP estimation framework for an IBP model via submodular optimization, referred to as maximum-expectation IBP (MEIBP).
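Returning to the linearization in Eq. (4): it is nothing more than the tangent-line bound of the convex function −log s at s_0 = Nπ̃_k. A minimal numeric check (a Python sketch; the function and variable names are ours, not from the paper):

```python
import math

def neg_log_lower_bound(s, N, pi_tilde):
    # Tangent of the convex function -log(s) at s0 = N * pi_tilde:
    # -log(s) >= -( log(N * pi_tilde) + (s/N - pi_tilde) / pi_tilde )
    return -(math.log(N * pi_tilde) + (s / N - pi_tilde) / pi_tilde)

N = 1000
for s in [1.0, 50.0, 400.0, 1000.0]:
    for pi_tilde in [0.05, 0.4, 1.0]:
        # convexity guarantees the bound for every linearization point
        assert -math.log(s) >= neg_log_lower_bound(s, N, pi_tilde) - 1e-12
```

The bound is tight exactly when s/N = π̃_k, which is why the M-step below sets π̃_k to the current mean activation.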
Similarly to FIC, "MAD-Bayes" [1] considers asymptotics of MMs and LFMs, but it is based on the limiting case in which the noise variance goes to zero, which yields a prior-derived regularization term.

^1 While p(X|P) is a non-regular model, p(X, Z|P) is a regular model (i.e., the Fisher information is non-singular at the ML estimator), and F_k and F_Z are invertible at P̂.

2 FIC and FAB Algorithm for LFMs

LFMs assume underlying relationships for X with binary features Z ∈ {0, 1}^{N×K} and linear bases W ∈ R^{D×K} such that, for n = 1, ..., N,

  x_n = W z_n + b + ε_n,   (5)

where ε_n ~ N(0, Λ^{−1}) is Gaussian noise with diagonal precision matrix Λ ≡ diag(λ), and b ∈ R^D is a bias term. For later convenience, we define the centered observation X̄ = X − 1b^T. Z follows a Bernoulli prior distribution z_nk ~ Bern(π_k) with mean parameter π_k. The parameter set P is defined as P ≡ {W, b, λ, π}. Also, we denote the parameters with respect to the d-th dimension as θ_d = (w_d, b_d, λ_d). As in other FAB frameworks, the log-priors of P are assumed to be constant with respect to N, i.e., lim_{N→∞} log p(P|M)/N = 0.

In the case of MMs, we implicitly use the facts that: A1) the parameters of p_k(X|z_·k) are mutually independent for k = 1, ..., K (in other words, ∇∇ log p(X|Z) is block diagonal with K blocks), and A2) the number of observations contributing to ∇∇ log p_k(X|z_·k) is Σ_n z_nk. These conditions naturally yield the FAB regularization term log Σ_n z_nk via the Laplace approximation for MMs (2). However, since θ_d is shared by all latent features in LFMs, A1 and A2 are not satisfied.
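For concreteness, the generative process (5) can be sampled as follows (a Python/NumPy sketch; the sizes mirror the block-data experiment in Section 4, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 2000, 36, 4           # sizes as in the block-data experiment

pi = np.full(K, 0.5)            # Bernoulli feature priors pi_k
Z = (rng.random((N, K)) < pi).astype(float)  # binary features z_nk
W = rng.normal(size=(D, K))     # linear bases
b = rng.normal(size=D)          # bias term
lam = np.ones(D)                # diagonal noise precisions, Lambda = diag(lam)

# x_n = W z_n + b + eps_n,  eps_n ~ N(0, Lambda^{-1})
X = Z @ W.T + b + rng.normal(size=(N, D)) / np.sqrt(lam)
X_bar = X - b                   # centered observation, X_bar = X - 1 b^T
```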
In the next section, we address this issue and derive the FIC for LFMs.

2.1 FICs for LFMs

The following lemma plays the most important role in our derivation of FICs for LFMs.

Lemma 1. Let F^(d) be the Hessian matrix of the negated log-likelihood with respect to θ_d, i.e., −∇∇ log p(x_·d|Z, θ_d). Under some mild assumptions (see the supplementary materials), the following equality holds:

  log det|F^(d)| = Σ_k log ( Σ_n z_nk / N ) + O_p(1).   (6)

An important fact is that the log Σ_n z_nk term naturally appears in log det|F^(d)| without A1 and A2. Lemma 1 induces the following theorem, which states an asymptotic approximation of the marginal complete log-likelihood log p(X, Z|M).

Theorem 2. If Lemma 1 holds and the joint marginal log-likelihood is bounded for sufficiently large N, it can be asymptotically approximated as:

  log p(X, Z|M) = J(Z, P̂) + O_p(1),   (7)

  J(Z, P) ≡ log p(X, Z|P) − ( (|P| − DK) / 2 ) log N − (D/2) Σ_k log Σ_n z_nk.   (8)

It is worth noting that, if we evaluated the model complexity of θ_d (log det|F^(d)|) by N, i.e., if we applied Laplace's method without Lemma 1, Eq. (7) would fall into the Bayesian Information Criterion [23]; Lemma 1 tells us that the model complexity relevant to θ_d increases not as O(K log N) but as O(Σ_k log Σ_n z_nk).

By substituting approximation (7) into Eq. (1), we obtain the FIC of the LFM as follows:

  FIC_LFM ≡ max_q E_q[ J(Z, P̂) ] + H(q).   (9)

It is interesting that FIC_LFM (9) and FIC_MM (3) have exactly the same representation despite the fact that LFMs do not satisfy A1 and A2.
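The complexity terms of J(Z, P) in Eq. (8) are easy to compute directly. A sketch (Python/NumPy; for the LFM of Eq. (5) we have |P| = DK + 2D + K from W, b, λ, and π, so |P| − DK = 2D + K; the function name is ours, and the model-specific log-likelihood term is omitted):

```python
import numpy as np

def fic_complexity(Z, D):
    """Complexity penalty of Eq. (8): ((|P| - D K)/2) log N
    + (D/2) sum_k log sum_n z_nk, with |P| - D K = 2 D + K."""
    N, K = Z.shape
    counts = Z.sum(axis=0)       # sum_n z_nk, assumed > 0 for every k
    return 0.5 * (2 * D + K) * np.log(N) + 0.5 * D * np.log(counts).sum()
```

When every feature is always active (counts equal to N) this coincides with the BIC-style penalty (|P|/2) log N; a rarely-active feature is penalized by log Σ_n z_nk < log N instead, which is what drives the automatic shrinkage discussed below.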
This indicates the wide applicability of FICs and suggests that FIC representations of approximated marginal log-likelihoods are feasible not only for MMs but also for more general (discrete) latent variable models.

Since the asymptotic constant terms of Eq. (7) are not affected by the expectation under q(Z), the difference between the FIC and the marginal log-likelihood is asymptotically constant; in other words, the distance between log p(X|M)/N and FIC_LFM/N is asymptotically small.

Corollary 3. For N → ∞, log p(X|M) = FIC_LFM + O_p(1) holds.

2.2 FAB/LFM Algorithm

As in the case of MMs (3), FIC_LFM is not available in practice, and we employ the lower bounding techniques (i) and (ii). For LFMs, we further introduce a mean-field approximation on Z, i.e., we restrict the class of q(z_n) to a factorized form: q(z_n) = Π_k q̃(z_nk|µ_nk), where q̃(z|µ) is a Bernoulli distribution with mean parameter µ = E_q[z]. Although this approximation makes the FIC lower bound looser (the equality in (1) no longer holds), the variational distribution then has a closed-form solution. Note that this approximation does not cause significant performance degradation in VB contexts [20, 25]. The VB extension of the IBP [3] also uses this factorized assumption.

By applying (i), (ii), and the mean-field approximation, we obtain the lower bound:

  L(q, P, π̃) = E_q[ log p(X|Z, Θ) + log p(Z|π) + RHS of (4) ] − ( (2D + K) / 2 ) log N + Σ_n H(q(z_n)).   (10)

An FAB algorithm alternatingly maximizes L(q, P, π̃) with respect to {{µ_n}, P, π̃}. Notice that the algorithm described below monotonically increases L in every single step, and therefore we are guaranteed to obtain a local maximum.
This monotonic increase in L gives us a natural stopping condition with tolerance δ: if (L^t − L^{t−1})/N < δ then stop the algorithm, where L^t denotes the value of L at the t-th iteration.

FAB E-step In the FAB E-step, we update µ_n in a way similar to variational mean-field inference in a restricted Boltzmann machine [20]. Taking the gradient of L with respect to µ_n and setting it to zero yields the following fixed-point equations:

  µ_nk = g( c_nk + η(π_k) − D/(2Nπ̃_k) ),   (11)

where g(x) = (1 + exp(−x))^{−1} is the sigmoid function, c_nk = w_·k^T Λ ( x̄_n − Σ_{l≠k} µ_nl w_·l − (1/2) w_·k ), and η(π_k) = log( π_k / (1 − π_k) ) is the natural parameter of the prior of z_·k. Update equation (11) is a form of coordinate descent, and every update is guaranteed to increase the lower bound [25]. After several iterations of Eq. (11) over k = 1, ..., K, we obtain a local maximum, with E_q[z_n] = µ_n and E_q[z_n z_n^T] = µ_n µ_n^T + diag(µ_n − µ_n²).

One unique term in Eq. (11) is −D/(2Nπ̃_k), which originates in the log Σ_n z_nk term in Eq. (8). In updating µ_nk by (11), the smaller π̃_k is (or, equivalently, π_k, by Eq. (12)), the smaller µ_nk becomes; and a smaller µ_nk in turn induces a smaller π̃_k (see Eq. (12)). This results in the shrinking of irrelevant features, and therefore FAB/LFMs are capable of automatically selecting the feature dimensionality K. This regularization effect arises independently of any prior (i.e., priors are asymptotically ignored) and is known as "model induced regularization", caused by Bayesian marginalization in singular models [18]. Notice that Eq. (11) offers another shrinking effect, by means of η(π_k), which is a prior-based regularization.
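The fixed-point sweep of Eq. (11) can be sketched as follows (Python/NumPy; a sketch under our own naming, not the authors' Matlab implementation):

```python
import numpy as np

def fab_e_step(X_bar, W, lam, mu, pi, pi_tilde, n_sweeps=3):
    """Fixed-point updates of Eq. (11).
    X_bar: centered data (N x D); W: bases (D x K); lam: noise precisions (D,);
    mu: current E_q[Z] (N x K); pi, pi_tilde: prior and linearization params (K,)."""
    N, D = X_bar.shape
    K = W.shape[1]
    eta = np.log(pi) - np.log1p(-pi)     # eta(pi_k) = log(pi_k / (1 - pi_k))
    for _ in range(n_sweeps):
        for k in range(K):
            # residual excluding feature k: x_bar_n - sum_{l != k} mu_nl w_l
            resid = X_bar - mu @ W.T + np.outer(mu[:, k], W[:, k])
            # c_nk = w_k^T Lambda (resid_n - w_k / 2)
            c_k = (resid - 0.5 * W[:, k]) @ (lam * W[:, k])
            # FAB regularizer -D/(2 N pi_tilde_k) shrinks irrelevant features
            a = c_k + eta[k] - D / (2.0 * N * pi_tilde[k])
            mu[:, k] = 1.0 / (1.0 + np.exp(-a))   # sigmoid g(x)
    return mu
```

Dropping the `D / (2.0 * N * pi_tilde[k])` term recovers the variational-EM update mentioned below.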
We empirically show that the latter shrinking effect is too weak to mitigate over-fitting, and that the FAB algorithm achieves faster convergence, with respect to N, to the true model (see Section 4). Note that if we use only the effect of η(π_k) (i.e., set D/(2Nπ̃_k) = 0), then update equation (11) is equivalent to that of variational EM.

FAB M-step The FAB M-step is equivalent to the M-step of the EM algorithm for LFMs; the solutions for W, Λ, and b are given in closed form and are exactly the same as those of PPCA [24] (see the supplementary materials). For π̃ and π, we obtain the following solutions:

  π_k = π̃_k = Σ_n µ_nk / N.   (12)

Shrinkage step As we have explained, the FAB regularization term D/(2Nπ̃_k) in Eq. (11) in principle automatically eliminates irrelevant latent features. While this elimination does not change the value of E_q[log p(X|Z, P)], removing such features from the model increases L owing to the decrease in model complexity. We eliminate shrunken features after the FAB E-step, noting that LFMs approximate X by Σ_k µ_·k w_·k^T + 1b^T. When Σ_n µ_nk/N = 0, the k-th feature does not affect the approximation ( Σ_l µ_·l w_·l^T = Σ_{l≠k} µ_·l w_·l^T ), and we simply remove it. When Σ_n µ_nk/N = 1, w_·k can be seen as a bias ( Σ_l µ_·l w_·l^T = Σ_{l≠k} µ_·l w_·l^T + 1w_·k^T ), and we update b_new = b + w_·k and then remove it.

Algorithm 1 The FAB algorithm for LFMs.
 1: Initialize {µ_n}
 2: while not converged do
 3:   Update P
 4:   accelerateShrinkage({µ_n})
 5:   for k = 1, ..., K do
 6:     Update {µ_nk} by Eq. (11)
 7:   end for
 8:   Shrink unnecessary latent features
 9:   if (L^t − L^{t−1})/N < δ then
10:     {{µ'_n}, W'} ← merge({µ_n}, W)
11:     if dim(W') = dim(W) then converge
12:     else {µ_n} ← {µ'_n}, W ← W'
13:   end if
14: end while

Algorithm 2 accelerateShrinkage
input {µ_n}
 1: for k = 1, ..., K do
 2:   c_·k ← ( X̄ − Σ_{l≠k} µ_·l w_·l^T − (1/2) 1w_·k^T ) Λ w_·k
 3:   for t = 1, ..., T_shrink do
 4:     Update {µ_nk} by Eq. (11)
 5:     Update π and π̃ by Eq. (12)
 6:   end for
 7: end for

Figure 1: Time evolution of K (top) and L/N (bottom) in FAB with and without shrinkage acceleration (D = 50 and K = 5). Different lines represent different random starts.

This model shrinkage also works to avoid ill-conditioning of the FIC: if there are latent features that are never activated (Σ_n µ_nk/N = 0) or always activated (Σ_n µ_nk/N = 1), the FIC is no longer an approximation of the marginal log-likelihood. Algorithm 1 summarizes the whole procedure of FAB/LFMs. Details of the sub-routines accelerateShrinkage() and merge() are explained in Section 3.

3 Analysis and Refinements

CCCP Interpretation and Shrinkage Acceleration Here we interpret the alternating updates of µ and π̃ as a convex concave procedure (CCCP) [29] and consider eliminating irrelevant features in early steps to reduce computational cost. By substituting the optimality condition π̃_k = Σ_n µ_nk / N (12) into the lower bound, we obtain

  L(q) = −(D/2) Σ_k log ( Σ_n µ_nk ) + Σ_n (c_n + η)^T µ_n + H(q) + const.   (13)

The first and second terms are convex and concave with respect to µ_nk, respectively.
The CCCP solves Eq. (13) by iteratively linearizing the first term around µ^{t−1}_nk. Setting the derivative of the "linearized" objective to zero, we obtain the CCCP update:

  µ^t_nk = g( c_nk + η(π_k) − D / ( 2 Σ_n µ^{t−1}_nk ) ).   (14)

Taking Nπ̃_k = Σ_n µ^{t−1}_nk into account, Eq. (14) is equivalent to Eq. (11).

This new view of the FAB optimization gives us an important insight for accelerating the algorithm. Considering the FAB optimization as alternating maximization in terms of P and µ (with π̃ removed), it is natural to take multiple CCCP steps (14). Such multiple CCCP steps in each FAB-EM step are expected to accelerate the shrinkage effect discussed in the previous section, because it is the regularization term −(D/2) log(Σ_n µ_nk) that causes the effect. Eventually, this is expected to reduce the total computational cost, since we may be able to remove irrelevant latent features in earlier iterations. We summarize the whole routine of accelerateShrinkage(), based on the CCCP, in Algorithm 2. Note that, in practice, we update π along with π̃ for further acceleration of the shrinkage. We empirically confirmed that Algorithm 2 significantly reduces computational costs (see Section 4 and Figure 1). Further discussion of this update (an exponentiated gradient descent interpretation) can be found in the supplementary materials.

Identifiability and Merge Post-processing Parameter identifiability is an important theoretical aspect of learning algorithms for latent variable models.
It has been known [26, 27] that generalization error significantly worsens if the mapping between parameters and functions is not one-to-one (i.e., is non-identifiable). Consider the LFM case of K = 2. If w_·1 = w_·2, then any combination of µ_n1 and µ_n2 = 2µ − µ_n1 has the same representation, E_q[E_x[x̄_nd|θ_d]] = w_d1(µ_n1 + µ_n2) = 2w_d1 µ, and therefore the MLE is non-identifiable.

The following theorem shows that FAB inference resolves such non-identifiability in LFMs.

Theorem 4. Let P* and q* be stationary points of L such that 0 < Σ_n µ*_nk/N < 1 for k = 1, ..., K and |x̄_n^T Λ* w*_·k| < ∞ for k = 1, ..., K, n = 1, ..., N. Then, w*_·k = w*_·l is a sufficient condition for Σ_n µ*_nk/N = Σ_n µ*_nl/N.

For the ill-conditioned situation described above, the FAB algorithm has a unique solution that balances the sizes of latent features. In the large sample limit, both FAB and EM reach the same ML value. The point is that, for LFMs, ML solutions are not unique, and EM is likely to choose large-K solutions because of this non-identifiability issue. On the other hand, FAB prefers small-K ML solutions on the basis of the regularizer. In addition, Theorem 4 gives us an important insight about post-processing of latent features. If w*_·k = w*_·l, then E_q[log p(X, Z|M*)] is unchanged regardless of µ_·k and µ_·l, while the model complexity is smaller if we have only one latent feature. Therefore, if w*_·k = w*_·l, merging these two latent features increases L, i.e., we set w*_·k ← 2w*_·k and µ*_·k ← (µ*_·k + µ*_·l)/2. In practice, we search for such overlapping features on the basis of the Euclidean distance matrix of w*_·k for k = 1, ..., K and merge them if the lower bound increases after the post-processing.
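The merge rule above preserves the reconstruction exactly: µ_·k w_·k^T + µ_·l w_·l^T = ((µ_·k + µ_·l)/2)(2w_·k)^T when w_·k = w_·l. A sketch of this search-and-merge step (Python/NumPy; names are ours, and unlike the paper's merge() we skip the check that the bound L actually increases):

```python
import numpy as np

def merge_duplicate_features(mu, W, tol=1e-6):
    """Merge near-duplicate bases as suggested by Theorem 4.
    mu: (N, K) variational means; W: (D, K) bases."""
    keep = list(range(W.shape[1]))
    merged = True
    while merged:
        merged = False
        for i in range(len(keep)):
            for j in range(i + 1, len(keep)):
                a, b = keep[i], keep[j]
                if np.linalg.norm(W[:, a] - W[:, b]) < tol:
                    # w_a == w_b: replace the pair by one feature with
                    # mu <- (mu_a + mu_b)/2 and w <- 2 w_a, which keeps
                    # mu_a w_a^T + mu_b w_b^T unchanged
                    mu[:, a] = 0.5 * (mu[:, a] + mu[:, b])
                    W[:, a] = 2.0 * W[:, a]
                    keep.pop(j)
                    merged = True
                    break
            if merged:
                break
    return mu[:, keep], W[:, keep]
```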
We empirically found that a few merging operations were likely to occur in real-world data sets. The algorithm of merge() is summarized in the supplementary materials.

4 Experiments

We evaluated FAB/LFMs in terms of computational speed, model selection accuracy, and prediction performance with respect to missing values. We compared FAB inference and the variational EM algorithm (see Section 2.2) with an IBP that utilizes fast Gibbs sampling [2], a VB [3] with finite K, and MEIBP [22]. IBP and MEIBP select the model which maximizes posterior probability. For VB, we performed inference with K = 2, ..., D and selected the model having the highest free energy. EM selects K using the shrinkage effect of η, as explained in Section 2.2.

All methods were implemented in Matlab (for IBP, VB, and MEIBP, we used the original codes released by the authors), and computational performance was compared fairly. For FAB and EM, we set δ = 10^{−4} (the results were not sensitive to this) and T_shrink = 100 (FAB only); {µ_n} were initialized randomly and uniformly between 0 and 1, and the initial number of latent features was set to min(N, D), as for MEIBP. Since the software for IBP, VB, and MEIBP does not learn the standard deviation of the noise (1/√λ in FAB), we fixed it to 1 for the artificial simulations, which is the true standard deviation of the toy data, and to 0.75 for the real data, following the original papers [2, 22]. We set other parameters to the software default values. For example, α, a hyperparameter of the IBP, was set to 3, which might cause overestimation of K. As common preprocessing, we normalized X (i.e., to unit sample variance) in all experiments.

Artificial Simulations We first conducted artificial simulations with fully-observed synthetic data generated by model (5) with fixed λ_k = 1 and π_k = 0.5. Figure 1 shows the results of a comparison between FAB with and without shrinkage acceleration.^2 Clearly, our shrinkage acceleration significantly reduced computational cost by eliminating irrelevant features in the early steps, while both algorithms achieved roughly the same objective value L and model selection performance at convergence. Figure 2 shows the results of a comparison between FAB (with acceleration) and the other methods. While MEIBP was much faster than FAB in terms of elapsed computational time, FAB achieved the most accurate estimation of K, especially for large N.

^2 We also investigated the effect of merge post-processing, but none was observed in this small example.

Figure 2: Comparative evaluation on the artificial simulations in terms of N vs. elapsed time (left) and selected K (right). Each error-bar shows the standard deviation over 10 trials (D = 30).

Figure 3: Learned Ws in the block data.

Block Data We next demonstrate the performance of FAB/LFMs in terms of learning features. We used the block data, a synthetic data set originally used in [10]. Observations were generated by combining four distinct patterns (i.e., K = 4; see Figure 3) with Gaussian noise on 6-by-6 pixels (i.e., D = 36). We present the results for N = 2000 samples with noise standard deviation 0.3 and no missing values (more results can be found in the supplementary materials). Figure 3 compares the estimated features of each method in the early learning phase (at the 5th iteration) and after convergence (the result displayed is the example having the median log-likelihood over 10 trials). Note that we omitted MEIBP, since we observed that its parameter setting was very sensitive for this data. While EM and IBP retain irrelevant features, FAB successfully extracts the true patterns without irrelevant features.

Real World Data We finally evaluated predictive performance using the real data sets described in Table 1. We randomly removed 30% of the data with 5 different random seeds and treated these entries as missing values, and we measured predictive and training log-likelihood (PLL and TLL) for them. Table 1 summarizes the results with respect to elapsed computational time (hours), selected K, PLL, and TLL. Note that, when the computational time for a method exceeded 50 hours, we stopped the program after that iteration.^3 Since MEIBP is a method for non-negative data, we omit its results for data sets containing negative values. Also, since MEIBP did not finish the first iteration within 50 hours for the yaleB and USPS data, we set its initial K to 100 for them. FAB consistently achieved good predictive performance (higher PLL) with low computational cost. Although MEIBP ran faster than FAB with an appropriately set initial value of K (i.e., on yaleB and USPS), the PLLs of FAB were much better than those of MEIBP. In terms of K, FAB typically achieved a more compact and better model representation than the others (smaller K). Another important observation is that FAB has much smaller differences between TLL and PLL than the others. This suggests that FAB's unique regularization works well for mitigating over-fitting.
^3 We totally omitted VB because of its long computational time.

Table 1: Results on real-world data sets. The best result (e.g., the smallest K in model selection) and those not significantly worse than it are highlighted in boldface. We used a one-sided t-test with 95% confidence. *We exclude the results of MEIBP for yaleB and USPS from the t-test because of the different experimental settings (initial K was smaller than for the others; see the body text for details).

Data (N × D)            Method  Time (h)  K            PLL             TLL
Sonar [4] 208 × 49      FAB     < 0.01    4.4 ± 1.1    −1.25 ± 0.02    −1.14 ± 0.03
                        EM      < 0.01    48.8 ± 0.5   −4.04 ± 0.46    −0.08 ± 0.07
                        IBP     3.3       69.6 ± 4.8   −4.48 ± 0.15    0.13 ± 0.02
                        MEIBP   < 0.01    45.4 ± 1.7   −18.10 ± 1.90   −15.60 ± 1.80
Libras [4] 360 × 90     FAB     < 0.01    19.0 ± 0.7   −0.63 ± 0.03    −0.42 ± 0.03
                        EM      0.01      75.6 ± 8.6   −0.68 ± 0.11    0.76 ± 0.24
                        IBP     4.8       36.4 ± 1.1   −0.18 ± 0.01    0.13 ± 0.01
                        MEIBP   0.05      40.8 ± 1.3   −11.30 ± 2.00   −10.70 ± 1.80
Auslan [14] 16180 × 22  FAB     0.04      6.0 ± 0.7    −1.34 ± 0.15    −0.92 ± 0.02
                        EM      0.2       22 ± 0       −1.79 ± 0.27    −0.78 ± 0.02
                        IBP     50.2      73 ± 5       −4.54 ± 0.08    0.08 ± 0.01
                        MEIBP   N/A       N/A          N/A             N/A
EEG [12] 120576 × 32    FAB     1.6       11.2 ± 1.6   −0.93 ± 0.02    −0.76 ± 0.04
                        EM      3.7       32 ± 0       −0.88 ± 0.09    −0.59 ± 0.01
                        IBP     53.0      46.4 ± 4.4   −3.16 ± 0.03    −0.26 ± 0.05
                        MEIBP   N/A       N/A          N/A             N/A
Piano [21] 57931 × 161  FAB     19.4      58.0 ± 3.5   −0.83 ± 0.01    −0.63 ± 0.02
                        EM      50.1      158.6 ± 3.4  −0.82 ± 0.02    −0.45 ± 0.01
                        IBP     55.8      89.6 ± 4.2   −1.83 ± 0.02    −0.84 ± 0.05
                        MEIBP   14.3      48.4 ± 3.2   −7.14 ± 0.52    −6.90 ± 0.50
yaleB [7] 2414 × 1024   FAB     2.2       77.2 ± 7.9   −0.37 ± 0.02    −0.29 ± 0.03
                        EM      50.9      929 ± 20     −4.60 ± 1.20    0.80 ± 0.27
                        IBP     51.7      94.2 ± 7.5   −0.54 ± 0.02    −0.35 ± 0.02
                        MEIBP*  7.2       69.8 ± 2.7   −1.18 ± 0.02    −1.12 ± 0.02
USPS [13] 110000 × 256  FAB     11.2      110.2 ± 5.1  −0.96 ± 0.01    −0.64 ± 0.02
                        EM      45.7      256 ± 0      −1.06 ± 0.01    −0.36 ± 0.01
                        IBP     61.6      181.0 ± 4.8  −2.59 ± 0.08    −0.76 ± 0.01
                        MEIBP*  1.9       22.0 ± 2.7   −1.35 ± 0.03    −1.31 ± 0.03

For the large sample data sets (EEG, Piano, USPS), the PLLs of FAB and EM were competitive with one another; this is reasonable: for large N, both ideally achieve the maximum likelihood, while FAB achieved much smaller K (see the identifiability discussion in Section 3). In small-N scenarios, on the other hand, the FIC approximation would not be accurate, and FAB could perform worse than NPB methods (although we observed such a case only for Libras).

5 Summary

We have considered here an FAB framework for LFMs that offers fully automated model selection, i.e., selecting the number of latent features. While LFMs do not satisfy the assumptions that naturally induce FIC/FAB for MMs, we have shown that the approximated marginal log-likelihood has the same "degree" of model complexity, and we have derived FIC/FAB in a form similar to that for MMs. In addition, our proposed mechanism for accelerating model shrinkage drastically reduces total computational time.
Experimental comparisons of FAB inference with existing methods, including state-of-the-art IBP methods, have demonstrated the superiority of FAB/LFMs.

Acknowledgments

The authors would like to thank Finale Doshi-Velez for providing the Piano and EEG data sets. This work was supported by JSPS KAKENHI Grant Number 25880028.

References
[1] T. Broderick, B. Kulis, and M. I. Jordan. MAD-Bayes: MAP-based Asymptotic Derivations from Bayes. In ICML, 2013.
[2] F. Doshi-Velez and Z. Ghahramani. Accelerated sampling for the Indian buffet process. In ICML, 2009.
[3] F. Doshi-Velez, K. T. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian buffet process. In AISTATS, 2009.
[4] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[5] R. Fujimaki and K. Hayashi. Factorized asymptotic Bayesian hidden Markov model. In ICML, 2012.
[6] R. Fujimaki and S. Morinaga. Factorized asymptotic Bayesian inference for mixture modeling. In AISTATS, 2012.
[7] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643–660, 2001.
[8] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, 1995.
[9] Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models (with discussion). In 8th Valencia International Meeting on Bayesian Statistics, 2006.
[10] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process, 2005.
[11] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. JMLR, 12:1185–1224, 2011.
[12] U. Hoffmann, G. Garcia, J. M. Vesin, K. Diserens, and T. Ebrahimi.
A boosting approach to P300 detection with application to brain-computer interfaces. In International IEEE EMBS Conference on Neural Engineering, pages 97–100, 2005.
[13] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[14] M. W. Kadous. Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. PhD thesis, School of Computer Science & Engineering, University of New South Wales, 2002.
[15] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
[16] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In NIPS, 2009.
[17] K. T. Miller. Bayesian Nonparametric Latent Feature Models. PhD thesis, University of California, Berkeley, 2011.
[18] S. Nakajima, M. Sugiyama, and D. Babacan. On Bayesian PCA: Automatic dimensionality selection and analytic solution. In ICML, 2011.
[19] K. Palla, D. A. Knowles, and Z. Ghahramani. An infinite latent attribute model for network data. In ICML, 2012.
[20] C. Peterson and J. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.
[21] G. E. Poliner and D. P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal of Advances in Signal Processing, 2007(1):154, 2007.
[22] C. Reed and Z. Ghahramani. Scaling the Indian buffet process via submodular maximization. In ICML, 2013.
[23] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[24] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611–622, 1999.
[25] M. J. Wainwright and M. I. Jordan.
Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[26] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001.
[27] S. Watanabe. Algebraic Geometry and Statistical Learning Theory (Cambridge Monographs on Applied and Computational Mathematics). Cambridge University Press, 2009.
[28] R. Wong. Asymptotic Approximation of Integrals (Classics in Applied Mathematics). SIAM, 2001.
[29] A. L. Yuille and A. Rangarajan. The Concave-Convex procedure. Neural Computation, 15(4):915–936, 2003.
[30] R. S. Zemel and G. E. Hinton. Learning population codes by minimizing description length. Neural Computation, 7(3):11–18, 1994.