{"title": "Variational Bayesian Stochastic Complexity of Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": null, "full_text": "Variational Bayesian Stochastic Complexity of Mixture Models\nKazuho Watanabe Department of Computational Intelligence and Systems Science Tokyo Institute of Technology Mail Box:R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503, Japan kazuho23@pi.titech.ac.jp\n\nSumio Watanabe P&I Lab. Tokyo Institute of Technology swatanab@pi.titech.ac.jp\n\nAbstract\nThe Variational Bayesian framework has been widely used to approximate Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of mixtures of exponential families and provide additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of true Bayesian learning.\n\n1 Introduction\n\nThe Variational Bayesian (VB) framework has been widely used as an approximation of Bayesian learning for models involving hidden (latent) variables such as mixture models [2][4]. This framework provides computationally tractable posterior distributions at only modest computational cost, in contrast to Markov chain Monte Carlo (MCMC) methods. In many applications, it has shown better generalization than maximum likelihood estimation. In spite of its tractability and its wide range of applications, little has been done to investigate the theoretical properties of the Variational Bayesian learning itself. 
For example, the question of how accurately it approximates the true Bayesian learning remained unanswered until quite recently. To address these issues, the stochastic complexity in the Variational Bayesian learning of Gaussian mixture models was clarified and the accuracy of the Variational Bayesian learning was discussed [10].\n\nThis work was supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for JSPS Fellows 4637 and for Scientific Research 15500130, 2005.\n\nIn this paper, we focus on the Variational Bayesian learning of more general mixture models, namely the mixtures of exponential families, which include mixtures of distributions such as Gaussian, binomial and gamma. Mixture models are known to be non-regular statistical models due to the non-identifiability of parameters caused by their hidden variables [7]. In some recent studies, the Bayesian stochastic complexities of non-regular models have been clarified and it has been proven that they become smaller than those of regular models [12][13]. This indicates an advantage of the Bayesian learning when it is applied to non-regular models. As our main results, asymptotic upper and lower bounds are obtained for the stochastic complexity, or the free energy, in the Variational Bayesian learning of the mixture of exponential families. The stochastic complexity is an important quantity for model selection, and giving its asymptotic form also contributes to the following two issues. One is the accuracy of the Variational Bayesian learning as an approximation method, since the stochastic complexity shows the distance from the variational posterior distribution to the true Bayesian posterior distribution in the sense of Kullback information. Indeed, we give the asymptotic form of the stochastic complexity as F̄(n) ≃ λ̄ log n, where n is the sample size, and by comparing the coefficient λ̄ with that of the true Bayesian learning, we discuss the accuracy of the VB approach. 
The other is the influence of the hyperparameters on the learning process. Since the Variational Bayesian algorithm is a procedure of minimizing the functional that finally gives the stochastic complexity, the derived bounds indicate how the hyperparameters influence the process of the learning. Our results have an implication for how to determine the hyperparameter values before the learning process. We consider the case in which the true distribution is contained in the learner model. Analyzing the stochastic complexity in this case is most valuable for comparing the Variational Bayesian learning with the true Bayesian learning, because the advantage of the Bayesian learning is typical in this case [12]. Furthermore, this analysis is necessary and essential for addressing the model selection problem and hypothesis testing. The paper is organized as follows. In Section 2, we introduce the mixture of exponential family model. In Section 3, we describe the Bayesian learning. In Section 4, the Variational Bayesian framework is described and the variational posterior distribution for the mixture of exponential family model is derived. In Section 5, we present our main result. Discussion and conclusion follow in Section 6.\n\n2 Mixture of Exponential Family\n\nDenote by c(x|b) a density function of the input x ∈ R^N given an M-dimensional parameter vector b = (b^(1), b^(2), ..., b^(M))^T ∈ B, where B is a subset of R^M. The general mixture model p(x|θ) with a parameter vector θ is defined by\n\np(x|θ) = Σ_{k=1}^{K} a_k c(x|b_k),\n\nwhere the integer K is the number of components and {a_k | a_k ≥ 0, Σ_{k=1}^{K} a_k = 1} is the set of mixing proportions. The model parameter θ is {a_k, b_k}_{k=1}^{K}. 
A mixture model is called a mixture of exponential family (MEF) model or exponential family mixture model if the probability distribution c(x|b) of each component is given by the following form,\n\nc(x|b) = exp{b·f(x) + f_0(x) - g(b)}, (1)\n\nwhere b ∈ B is called the natural parameter, b·f(x) is its inner product with the vector f(x) = (f_1(x), ..., f_M(x))^T, and f_0(x) and g(b) are real-valued functions of the input x and the parameter b, respectively [3]. Suppose that the functions f_1, ..., f_M and a constant function are linearly independent, which means that the effective number of parameters in a single component distribution c(x|b) is M. The conjugate prior distribution φ(θ) for the MEF model is given by the product of the following two distributions on a = {a_k}_{k=1}^{K} and b = {b_k}_{k=1}^{K},\n\nφ(a) = [Γ(Kφ_0) / Γ(φ_0)^K] Π_{k=1}^{K} a_k^{φ_0 - 1}, (2)\n\nφ(b) = Π_{k=1}^{K} φ(b_k) = Π_{k=1}^{K} exp{ξ_0(b_k·ν_0 - g(b_k))} / C(ξ_0, ν_0), (3)\n\nwhere ξ_0 > 0, ν_0 ∈ R^M and φ_0 > 0 are constants called hyperparameters and\n\nC(ξ, μ) = ∫ exp{ξ(μ·b - g(b))} db (4)\n\nis a function of ξ ∈ R and μ ∈ R^M.\n\nThe mixture model can be rewritten as follows by using a hidden variable y = (y_1, ..., y_K) ∈ {(1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1)},\n\np(x, y|θ) = Π_{k=1}^{K} [a_k c(x|b_k)]^{y_k}.\n\nThe component y_k equals 1 if and only if the datum x is generated from the kth component.\n\n3 The Bayesian Learning\n\nSuppose n training samples X^n = {x_1, ..., x_n} are independently and identically taken from the true distribution p_0(x). In the Bayesian learning of a model p(x|θ) whose parameter is θ, first, the prior distribution φ(θ) on the parameter θ is set. Then the posterior distribution p(θ|X^n) is computed from the given dataset and the prior by\n\np(θ|X^n) = (1/Z(X^n)) exp(-n H_n(θ)) φ(θ), (5)\n\nwhere H_n(θ) is the empirical Kullback information,\n\nH_n(θ) = (1/n) Σ_{i=1}^{n} log [p_0(x_i) / p(x_i|θ)], (6)\n\nand Z(X^n) is the normalization constant that is also known as the marginal likelihood or the evidence of the dataset X^n [6]. 
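As a hedged illustration of the component form (1) and the empirical Kullback information (6), the sketch below writes a unit-variance Gaussian in exponential family form with f(x) = x, f_0(x) = -x^2/2 - (1/2) log 2π and g(b) = b^2/2, so that M = 1 and the natural parameter b is the mean. The function names, the sample values and the parameter settings are assumptions made for illustration and do not come from the paper.

```python
import math

def component_density(x, b):
    # c(x|b) = exp{b*f(x) + f0(x) - g(b)} with f(x) = x,
    # f0(x) = -x^2/2 - log(2*pi)/2 and g(b) = b^2/2
    # (assumed example: unit-variance Gaussian with mean b).
    f0 = -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
    return math.exp(b * x + f0 - 0.5 * b * b)

def mixture_density(x, a, b):
    # p(x|theta) = sum_k a_k c(x|b_k)
    return sum(ak * component_density(x, bk) for ak, bk in zip(a, b))

def empirical_kullback(xs, true_params, model_params):
    # H_n(theta) = (1/n) sum_i log(p0(x_i) / p(x_i|theta)), as in (6).
    a0, b0 = true_params
    a, b = model_params
    total = 0.0
    for x in xs:
        total += math.log(mixture_density(x, a0, b0) / mixture_density(x, a, b))
    return total / len(xs)

xs = [-1.2, -0.3, 0.1, 0.8, 2.5]        # hypothetical samples
true_params = ([1.0], [0.0])            # true distribution with one component
same_model = ([0.5, 0.5], [0.0, 0.0])   # two components, identical density
other_model = ([0.5, 0.5], [-1.0, 1.0])
print(empirical_kullback(xs, true_params, same_model))
print(empirical_kullback(xs, true_params, other_model))
```

The setting named same_model shows the non-identifiability mentioned in the introduction: a model with a redundant component represents the true density exactly, so H_n vanishes there even though the parameter is not unique.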
The Bayesian predictive distribution p(x|X^n) is given by averaging the model over the posterior distribution as follows,\n\np(x|X^n) = ∫ p(x|θ) p(θ|X^n) dθ. (7)\n\nThe stochastic complexity F(X^n) is defined by\n\nF(X^n) = -log Z(X^n), (8)\n\nwhich is also called the free energy and is important in most data modelling problems. Practically, it is used as a criterion by which the model is selected and the hyperparameters in the prior are optimized [1][9]. Define the average stochastic complexity F(n) by\n\nF(n) = E_{X^n}[F(X^n)], (9)\n\nwhere E_{X^n}[·] denotes the expectation value over all sets of training samples. Recently, it was proved that F(n) has the following asymptotic form [12],\n\nF(n) ≃ λ log n - (m - 1) log log n + O(1), (10)\n\nwhere λ and m are a rational number and a natural number, respectively, which are determined by the singularities of the set of true parameters. In regular statistical models, 2λ is equal to the number of parameters and m = 1, whereas in non-regular models such as mixture models, 2λ is not larger than the number of parameters and m ≥ 1. This means an advantage of the Bayesian learning. However, in the Bayesian learning, one computes the stochastic complexity or the predictive distribution by integrating over the posterior distribution, which typically cannot be performed analytically. As an approximation, the VB framework was proposed [2][4].\n\n4 The Variational Bayesian Learning\n\n4.1 The Variational Bayesian Framework\n\nIn the VB framework, the Bayesian posterior p(Y^n, θ|X^n) of the hidden variables and the parameters is approximated by the variational posterior q(Y^n, θ|X^n), which factorizes as\n\nq(Y^n, θ|X^n) = Q(Y^n|X^n) r(θ|X^n), (11)\n\nwhere Q(Y^n|X^n) and r(θ|X^n) are posteriors on the hidden variables and the parameters respectively. 
The variational posterior q(Y^n, θ|X^n) is chosen to minimize the functional F̄[q] defined by\n\nF̄[q] = Σ_{Y^n} ∫ q(Y^n, θ|X^n) log [q(Y^n, θ|X^n) p_0(X^n) / p(X^n, Y^n, θ)] dθ (12)\n\n= F(X^n) + K(q(Y^n, θ|X^n)||p(Y^n, θ|X^n)), (13)\n\nwhere K(q(Y^n, θ|X^n)||p(Y^n, θ|X^n)) is the Kullback information between the true Bayesian posterior p(Y^n, θ|X^n) and the variational posterior q(Y^n, θ|X^n) (footnote 1). This leads to the following theorem. The proof is well known [8].\n\nFootnote 1: K(q(x)||p(x)) denotes the Kullback information from a distribution q(x) to a distribution p(x), that is, K(q(x)||p(x)) = ∫ q(x) log [q(x) / p(x)] dx.\n\nTheorem 1 If the functional F̄[q] is minimized under the constraint (11), then the variational posteriors r(θ|X^n) and Q(Y^n|X^n) satisfy\n\nr(θ|X^n) = (1/C_r) φ(θ) exp⟨log p(X^n, Y^n|θ)⟩_{Q(Y^n|X^n)}, (14)\n\nQ(Y^n|X^n) = (1/C_Q) exp⟨log p(X^n, Y^n|θ)⟩_{r(θ|X^n)}, (15)\n\nwhere C_r and C_Q are the normalization constants (footnote 2).\n\nWe define the stochastic complexity in the VB learning F̄(X^n) by the minimum value of the functional F̄[q], that is,\n\nF̄(X^n) = min_{r,Q} F̄[q],\n\nwhich shows the accuracy of the VB approach as an approximation of the Bayesian learning. F̄(X^n) is also used for model selection since it gives an upper bound of the true Bayesian stochastic complexity F(X^n).\n\n4.2 Variational Posterior for MEF Model\n\nIn this subsection, we derive the variational posterior r(θ|X^n) for the MEF model based on (14) and then define the variational parameter for this model. Using the complete data {X^n, Y^n} = {(x_1, y_1), ..., (x_n, y_n)}, we put\n\nȳ_i^k = ⟨y_i^k⟩_{Q(Y^n)}, n̄_k = Σ_{i=1}^{n} ȳ_i^k, and ν̄_k = (1/n̄_k) Σ_{i=1}^{n} ȳ_i^k f(x_i),\n\nwhere y_i^k = 1 if and only if the ith datum x_i is from the kth component. The variable n̄_k is the expected number of the data that are estimated to be from the kth component. 
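The statistics defined above can be computed directly once the expected assignments are available. The following is a minimal sketch, assuming that the expected assignments of datum i to component k are given as an n-by-K table (here called resp) and that f maps a datum to a list of length M. Both names and the toy numbers are illustrative assumptions, not part of the paper.

```python
def sufficient_statistics(xs, resp, f):
    # resp[i][k] plays the role of the expected assignment of datum i
    # to component k (assumed to be supplied by the caller).
    n, K = len(xs), len(resp[0])
    M = len(f(xs[0]))
    # n_bar[k]: expected number of data assigned to component k.
    n_bar = [sum(resp[i][k] for i in range(n)) for k in range(K)]
    nu_bar = []
    for k in range(K):
        weighted = [0.0] * M
        for i in range(n):
            fx = f(xs[i])
            for m in range(M):
                weighted[m] += resp[i][k] * fx[m]
        # nu_bar[k] = (1 / n_bar[k]) * sum_i resp[i][k] * f(x_i);
        # this sketch assumes n_bar[k] > 0.
        nu_bar.append([w / n_bar[k] for w in weighted])
    return n_bar, nu_bar

xs = [0.0, 1.0, 2.0, 3.0]                                  # hypothetical data
resp = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]]    # hypothetical assignments
n_bar, nu_bar = sufficient_statistics(xs, resp, lambda x: [x])
print(n_bar)
print(nu_bar)
```

Because each row of resp sums to one, the entries of n_bar sum to the sample size n.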
From (14) and the respective priors (2) and (3), the variational posterior r(θ) is obtained as the product of the following two distributions (footnote 3),\n\nr(a) = [Γ(n + Kφ_0) / Π_{k=1}^{K} Γ(n̄_k + φ_0)] Π_{k=1}^{K} a_k^{n̄_k + φ_0 - 1}, (16)\n\nr(b) = Π_{k=1}^{K} r(b_k) = Π_{k=1}^{K} (1/C(γ_k, μ_k)) exp{γ_k(μ_k·b_k - g(b_k))}, (17)\n\nwhere μ_k = (n̄_k ν̄_k + ξ_0 ν_0) / (n̄_k + ξ_0) and γ_k = n̄_k + ξ_0. Let\n\nā_k = ⟨a_k⟩_{r(a)} = (n̄_k + φ_0) / (n + Kφ_0), (18)\n\nb̄_k = ⟨b_k⟩_{r(b_k)} = (1/γ_k) ∂ log C(γ_k, μ_k) / ∂μ_k, (19)\n\nand define the variational parameter θ̄ by θ̄ = ⟨θ⟩_{r(θ)} = {ā_k, b̄_k}_{k=1}^{K}. Then it is noted that the variational posterior r(θ) and C_Q in (15) are parameterized by the variational parameter θ̄. Therefore, we denote them as r(θ|θ̄) and C_Q(θ̄) henceforth. We define the variational estimator θ̄_vb by the variational parameter θ̄ that attains the minimum value of the stochastic complexity F̄(X^n). Then, putting (15) into (12), we obtain\n\nF̄(X^n) = min_{θ̄} {K(r(θ|θ̄)||φ(θ)) - (log C_Q(θ̄) + S(X^n))} (20)\n\n= K(r(θ|θ̄_vb)||φ(θ)) - (log C_Q(θ̄_vb) + S(X^n)), (21)\n\nwhere S(X^n) = -Σ_{i=1}^{n} log p_0(x_i). Therefore, our aim is to evaluate the minimum value of (20) as a function of the variational parameter θ̄.\n\nFootnote 2: ⟨·⟩_{p(x)} denotes the expectation over p(x).\n\nFootnote 3: Hereafter, we omit the condition X^n of the variational posteriors, and abbreviate them to q(Y^n, θ), Q(Y^n) and r(θ).\n\n5 Main Result\n\nThe average stochastic complexity F̄(n) in the VB learning is defined by\n\nF̄(n) = E_{X^n}[F̄(X^n)]. (22)\n\nWe assume the following conditions.\n\n(i) The true distribution p_0(x) is an MEF model p(x|θ_0) which has K_0 components and the parameter θ_0 = {a*_k, b*_k}_{k=1}^{K_0},\n\np(x|θ_0) = Σ_{k=1}^{K_0} a*_k exp{b*_k·f(x) + f_0(x) - g(b*_k)},\n\nwhere b*_k ∈ R^M and b*_k ≠ b*_j (k ≠ j). Suppose also that the model p(x|θ) has K components,\n\np(x|θ) = Σ_{k=1}^{K} a_k exp{b_k·f(x) + f_0(x) - g(b_k)},\n\nand K ≥ K_0 holds.\n\n(ii) The prior distribution of the parameters is φ(θ) = φ(a)φ(b) given by (2) and (3) with φ(b) bounded. 
(iii) Regarding the distribution c(x|b) of each component, the Fisher information matrix I(b) = ∂²g(b)/∂b∂b satisfies 0 < |I(b)| < +∞ for arbitrary b ∈ B (footnote 4). The function μ·b - g(b) has a stationary point at b̂ in the interior of B for each μ ∈ {∂g(b)/∂b | b ∈ B}.\n\nFootnote 4: ∂²g(b)/∂b∂b denotes the matrix whose ijth entry is ∂²g(b)/∂b^(i)∂b^(j), and |·| denotes the determinant of a matrix.\n\nUnder these conditions, we prove the following.\n\nTheorem 2 (Main Result) Assume the conditions (i), (ii) and (iii). Then the average stochastic complexity F̄(n) defined by (22) satisfies\n\nλ̲ log n + E_{X^n}[n H_n(θ̄_vb)] + C_1 ≤ F̄(n) ≤ λ̄ log n + C_2, (23)\n\nfor an arbitrary natural number n, where C_1, C_2 are constants independent of n and\n\nλ̲ = (K - 1)φ_0 + M/2 if φ_0 ≤ (M + 1)/2, λ̲ = (MK + K - 1)/2 if φ_0 > (M + 1)/2,\n\nλ̄ = (K - K_0)φ_0 + (MK_0 + K_0 - 1)/2 if φ_0 ≤ (M + 1)/2, λ̄ = (MK + K - 1)/2 if φ_0 > (M + 1)/2. (24)\n\nThis theorem shows the asymptotic form of the average stochastic complexity in the Variational Bayesian learning. The coefficients λ̲, λ̄ of the leading terms are determined by K and K_0, the numbers of components of the learner and the true distribution, by the number of parameters M of each component, and by the hyperparameter φ_0 of the conjugate prior given by (2). In this theorem, n H_n(θ̄_vb) = -Σ_{i=1}^{n} log p(x_i|θ̄_vb) - S(X^n), and -Σ_{i=1}^{n} log p(x_i|θ̄_vb) is a training error which is computable during the learning. If the term E_{X^n}[n H_n(θ̄_vb)] is a bounded function of n, then it immediately follows from this theorem that\n\nλ̲ log n + O(1) ≤ F̄(n) ≤ λ̄ log n + O(1),\n\nwhere O(1) is a bounded function of n. In certain cases, such as binomial mixtures and mixtures of von Mises distributions, the term E_{X^n}[n H_n(θ̄_vb)] is actually a bounded function of n. In the case of Gaussian mixtures, if B = R^N, it is conjectured that the minus likelihood ratio min_θ n H_n(θ), a lower bound of n H_n(θ̄_vb), is at most of the order of log log n [5]. 
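The coefficients in (24) are simple to evaluate numerically. The sketch below is an illustration only: the function name vb_coefficients and the example values are assumptions, and phi0 stands for the hyperparameter φ_0 of the prior (2).

```python
def vb_coefficients(K, K0, M, phi0):
    # Returns (lower, upper): the coefficients of log n in the lower and
    # upper bounds of (23), following the two cases of (24).
    if K < K0:
        raise ValueError('the model must have at least K0 components')
    full = (M * K + K - 1) / 2.0
    if phi0 <= (M + 1) / 2.0:
        lower = (K - 1) * phi0 + M / 2.0
        upper = (K - K0) * phi0 + (M * K0 + K0 - 1) / 2.0
    else:
        lower = upper = full
    return lower, upper

# Hypothetical setting: K = 3 model components, K0 = 2 true components,
# M = 2 parameters per component.
print(vb_coefficients(K=3, K0=2, M=2, phi0=1.0))
print(vb_coefficients(K=3, K0=2, M=2, phi0=2.0))
```

In the regime φ_0 ≤ (M + 1)/2 the upper coefficient grows with the number of redundant components K - K_0, whereas for larger φ_0 both coefficients equal (MK + K - 1)/2, half the number of parameters of the model.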
Since the dimension of the parameter θ is MK + K - 1, the average stochastic complexity of regular statistical models, which coincides with the Bayesian information criterion (BIC) [9], is given by λ_BIC log n where λ_BIC = (MK + K - 1)/2. Theorem 2 claims that the coefficient λ̄ of log n is no larger than λ_BIC when φ_0 ≤ (M + 1)/2, and strictly smaller if in addition φ_0 < (M + 1)/2 and K > K_0. This implies that the advantage of non-regular models in the Bayesian learning still remains in the VB learning.\n\n(Outline of the proof of Theorem 2) From the condition (iii), calculating C(γ_k, μ_k) in (17) by the saddle point approximation, K(r(θ|θ̄)||φ(θ)) in (20) is evaluated as follows (footnote 5),\n\nK(r(θ|θ̄)||φ(θ)) = G(ā) - Σ_{k=1}^{K} log φ(b̄_k) + O_p(1), (25)\n\nwhere the function G(ā) of ā = {ā_k}_{k=1}^{K} is given by\n\nG(ā) = ((MK + K - 1)/2) log n + {M/2 - (φ_0 - 1/2)} Σ_{k=1}^{K} log ā_k. (26)\n\nThen log C_Q(θ̄) in (20) is evaluated as follows,\n\nn H_n(θ̄) + O_p(1) ≤ -(log C_Q(θ̄) + S(X^n)) ≤ n H̄_n(θ̄) + O_p(1), (27)\n\nwhere\n\nH̄_n(θ̄) = (1/n) Σ_{i=1}^{n} log [ p(x_i|θ_0) / Σ_{k=1}^{K} ā_k c(x_i|b̄_k) exp{-C / (n̄_k + min{φ_0, ξ_0})} ],\n\nand C is a constant. Thus, from (20), evaluating the right-hand sides of (25) and (27) at specific points near the true parameter θ_0, we obtain the upper bound in (23). The lower bound in (23) is obtained from (25) and (27) by Jensen's inequality and the constraint Σ_{k=1}^{K} ā_k = 1. (Q.E.D.)\n\n6 Discussion and Conclusion\n\nIn this paper, we showed the upper and lower bounds of the stochastic complexity for the mixture of exponential family models in the VB learning. Firstly, we compare the stochastic complexity shown in Theorem 2 with the one in the true Bayesian learning. On the mixture models with M parameters in each component, the following upper bound for the coefficient λ of F(n) in (10) is known [13],\n\nλ ≤ (K + K_0 - 1)/2 (M = 1), λ ≤ (K - K_0) + (MK_0 + K_0 - 1)/2 (M ≥ 2). (28)\n\nUnder the conditions on the prior distribution under which the above bound was derived, we can compare the stochastic complexities when φ_0 = 1. 
Putting φ_0 = 1 in (24), we have\n\nλ̄ = K - K_0 + (MK_0 + K_0 - 1)/2. (29)\n\nFootnote 5: O_p(1) denotes a random variable bounded in probability.\n\nSince we obtain F̄(n) ≃ λ̄ log n + O(1) under certain assumptions [11], let us compare λ̄ of the VB learning to λ in (28) of the true Bayesian learning. When M = 1, that is, each component has one parameter, λ̄ ≥ λ holds since K_0 ≤ K. This means that the more redundant components the model has, the more the VB learning differs from the true Bayesian learning. In this case, 2λ̄ is equal to the number of the parameters of the model. Hence the BIC [9] corresponds to λ̄ log n when M = 1. If M ≥ 2, the upper bound of λ is equal to λ̄. This implies that the variational posterior is close to the true Bayesian posterior when M ≥ 2. A more precise discussion of the accuracy of the approximation can be given for models on which tighter bounds or exact values of the coefficient λ in (10) are available [10]. Secondly, we point out that Theorem 2 shows how the hyperparameter φ_0 influences the process of the VB learning. The coefficient λ̄ in (24) indicates that only when φ_0 ≤ (M + 1)/2 does the prior distribution (2) work to eliminate the redundant components that the model has; otherwise it works to use all the components. And lastly, let us give examples of how to use the theoretical bounds in (23). One can examine experimentally whether the actual iterative algorithm converges to the optimal variational posterior instead of local minima by comparing the stochastic complexity with our theoretical result. The theoretical bounds would also enable us to compare the accuracy of the VB learning with that of the Laplace approximation or the MCMC method. As mentioned in Section 4, our result will be important for developing effective model selection methods using F̄(X^n) in future work.\n\nReferences\n[1] H.Akaike, \"Likelihood and Bayes procedure,\" Bayesian Statistics, (Bernardo J.M. eds.) University Press, Valencia, Spain, pp.143-166, 1980. 
[2] H.Attias, \"Inferring parameters and structure of latent variable models by variational bayes,\" Proc. of UAI, 1999.\n[3] L.D.Brown, \"Fundamentals of statistical exponential families,\" IMS Lecture Notes-Monograph Series, 1986.\n[4] Z.Ghahramani, M.J.Beal, \"Graphical models and variational methods,\" Advanced Mean Field Methods, MIT Press, 2000.\n[5] J.A.Hartigan, \"A failure of likelihood asymptotics for normal mixtures,\" Proc. of the Berkeley Conference in Honor of J.Neyman and J.Kiefer, Vol.2, pp.807-810, 1985.\n[6] D.J.MacKay, \"Bayesian interpolation,\" Neural Computation, 4(2), pp.415-447, 1992.\n[7] G.McLachlan, D.Peel, \"Finite mixture models,\" Wiley, 2000.\n[8] M.Sato, \"Online model selection based on the variational bayes,\" Neural Computation, 13(7), pp.1649-1681, 2001.\n[9] G.Schwarz, \"Estimating the dimension of a model,\" Annals of Statistics, 6(2), pp.461-464, 1978.\n[10] K.Watanabe, S.Watanabe, \"Lower bounds of stochastic complexities in variational bayes learning of gaussian mixture models,\" Proc. of IEEE CIS04, pp.99-104, 2004.\n[11] K.Watanabe, S.Watanabe, \"Stochastic complexity for mixture of exponential families in variational bayes,\" Proc. of ALT05, pp.107-121, 2005.\n[12] S.Watanabe, \"Algebraic analysis for non-identifiable learning machines,\" Neural Computation, 13(4), pp.899-933, 2001.\n[13] K.Yamazaki, S.Watanabe, \"Singularities in mixture models and upper bounds of stochastic complexity,\" Neural Networks, 16, pp.1029-1038, 2003.\n", "award": [], "sourceid": 2935, "authors": [{"given_name": "Kazuho", "family_name": "Watanabe", "institution": null}, {"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}