{"title": "Parameter Expanded Variational Bayesian Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1097, "page_last": 1104, "abstract": null, "full_text": "Parameter Expanded Variational Bayesian Methods\n\nYuan (Alan) Qi MIT CSAIL 32 Vassar Street Cambridge, MA 02139 alanqi@csail.mit.edu\n\nTommi S. Jaakkola MIT CSAIL 32 Vassar Street Cambridge, MA 02139 tommi@csail.mit.edu\n\nAbstract\nBayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameter-expanded data augmentation (PX-DA). Similar to PX-EM and PX-DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination.\n\n1 Introduction\nA number of approximate Bayesian methods have been proposed to offset the high computational cost of exact Bayesian calculations. Variational Bayes (VB) is one popular method of approximation. Given a target probability distribution, variational Bayesian methods approximate the target distribution with a factored distribution. While factoring omits dependencies present in the target distribution, the parameters of the factored approximation can be adjusted to improve the match. 
Specifically, the approximation is optimized by minimizing the KL-divergence between the factored distribution and the target. This minimization can often be carried out iteratively, one component update at a time, even though the target distribution may not lend itself to exact Bayesian calculations. Variational Bayesian approximations have been widely used in Bayesian learning (e.g., Jordan et al., 1998; Beal, 2003; Bishop & Tipping, 2000). Variational Bayesian methods nevertheless suffer from slow convergence when the variables in the factored approximation are actually strongly coupled in the original model. The same problem arises in the popular Gibbs sampling algorithm: the sampling process converges slowly when the variables are strongly correlated. The slow convergence can be alleviated by data augmentation (van Dyk & Meng, 2001; Liu & Wu, 1999), where the idea is to identify an optimal reparameterization (within a family of possible reparameterizations) so as to remove coupling. Similarly, in a deterministic context, Liu et al. (1998) proposed over-parameterization of the model to speed up EM convergence. Our work here is inspired by DA sampling and PX-EM. Our approach uses auxiliary parameters to speed up the deterministic approximation of the target distribution. Specifically, we propose the Parameter-eXpanded Variational Bayesian (PX-VB) method. The original model is modified by auxiliary parameters that are optimized in conjunction with the variational approximation. The optimization of the auxiliary parameters corresponds to a parameterized joint optimization of the variational components; the role of the new updates is precisely to remove otherwise strong functional couplings between the components, thereby facilitating fast convergence.\n\n2 An illustrative example\nConsider a toy Bayesian model, which has been considered by Liu and Wu (1999) for sampling. 
$p(y \mid w, z) = \mathcal{N}(y \mid w + z, 1), \quad p(z) = \mathcal{N}(z \mid 0, D)$ (1)\nwhere $D$ is a known hyperparameter and $p(w) \propto 1$. The task is to compute the posterior distribution of $w$. Suppose we use a VB method to approximate $p(w \mid y)$, $p(z \mid y)$ and $p(w, z \mid y)$ by $q(w)$, $q(z)$ and $q(w, z) = q(w) q(z)$, respectively. The approximation is optimized by minimizing $KL(q(w) q(z) \,\|\, p(y \mid w, z) p(z))$ (the second argument need not be normalized). The general forms of the component updates are given by\n$q(w) \propto \exp(\langle \ln p(y \mid w, z) p(z) \rangle_{q(z)})$ (2)\n$q(z) \propto \exp(\langle \ln p(y \mid w, z) p(z) \rangle_{q(w)})$ (3)\nIt is easy to derive the updates in this case:\n$q(w) = \mathcal{N}(w \mid y - \langle z \rangle, 1), \quad q(z) = \mathcal{N}(z \mid (y - \langle w \rangle)/(1 + D^{-1}), \; 1/(1 + D^{-1}))$ (4)\nNow let us analyze the convergence of the mean parameter of $q(w)$, $\langle w \rangle = y - \langle z \rangle$. Iteratively (with the previous iterate of $\langle w \rangle$ on the right-hand side),\n$\langle w \rangle = (D^{-1} y + \langle w \rangle)/(1 + D^{-1}) = D^{-1} ((1 + D^{-1})^{-1} y + (1 + D^{-1})^{-2} y + \cdots) = y.$\nThe variational estimate $\langle w \rangle$ converges to $y$, which is in fact the true posterior mean (for this toy problem, $p(w \mid y) = \mathcal{N}(w \mid y, 1 + D)$). Furthermore, if $D$ is large, $\langle w \rangle$ converges slowly. Note that the variance parameter of $q(w)$ converges to 1 in one iteration, though it underestimates the true posterior variance $1 + D$. Intuitively, the convergence speed of $\langle w \rangle$ and $q(w)$ suffers from strong coupling between the updates of $w$ and $z$. In other words, the update information has to go through a feedback loop $w \to z \to w \to \cdots$. To alleviate the coupling, we expand the original model with an additional parameter $\alpha$:\n$p(y \mid w, z) = \mathcal{N}(y \mid w + z, 1), \quad p(z \mid \alpha) = \mathcal{N}(z \mid \alpha, D)$ (5)\nThe expanded model reduces to the original one when $\alpha$ equals the null value $\alpha_0 = 0$. Now, having computed $q(z)$ given $\alpha = 0$, we minimize $KL(q(w) q(z) \,\|\, p(y \mid w, z) p(z \mid \alpha))$ over $\alpha$ and obtain the minimizer $\alpha = \langle z \rangle$. Then, we reduce the expanded model to the original one by applying the reduction rule $z^{new} = z - \alpha = z - \langle z \rangle$, $w^{new} = w + \alpha = w + \langle z \rangle$. 
Correspondingly, we change the measures of $q(w)$ and $q(z)$:\n$q(w + \langle z \rangle) \to q(w^{new}) = \mathcal{N}(w^{new} \mid y, 1), \quad q(z - \langle z \rangle) \to q(z^{new}) = \mathcal{N}(z^{new} \mid 0, 1/(1 + D^{-1}))$ (6)\nThus, the PX-VB method converges: the mean of $q(w^{new})$ already equals the posterior mean $y$. Here $\alpha$ breaks the update loop between $q(w)$ and $q(z)$ and plays the role of a correction force; it corrects the update trajectories of $q(w)$ and $q(z)$ and makes them point directly to the convergence point.\n\n3 The PX-VB Algorithm\nIn the general PX-VB formulation, we over-parameterize the model $p(\hat{x}, D)$ to get $p_\alpha(x, D)$, where the original model is recovered for some default values of the auxiliary parameters $\alpha = \alpha_0$. The algorithm consists of the typical VB updates relative to $p_\alpha(x, D)$, the optimization of the auxiliary parameters $\alpha$, and a reduction step that turns the model back to the original form where $\alpha = \alpha_0$. This last reduction step has the effect of jointly modifying the components of the factored variational approximation. Put another way, we push the change in $p_\alpha(x, D)$, due to the optimization of $\alpha$, into the variational approximation instead. Changing the variational approximation in this manner permits us to return the model to its original form and set $\alpha = \alpha_0$.\n\nSpecifically, we first expand $p(\hat{x}, D)$ to obtain $p_\alpha(x, D)$. Then at the $t$th iteration,\n1. The components $q(x_s)$ are updated sequentially. Note that the approximate distribution is $q(x) = \prod_s q(x_s)$.\n2. We minimize $KL(q(x) \,\|\, p_\alpha(x, D))$ over the auxiliary parameters $\alpha$. This optimization can be done jointly with some components of the variational distribution, if feasible.\n3. The expanded model is reduced to the original model through reparameterization. Accordingly, we change $q^{(t+1)}(x)$ to $q^{(t+1)}(\hat{x})$ such that $KL(q^{(t+1)}(\hat{x}) \,\|\, p_{\alpha_0}(\hat{x}, D)) = KL(q(x) \,\|\, p_{\alpha^{(t+1)}}(x, D))$, where $q^{(t+1)}(\hat{x})$ are the modified components of the variational approximation.\n4. Set $\alpha = \alpha_0$. 
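The four steps above can be traced on the toy model of Section 2. The following is a minimal sketch, assuming a starting mean of zero for q(w) and a simple stopping rule on the change in that mean; the function names vb_toy and pxvb_toy, the tolerance and the iteration cap are illustrative choices made here, not part of the paper.

```python
def vb_toy(y, D, w0=0.0, tol=1e-10, max_iter=100000):
    '''Plain VB on the toy model; returns (mean of q(w), iterations used).'''
    w_mean = w0
    for it in range(1, max_iter + 1):
        z_mean = (y - w_mean) / (1.0 + 1.0 / D)  # mean of q(z), eq. (4)
        w_new = y - z_mean                       # mean of q(w), eq. (4)
        if abs(w_new - w_mean) < tol:
            return w_new, it
        w_mean = w_new
    return w_mean, max_iter


def pxvb_toy(y, D, w0=0.0, tol=1e-10, max_iter=100000):
    '''PX-VB on the toy model; returns (mean of q(w), iterations used).'''
    w_mean = w0
    for it in range(1, max_iter + 1):
        # Step 1: regular VB updates with alpha fixed at its null value 0
        z_mean = (y - w_mean) / (1.0 + 1.0 / D)
        w_new = y - z_mean
        # Step 2: the KL divergence over alpha is minimized at alpha = <z>
        alpha = z_mean
        # Step 3: reduction, w_new = w + alpha and z_new = z - alpha
        w_new = w_new + alpha
        z_mean = z_mean - alpha                  # the mean of q(z) becomes 0
        # Step 4: alpha is implicitly reset to 0 for the next pass
        if abs(w_new - w_mean) < tol:
            return w_new, it
        w_mean = w_new
    return w_mean, max_iter
```

With y = 2 and D = 100, the plain VB loop contracts the error by a factor 1/(1 + 1/D) per pass and needs well over a thousand passes to meet this tolerance, whereas the expanded loop reaches the fixed point y after its first update and only uses a second pass to detect that nothing changes.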
Since each update of PX-VB decreases or maintains the KL divergence $KL(q(x) \,\|\, p(x, D))$, which is lower bounded, PX-VB reaches a stationary point of $KL(q(x) \,\|\, p(x, D))$. Empirically, PX-VB often reaches a solution similar to that of VB, with faster convergence. A simple strategy to implement PX-VB is to use a mapping $S_\alpha$, parameterized by $\alpha$, over the variables $\hat{x}$. After sequentially optimizing over the components $\{q(x_s)\}$, we maximize $\langle \ln p_\alpha(x) \rangle_{q(x)}$ over $\alpha$. Then, we reduce $p_\alpha(x, D)$ to $p(\hat{x}, D)$ and $q(x)$ to $q(\hat{x})$ through the inverse mapping of $S_\alpha$, $M_\alpha \equiv S_\alpha^{-1}$. Since we optimize $\alpha$ after optimizing $\{q(x_s)\}$, the mapping $S_\alpha$ should change at least two components of $x$. Otherwise, the optimization over $\alpha$ will do nothing, since we have already optimized over each $q(x_s)$. If we jointly optimize $\alpha$ and one component $q(x_s)$, it suffices (although it need not be optimal) for the mapping $S_\alpha$ to change only $q(x_s)$. Algorithmically, PX-VB bears a strong similarity to PX-EM (Liu et al., 1998). They both expand the original model and both are based on lower bounding the KL-divergence. However, the key difference is that the reduction step in PX-VB changes the lower-bounding distributions $\{q(x_s)\}$, while in PX-EM the reduction step is performed only for the parameters in $p(x, D)$. We also note that the PX-VB reduction step via $M_\alpha$ leaves the KL-divergence (lower bound on the likelihood) invariant, while in PX-EM the likelihood of the observed data remains the same after the reduction. Because of these differences, general EM acceleration methods (e.g., Salakhutdinov et al., 2003) cannot be directly applied to speed up VB convergence. In the following sections, we present PX-VB methods for two popular Bayesian models: Probit regression for data classification and Automatic Relevance Determination (ARD) for feature selection and sparse learning. 
\n\n3.1 Bayesian Probit regression\nProbit regression is a standard classification technique (see, e.g., Liu et al. (1998) for maximum likelihood estimation). Here we demonstrate the use of variational Bayesian methods to train Probit models. The data likelihood for Probit regression is\n$p(t \mid X, w) = \prod_n \Phi(t_n w^T x_n)$ (7)\nwhere $X = [x_1, \ldots, x_N]$ and $\Phi$ is the standard normal cumulative distribution function. We can rewrite the likelihood in an equivalent form\n$p(t_n \mid z_n) = \mathrm{sign}(t_n z_n), \quad p(z_n \mid w, x_n) = \mathcal{N}(z_n \mid w^T x_n, 1)$ (8)\nGiven a Gaussian prior over the parameter, $p(w) = \mathcal{N}(w \mid 0, v_0 I)$, we wish to approximate the posterior distribution $p(w, z \mid X, t)$ by $q(w, z) = q(w) \prod_n q(z_n)$. Minimizing $KL(q(w) \prod_n q(z_n) \,\|\, p(w, z, t \mid X))$, we obtain the following VB updates:\n$q(z_n) = \mathcal{TN}(z_n \mid \langle w \rangle^T x_n, 1, t_n z_n)$ (9)\n$q(w) = \mathcal{N}(w \mid (X X^T + v_0^{-1} I)^{-1} X \langle z \rangle, \; (X X^T + v_0^{-1} I)^{-1})$ (10)\nwhere $\mathcal{TN}(z_n \mid \langle w \rangle^T x_n, 1, t_n z_n)$ stands for a truncated Gaussian that is proportional to $\mathcal{N}(z_n \mid \langle w \rangle^T x_n, 1)$ when $t_n z_n > 0$ and equals 0 otherwise.\n\nTo speed up the convergence of the above iterative updates, we apply the PX-VB method. First, we expand the original model $p(\hat{w}, \hat{z}, t \mid X)$ to $p_c(w, z, t \mid X)$ with the mapping\n$w = \hat{w} c, \quad z = \hat{z} c$ (11)\nsuch that\n$p_c(z_n \mid w, x_n) = \mathcal{N}(z_n \mid w^T x_n, c^2), \quad p_c(w) = \mathcal{N}(w \mid 0, c^2 v_0 I)$ (12)\nSetting $c = c_0 = 1$ in the expanded model, we update $q(z_n)$ and $q(w)$ as before, via (9) and (10). Then, we minimize $KL(q(z) q(w) \,\|\, p_c(w, z, t \mid X))$ over $c$, yielding\n$c^2 = \dfrac{1}{N + M} \left( \sum_n ( \langle z_n^2 \rangle - 2 \langle z_n \rangle \langle w \rangle^T x_n + x_n^T \langle w w^T \rangle x_n ) + v_0^{-1} \langle w^T w \rangle \right)$ (13)\nwhere $M$ is the dimension of $w$. In the degenerate case where $v_0 = \infty$, the denominator of the above equation becomes $N$ instead of $N + M$. Since this equation can be calculated efficiently, the extra computational cost induced by the auxiliary variable is small. We omit the details. The transformation back to $p_{c_0}$ can be made via the inverse map\n$\hat{w} = w / c, \quad \hat{z} = z / c.$ (14)\n
Accordingly, we change $q(w)$ to obtain a new posterior approximation $q_c(\hat{w})$:\n$q_c(\hat{w}) = \mathcal{N}(\hat{w} \mid (X X^T + v_0^{-1} I)^{-1} X \langle z \rangle / c, \; (X X^T + v_0^{-1} I)^{-1} / c^2)$ (15)\nWe do not actually need to compute $q_c(\hat{z}_n)$ if this component will be optimized next.\n\nBy changing variables from $w$ to $\hat{w}$ through (14), the KL divergence between the approximate and exact posteriors remains the same. After obtaining the new approximations $q_c(\hat{w})$ and $q(\hat{z}_n)$, we reset $c = c_0 = 1$ for the next iteration. Though similar to the PX-EM updates for the Probit regression problem (Liu et al., 1998), the PX-VB updates are geared towards providing an approximate posterior distribution. We use both synthetic data and kidney biopsy data (van Dyk & Meng, 2001) as numerical examples for probit regression. We set $v_0 = \infty$ in the experiment. The comparison of convergence speeds for VB and PX-VB is illustrated in figure 1.\n\nFigure 1: Comparison between VB and PX-VB for probit regression on synthetic (a) and kidney-biopsy (b) data sets. PX-VB converges significantly faster than VB. The Y axis shows $\log(\|w_{t+1} - w_t\|)$, the log difference between two consecutive estimates of the posterior mean of the parameter $w$; the X axis shows the number of iterations. [Plots not reproduced.]\n\nFor the synthetic data, we randomly sample a classifier and use it to define the data labels for sampled inputs. We have 100 training and 500 test data points, each of which has 20 features. The kidney data set has 55 data points, each of which is a 3-dimensional vector. On the synthetic data, PX-VB converges immediately while VB updates are slow to converge. Both PX-VB and VB trained classifiers achieve zero test error. On the kidney biopsy data set, PX-VB converges in 507 iterations, while VB converges in 7518 iterations. In other words, PX-VB requires about 15 times fewer iterations than VB. 
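The updates (9), (10) and (13), together with the reduction (15), can be sketched for the special case of a single feature, so that w is a scalar and M = 1. This is a hedged illustration only: the finite prior variance v0 = 10, the stopping rule and the function name probit_vb are assumptions made here, and the truncated-Gaussian moments used for q(z_n) are the standard ones for a unit-variance Gaussian truncated at zero.

```python
import math


def phi(a):
    # standard normal density
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def Phi(a):
    # standard normal CDF, written with erfc for better tail accuracy
    return 0.5 * math.erfc(-a / math.sqrt(2.0))


def probit_vb(x, t, v0=10.0, expand=False, tol=1e-9, max_iter=200000):
    '''VB (expand=False) or PX-VB (expand=True) for probit regression, scalar w.

    x: list of scalar inputs, t: list of labels in {-1, +1}.
    Returns (mean of q(w), iterations used).
    '''
    n = len(x)
    s = sum(xi * xi for xi in x) + 1.0 / v0   # scalar form of X X^T + I / v0
    w_var = 1.0 / s                           # variance of q(w), eq. (10)
    w_mean = 0.0
    for it in range(1, max_iter + 1):
        # q(z_n): first and second moments of the truncated Gaussian, eq. (9)
        z1, z2 = [], []
        for xi, ti in zip(x, t):
            mu = w_mean * xi
            ez = mu + ti * phi(mu) / Phi(ti * mu)
            z1.append(ez)
            z2.append(1.0 + mu * ez)
        # q(w): mean of the Gaussian update, eq. (10)
        m = sum(xi * zi for xi, zi in zip(x, z1)) / s
        if expand:
            w2 = w_var + m * m                # second moment of w under q(w)
            resid = sum(b - 2.0 * a * m * xi + xi * xi * w2
                        for xi, a, b in zip(x, z1, z2))
            c = math.sqrt((resid + w2 / v0) / (n + 1))  # eq. (13) with M = 1
            m = m / c                         # reduction of the mean, eq. (15)
            # the rescaled variance w_var / c**2 is not needed by the next pass
        if abs(m - w_mean) < tol:
            return m, it
        w_mean = m
    return w_mean, max_iter
```

Calling probit_vb(x, t, expand=False) and probit_vb(x, t, expand=True) on the same inputs should give nearly the same posterior mean, since the expansion is meant to change the path of the iterates rather than the solution they approach.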
In terms of CPU time, which reflects the extra computational cost induced by the auxiliary variables, PX-VB is 14 times more efficient. Among all these runs, PX-VB and VB achieve very similar estimates of the model parameters and the same prediction results. In sum, with a simple modification of the VB updates, we significantly improve the convergence speed of variational Bayesian estimation for the probit model.\n\n3.2 Automatic Relevance Determination\nAutomatic relevance determination (ARD) is a powerful Bayesian sparse learning technique (MacKay, 1992; Tipping, 2000; Bishop & Tipping, 2000). Here, we focus on the variational ARD proposed by Bishop and Tipping (2000) for sparse Bayesian regression and classification. The likelihood for ARD regression is\n$p(t \mid X, w, \tau) = \prod_n \mathcal{N}(t_n \mid w^T \phi_n, \tau^{-1})$\nwhere $\phi_n$ is a feature vector based on $x_n$, such as $[k(x_1, x_n), \ldots, k(x_N, x_n)]^T$, where $k(x_i, x_j)$ is a nonlinear basis function. For example, we can choose a radial basis function $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2 \sigma^2))$, where $\sigma$ is the kernel width. In ARD, we assign a Gaussian prior on the model parameters $w$: $p(w \mid \alpha) = \prod_{m=0}^{M} \mathcal{N}(w_m \mid 0, \alpha_m^{-1})$, where the inverse variance $\mathrm{diag}(\alpha)$ follows a factorized Gamma distribution:\n$p(\alpha) = \prod_m \mathrm{Gamma}(\alpha_m \mid a, b) = \prod_m b^a \alpha_m^{a-1} e^{-b \alpha_m} / \Gamma(a)$ (16)\nwhere $a$ and $b$ are hyperparameters of the model. The posterior does not have a closed form. Let us approximate $p(w, \alpha, \tau \mid X, t)$ by a factorized distribution $q(w, \alpha, \tau) = q(w) q(\alpha) q(\tau)$. The sequential VB updates on $q(\tau)$, $q(w)$ and $q(\alpha)$ are described by Bishop and Tipping (2000). The variational RVM achieves good generalization performance, as demonstrated by Bishop and Tipping (2000). However, its training based on the VB updates can be quite slow. We apply PX-VB to address this issue. First, we expand the original model $p(\hat{w}, \alpha, \tau \mid X, t)$ via\n$w = \hat{w} / r$ (17)\nwhile keeping $\alpha$ and $\tau$ unchanged. 
Consequently, the data likelihood and the prior on $w$ become\n$p_r(t \mid w, X, \tau) = \prod_n \mathcal{N}(t_n \mid r w^T \phi_n, \tau^{-1}), \quad p_r(w \mid \alpha) = \prod_{m=0}^{M} \mathcal{N}(w_m \mid 0, r^{-2} \alpha_m^{-1})$ (18)\nSetting $r = r_0 = 1$, we update $q(\tau)$ and $q(\alpha)$ as in regular VB. Then, we want to jointly optimize over $q(w)$ and $r$. Instead of performing a fully joint optimization, we optimize $q(w)$ and $r$ separately at the same time. This gives\n$r = \dfrac{g + \sqrt{g^2 + 16 M f}}{4 f}$ (19)\nwhere $f = \langle \tau \rangle \sum_n x_n^T \langle w w^T \rangle x_n + \sum_m \langle w_m^2 \rangle \langle \alpha_m \rangle$ and $g = 2 \langle \tau \rangle \sum_n \langle w \rangle^T x_n t_n$, and where $\langle w \rangle$ and $\langle w w^T \rangle$ are the first and second order moments of the previous $q(w)$. Since both $f$ and $X^T \langle w \rangle$ have been computed previously in the VB updates, the added computational cost for $r$ is negligible overall. The separate optimization over $q(w)$ and $r$ often decreases the KL divergence. However, it is not guaranteed to achieve a smaller KL divergence than optimizing over $q(w)$ alone. If the regular update over $q(w)$ achieves a smaller KL divergence, we reset $r = 1$. Given $r$ and $q(w)$, we use $\hat{w} = r w$ to reduce the expanded model to the original one. Correspondingly, we change $q(w) = \mathcal{N}(w \mid \mu_w, \Sigma_w)$ via this reduction rule to obtain $q_r(\hat{w}) = \mathcal{N}(\hat{w} \mid r \mu_w, r^2 \Sigma_w)$. We can also introduce another auxiliary variable $s$ such that $\alpha = \hat{\alpha} / s$. Similar to the above procedure, we optimize over $s$ the expected log joint probability of the expanded model, and at the same time update $q(\alpha)$. Then we change $q(\alpha)$ back to $q_s(\hat{\alpha})$ using the inverse mapping $\hat{\alpha} = s \alpha$. Due to space limitations, we skip the details here. The auxiliary variables $r$ and $s$ change the individual approximate posteriors $q(w)$ and $q(\alpha)$ separately. We can combine these two variables into one and use it to adjust $q(w)$ and $q(\alpha)$ jointly. 
Specifically, we introduce the variable $c$: $w = \hat{w} / c$, $\alpha = c^2 \hat{\alpha}$. Setting $c = c_0 = 1$, we perform the regular updates over $q(\tau)$, $q(w)$ and $q(\alpha)$. Then we optimize over $c$ the expected log joint probability of the expanded model. We cannot find a closed-form solution for the maximization, but we can efficiently compute its gradient and Hessian. Therefore, we perform a few steps of Newton updates to partially optimize $c$. Again, the additional computational cost for calculating $c$ is small. Then, using the inverse mapping, we reduce the expanded model to the original one and adjust both $q(w)$ and $q(\alpha)$ accordingly. Empirically, this approach can achieve faster convergence than using auxiliary variables on $q(w)$ and $q(\alpha)$ separately. This is demonstrated in figure 2(a) and (b).\n\nFigure 2: Convergence comparison between VB and PX-VB for ARD regression on synthetic data (a, b) and gene expression data (c). The PX-VB results in (a) and (c) are based on independent auxiliary variables on $w$ and $\alpha$. The PX-VB result in (b) is based on the auxiliary variable that correlates both $w$ and $\alpha$. The added computational cost for PX-VB in each iteration is negligible overall. Each panel plots $\log(\|w_{t+1} - w_t\|)$ against the number of iterations. [Plots not reproduced.]\n\nWe compare the convergence speed of VB and PX-VB for the ARD model on both synthetic data and gene expression data. The synthetic data are sampled from the function $\mathrm{sinc}(x) = (\sin x)/x$ for $x \in (-10, 10)$ with added Gaussian noise. We use RBF kernels for the feature expansion $\phi_n$ with kernel width 3. VB and PX-VB provide basically identical predictions. 
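The synthetic setup just described can be sketched as follows. The sample size, the noise standard deviation and the random seed are assumptions made for illustration (the text does not state them), and the helper names are hypothetical; only the sinc target, the input range (-10, 10) and the RBF kernel with width 3 come from the text.

```python
import math
import random


def sinc(x):
    # sinc(x) = sin(x) / x, with the removable singularity at 0 filled in
    return 1.0 if x == 0.0 else math.sin(x) / x


def rbf_kernel(xi, xj, width=3.0):
    # k(xi, xj) = exp(-(xi - xj)^2 / (2 * width^2)) for scalar inputs
    return math.exp(-((xi - xj) ** 2) / (2.0 * width ** 2))


def make_sinc_data(n=50, noise_std=0.1, seed=0):
    # n, noise_std and seed are illustrative assumptions, not values from the paper
    rng = random.Random(seed)
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    ts = [sinc(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ts


def design_matrix(xs, width=3.0):
    # row n holds the feature vector phi_n = [k(x_1, x_n), ..., k(x_N, x_n)]
    return [[rbf_kernel(xi, xn, width) for xi in xs] for xn in xs]
```

The rows of the resulting matrix play the role of the feature vectors phi_n in the ARD likelihood; any VB or PX-VB routine for ARD regression would take this matrix and the noisy targets as its inputs.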
For gene expression data, we apply ARD to analyze the relationship between binding motifs and the expression of their target genes. For this task, we use third-order polynomial kernels. The results of the convergence comparison are shown in figure 2. With a small modification of the VB updates, we increase the convergence speed significantly. Though we demonstrate the PX-VB improvement only for ARD regression, the same technique can be used to speed up ARD classification.\n\n4 Convergence properties of VB and PX-VB\nIn this section, we analyze the convergence of VB and PX-VB, and their convergence rates. Define the mapping $q^{(t+1)} = M(q^{(t)})$ as one VB update of all the approximate distributions. Define an objective function as the unnormalized KL divergence:\n$Q(q) = \int \prod_i q_i(x) \log \dfrac{\prod_i q_i(x)}{p(x)} \, dx + \left( \int p(x) \, dx - \int \prod_i q_i(x) \, dx \right)$ (20)\nIt is easy to check that minimizing $Q(q)$ gives the same updates as VB, which minimizes the KL divergence. Based on Theorem 2.1 by Luo and Tseng (1992), an iterative application of this mapping to minimize $Q(q)$ results in at least linear convergence to an element $q^*$ in the solution set. Define the mapping $q^{(t+1)} = M_x(q^{(t)})$ as one PX-VB update of all the approximate distributions. The convergence of PX-VB follows from similar arguments, i.e., the stacked vector $[q^T \alpha^T]^T$ converges to $[q^{*T} \alpha_0^T]^T$, where $\alpha$ are the expanded model parameters and $\alpha_0$ are the null values in the original model.\n\n4.1 Convergence rate of VB and PX-VB\nThe matrix rate of convergence $DM(q)$ is defined by\n$q^{(t+1)} - q^* = DM(q)^T (q^{(t)} - q^*)$ (21)\nwhere $DM(q) = (\partial M_j(q) / \partial q_i)$ is evaluated at $q = q^*$. Define the global rate of convergence for $q$: $r = \lim_{t \to \infty} \|q^{(t+1)} - q^*\| / \|q^{(t)} - q^*\|$. Under certain regularity conditions, $r$ equals the largest eigenvalue of $DM(q)$. The smaller $r$ is, the faster the algorithm converges. Define the constraint set $g_s$ as the constraints for the $s$th update. 
Then the following theorem holds:\n\nTheorem 4.1 The matrix convergence rate for VB is\n$DM(q^*) = \prod_{s=1}^{S} P_s$ (22)\nwhere $P_s = B_s [B_s^T (D^2 Q(q^*))^{-1} B_s]^{-1} B_s^T (D^2 Q(q^*))^{-1}$ and $B_s = ∇ g_s(q^*)$.\n\nProof: Define $\xi$ as the current approximation $q$. Let $G_s(\xi)$ be the $q_s$ that minimizes the objective function $Q(q)$ under the constraint $g_s(q) = g_s(\xi) = [\xi_{\setminus s}]$. Let $M_0(q) = q$ and\n$M_s(q) = G_s(M_{s-1}(q))$ for all $1 \le s \le S$. (23)\nThen by construction of VB, we have $q^{(t+s/S)} = M_s(q^{(t)})$, $s = 1, \ldots, S$, and $DM(q^*) = DM_S(q^*)$. At the stationary points, $q^* = M_s(q^*)$ for all $s$. We differentiate both sides of equation (23) and evaluate them at $q = q^*$:\n$DM_s(q^*) = DM_{s-1}(q^*) DG_s(M_{s-1}(q^*)) = DM_{s-1}(q^*) DG_s(q^*)$ (24)\nIt follows that $DM(q^*) = \prod_{s=1}^{S} DG_s(q^*)$.\n\nTo calculate $DG_s(q^*)$, we differentiate the constraint $g_s(G_s(\xi)) = g_s(\xi)$ and evaluate both sides at $\xi = q^*$, such that\n$DG_s(q^*) B_s = B_s.$ (25)\nSimilarly, we differentiate the Lagrange equation $DQ_s(G_s(\xi)) - ∇ g_s(G_s(\xi)) \lambda_s(\xi) = 0$ and evaluate both sides at $\xi = q^*$. This yields\n$DG_s(q^*) D^2 Q_s(q^*) - D\lambda_s(q^*) B_s^T = 0$ (26)\nEquation (26) holds because $\partial^2 g_s / \partial q_i \partial q_j = 0$. Combining (25) and (26) yields\n$DG_s(q^*) = B_s [B_s^T (D^2 Q_s(q^*))^{-1} B_s]^{-1} B_s^T (D^2 Q_s(q^*))^{-1}.$ (27)\n$\Box$\n\nIn the $s$th update we fix $q_{\setminus s}$, i.e., $g_s(q) = q_{\setminus s}$. Therefore, $B_s$ is an identity matrix with its $s$th column removed, $B_s = I_{:, \setminus s}$, where $I$ is the identity matrix and the subscript $:, \setminus s$ means without the $s$th column. Denote $C_s = (D^2 Q_s(q^*))^{-1}$. Without loss of generality, we set $s = S$. It is easy to obtain\n$B_S^T C B_S = C_{\setminus S, \setminus S}$ (28)\nwhere $\setminus S, \setminus S$ means without row $S$ and column $S$. Inserting (28) into (27) yields\n$P_S = DG_S(q^*) = \begin{pmatrix} I_{d-1} & C_{\setminus S, \setminus S}^{-1} C_{\setminus S, S} \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} I_{d-1} & -D^2 Q_{\setminus S, S} (D^2 Q_{S,S})^{-1} \\ 0 & 0 \end{pmatrix}$ (29)\nwhere $I_{d-1}$ is a $(d-1)$ by $(d-1)$ identity matrix, $D^2 Q_{\setminus S, S} = \partial^2 Q(q(x) \,\|\, p(x)) / \partial q_{\setminus S}^T \partial q_S$ and $D^2 Q_{S,S} = \partial^2 Q(q(x) \,\|\, p(x)) / \partial q_S^T \partial q_S$. Notice that we use Schur complements to obtain (29). Similar to the calculation of $P_S$ via (29), we can derive $P_s$ for $s = 1, \ldots, S - 1$ with structures similar to $P_S$.\n\nThe above results help us understand the convergence speed of VB. For example, we have\n$q^{(t+1)} - q^* = P_S^T \cdots P_1^T (q^{(t)} - q^*).$\nFor $q_S$,\n$q_S^{(t+1)} - q_S^* = \begin{pmatrix} -(D^2 Q_{S,S})^{-1} D^2 Q_{S, \setminus S} & 0 \end{pmatrix} (q^{(t+(S-1)/S)} - q^*).$ (30)\nClearly, if we view $D^2 Q_{S, \setminus S}$ as the correlation between $q_S$ and $q_{\setminus S}$, then the smaller the "correlation", the faster the convergence. In the extreme case, if there is no correlation between $q_S$ and $q_{\setminus S}$, then $q_S^{(t+1)} - q_S^* = 0$ after the first iteration. Since the global convergence rate is bounded by the maximal component convergence rate, and generally many components have the same convergence rate as the global rate, the instant convergence of $q_S$ could help increase the global convergence rate. For PX-VB, we can compute the matrix rate of convergence similarly. In the toy example in Section 2, PX-VB introduces an auxiliary variable $\alpha$ which has zero correlation with $w$, leading to an instant convergence of the algorithm. This suggests that PX-VB improves the convergence by reducing the correlation among $\{q_s\}$. Rigorously speaking, the reduction step in PX-VB implicitly defines a mapping from $q$ to $q_\alpha$ through the auxiliary variables $\alpha$: $(q, p_{\alpha_0}) \to (q, p_\alpha) \to (q_\alpha, p_{\alpha_0})$. Denote this mapping as $M_\alpha$ such that $q_\alpha = M_\alpha(q)$. Then we have $DM_x(q^*) = DG_1(q^*) \cdots DG_\alpha(q^*) \cdots DG_S(q^*)$. It is known that the spectral norm has the following submultiplicative property: $\|EF\| \le \|E\| \|F\|$, where $E$ and $F$ are two matrices. Thus, as long as the largest eigenvalue of $M_\alpha$ is smaller than 1, PX-VB converges faster than VB. The choice of $\alpha$ affects the convergence rate by controlling the eigenvalue of this mapping. The smaller the largest eigenvalue of $M_\alpha$, the faster PX-VB converges. 
In practice, we can check this eigenvalue to make sure the constructed PX-VB algorithm enjoys a fast convergence rate.\n\n5 Discussion\nWe have provided a general approach to speeding up the convergence of variational Bayesian learning. Faster convergence is guaranteed theoretically provided that the Jacobian of the transformation from auxiliary parameters to variational components has spectral norm bounded by one. This property can be verified in each case separately. Our empirical results show that the performance gain due to the auxiliary method is substantial.\n\nAcknowledgments\nT. S. Jaakkola was supported by the DARPA Transfer Learning program.\n\nReferences\nBeal, M. (2003). Variational algorithms for approximate Bayesian inference. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.\nBishop, C., & Tipping, M. E. (2000). Variational relevance vector machines. 16th UAI.\nJordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods in graphical models. Learning in Graphical Models. http://www.ai.mit.edu/~tommi/papers.html.\nLiu, C., Rubin, D. B., & Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85, 755-770.\nLiu, J. S., & Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94, 1264-1274.\nLuo, Z. Q., & Tseng, P. (1992). On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72, 7-35.\nMacKay, D. J. (1992). Bayesian interpolation. Neural Computation, 4, 415-447.\nSalakhutdinov, R., Roweis, S. T., & Ghahramani, Z. (2003). Optimization with EM and Expectation-Conjugate-Gradient. Proceedings of International Conference on Machine Learning.\nTipping, M. E. (2000). The relevance vector machine. NIPS (pp. 652-658). The MIT Press.\nvan Dyk, D. A., & Meng, X. L. (2001). The art of data augmentation (with discussion). 
Journal of Computational and Graphical Statistics, 10, 1-111.\n", "award": [], "sourceid": 3041, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Yuan", "family_name": "Qi", "institution": null}]}