{"title": "On a Modification to the Mean Field EM Algorithm in Factorial Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 431, "page_last": 437, "abstract": null, "full_text": "On a Modification to the Mean Field EM \n\nAlgorithm in Factorial Learning \n\nA. P. Dunmur \n\nD. M. Titterington \n\nDepartment of Statistics \n\nMaths Building \n\nUniversity of Glasgow \nGlasgow G12 8QQ, UK \n\nalan~stats.gla.ac.uk \n\nmike~stats.gla.ac.uk \n\nAbstract \n\nA modification is described to the use of mean field approxima(cid:173)\ntions in the E step of EM algorithms for analysing data from latent \nstructure models, as described by Ghahramani (1995), among oth(cid:173)\ners. The modification involves second-order Taylor approximations \nto expectations computed in the E step. The potential benefits of \nthe method are illustrated using very simple latent profile models. \n\n1 \n\nIntroduction \n\nGhahramani (1995) advocated the use of mean field methods as a means to avoid the \nheavy computation involved in the E step of the EM algorithm used for estimating \nparameters within a certain latent structure model, and Ghahramani & Jordan \n(1995) used the same ideas in a more complex situation. Dunmur & Titterington \n(1996a) identified Ghahramani's model as a so-called latent profile model, they \nobserved that Zhang (1992,1993) had used mean field methods for a similar purpose, \nand they showed, in a simulation study based on very simple examples, that the \nmean field version of the EM algorithm often performed very respectably. 
By this it is meant that, when data were generated from the model under analysis, the estimators of the underlying parameters were efficient, judging by empirical results, especially in comparison with estimators obtained by employing the 'correct' EM algorithm: the examples therefore had to be simple enough that the correct EM algorithm is numerically feasible, although any success reported for the mean field version is, one hopes, an indication that the method will also be adequate in more complex situations in which the correct EM algorithm is not implementable because of computational complexity. \n\nIn spite of the above positive remarks, there were circumstances in which there was a perceptible, if not dramatic, lack of efficiency in the simple (naive) mean field estimators, and the objective of this contribution is to propose and investigate ways of refining the method so as to improve performance without detracting from the appealing, and frequently essential, simplicity of the approach. The procedure used here is based on a second-order correction to the naive mean field that is well known in statistical physics and sometimes called the cavity or TAP method (Mezard, Parisi & Virasoro, 1987). It has been applied recently in cluster analysis (Hofmann & Buhmann, 1996). In Section 2 we introduce the structure of our model, Section 3 explains the refined mean field approach, Section 4 provides numerical results, and Section 5 contains a statement of our conclusions. \n\n2 The Model \n\nThe model under study is a latent profile model (Henry, 1983), which is a latent structure model involving continuous observables {x_r : r = 1 ... p} and discrete latent variables {y_i : i = 1 ... d}. The y_i are represented by indicator vectors such that for each i there is a single j such that y_ij = 1 and y_ik = 0 for all k ≠ j. 
The latent variables are connected to the observables by a set of weight matrices W_i, in such a way that the distribution of the observations given the latent variables is a multivariate Gaussian with mean Σ_i W_i y_i and covariance matrix Γ. To ease the notation, the covariance matrix is taken to be the identity matrix, although extension is quite easy to the case where Γ is a diagonal matrix whose elements have to be estimated (Dunmur & Titterington, 1996a). Also to simplify the notation, the marginal distributions of the latent variables are taken to be uniform, so that the totality of unknown parameters is made up of the set of weight matrices, denoted by W = (W_1, W_2, ..., W_d). \n\n3 Methodology \n\nIn order to learn about the model we have available a dataset V = {x^μ : μ = 1 ... N} of N independent, p-dimensional realizations of the model, and we adopt the Maximum Likelihood approach to the estimation of the weight matrices. As is typical of latent structure models, there is no explicit Maximum Likelihood estimate of the parameters of the model, but there is a version of the EM algorithm (Dempster, Laird & Rubin, 1977) that can be used to obtain the estimates numerically. The EM algorithm consists of a sequence of double steps, E steps and M steps. At stage m the E step, based on the current estimate W^{m-1} of the parameters, calculates \n\nQ(W | W^{m-1}) = ⟨C_c(W)⟩, \n\nwhere the expectation ⟨·⟩ is over the latent variables y, and is conditional on V and W^{m-1}, and C_c denotes the crucial part of the complete-data log-likelihood, given by \n\nC_c(W) = -(1/2) Σ_μ (x^μ - Σ_i W_i y_i^μ)^T (x^μ - Σ_i W_i y_i^μ). \n\nThe M step then maximizes Q with respect to W and gives the new parameter estimate W^m. For the simple model considered here, the M step gives \n\nW^m = (Σ_μ x^μ ⟨y^μ⟩^T)(Σ_μ ⟨y^μ (y^μ)^T⟩)^{-1}, \n\nwhere W = (W_1, W_2, ..., W_d), y^T = (y_1^T, y_2^T, ..., y_d^T) and, for brevity, explicit mention of the conditioned quantities in the expectations ⟨·⟩ has been omitted. 
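To make the E and M steps above concrete, the exact E step for this simple model can be sketched in code. This is an illustrative sketch, not the authors' implementation: all names are hypothetical, the covariance is taken as the identity as in Section 2, and the joint latent state space is enumerated directly, which is feasible only for small models.

```python
# Illustrative sketch (hypothetical names): exact posterior expectations
# <y_i> and <y_i y_j^T> for a latent profile model with identity covariance
# and uniform latent priors, obtained by enumerating all joint latent states.
import itertools
import numpy as np

def exact_e_step(x, Ws):
    # x: observation vector (length p); Ws: list of p-by-k_i weight matrices.
    d = len(Ws)
    sizes = [W.shape[1] for W in Ws]
    states = list(itertools.product(*[range(s) for s in sizes]))
    # Unnormalized log posterior: log p(y | x) = -0.5 * |x - sum_i W_i y_i|^2 + const
    logp = np.array([-0.5 * np.sum((x - sum(Ws[i][:, s[i]] for i in range(d))) ** 2)
                     for s in states])
    p = np.exp(logp - logp.max())
    p /= p.sum()
    # Accumulate first and second moments over the enumerated states.
    Ey = [np.zeros(si) for si in sizes]
    Eyy = {(i, j): np.zeros((sizes[i], sizes[j]))
           for i in range(d) for j in range(d)}
    for prob, s in zip(p, states):
        for i in range(d):
            Ey[i][s[i]] += prob
            for j in range(d):
                Eyy[(i, j)][s[i], s[j]] += prob
    return Ey, Eyy
```

The cost is proportional to the product of the latent-state counts, which is exactly the exponential growth that the mean field approximation of Section 3 is designed to avoid.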
This M-step formula differs somewhat from that given by Ghahramani (1995). \n\nHence we need to evaluate the sets of expectations ⟨y_i⟩ and ⟨y_i y_j^T⟩ for each example in the dataset. (The superscript μ is omitted, for clarity.) As pointed out in Ghahramani (1995), it is possible to evaluate these expectations directly by summing over all possible latent states. This has the disadvantage of becoming exponentially more expensive as the size of the latent space increases. \n\nThe mean field approximation is well known in physics and can be used to reduce the computational complexity. At its simplest level, the mean field approximation replaces the joint expectations of the latent variables by the products of the individual expectations; this can be interpreted as bounding the likelihood from below (Saul, Jaakkola & Jordan, 1996). Here we consider a second-order approximation, as outlined below. \n\nSince the latent variables are categorical, it is simple to sum over the state space of a single latent variable. Hence, following Parisi (1988), the expectations of the latent variables are given by \n\n⟨y_ij⟩ = ⟨f_j(ε_i)⟩, (1) \n\nwhere f_j(ε_i) is the jth component of the softmax function, exp(ε_ij)/Σ_k exp(ε_ik), and the expectation ⟨·⟩ is taken over the remaining latent variables. The vector ε_i = {ε_ij} contains the log probabilities (up to a constant) associated with each category of the latent variable for each example in the data set. For the simple model under study, ε_ij is given by \n\nε_ij = {W_i^T (x - Σ_{k≠i} W_k y_k)}_j - (1/2)(W_i^T W_i)_{jj}. (2) \n\nThe expectation in (1) can be expanded in a Taylor series about the average, ⟨ε_i⟩, giving \n\n⟨f_j(ε_i)⟩ ≈ f_j(⟨ε_i⟩) + (1/2) Σ_{k,l} [∂²f_j/∂ε_ik ∂ε_il]_{ε_i = ⟨ε_i⟩} ⟨Δε_ik Δε_il⟩, (3) \n\nwhere Δε_ij = ε_ij - ⟨ε_ij⟩; the first-order term vanishes because ⟨Δε_ij⟩ = 0. The naive mean field approximation simply ignores all corrections. We can postulate that the second-order fluctuations are taken care of by a so-called cavity field (see, for instance, Mezard, Parisi & Virasoro, 1987, p. 16), that is, \n\n⟨y_ij⟩ = f_j(⟨ε_i⟩ + h_i), (4) \n\nwhere the vector of fields h_i = {h_ik} has been introduced to take care of the correction terms. This equation may also be expanded in a Taylor series, to first order in h_i. Then, equating coefficients with (3) and after a little algebra, we get \n\n(5) \n\nwhere δ_jk is the Kronecker delta and, for the model under consideration, \n\n⟨Δε_ij Δε_ik⟩ = (W_i^T Σ_{m,n≠i} W_m (⟨y_m y_n^T⟩ - ⟨y_m⟩⟨y_n⟩^T) W_n^T W_i)_{jk}. (6) \n\nThe naive mean field assumption may be used in (6), giving \n\n⟨Δε_ij Δε_ik⟩ = (W_i^T Σ_{m≠i} W_m (diag(⟨y_m⟩) - ⟨y_m⟩⟨y_m⟩^T) W_m^T W_i)_{jk}. (7) \n\nWithin the E step, for each realization in the data set, the mean fields (4), along with the cavity fields (5), can be evaluated by an iterative procedure which gives the individual expectations of the latent variables. The naive mean field approximation (7) is then used to evaluate the joint expectations ⟨y_i y_j^T⟩. In the next section we report, for a simple model, the effect on parameter estimation of the use of cavity fields. \n\n4 Results \n\nSimulations were carried out using latent variable models (i) with 5 observables and 4 binary hidden variables and (ii) with 5 observables and 3 3-state hidden variables. The weight matrices were generated from zero-mean Gaussian variables with standard deviation w. In order to make the M step trivial it was assumed that the matrices were known up to the scale parameter w, and this is the parameter estimated by the algorithm. A data set was generated using the known parameter, and the parameter was then estimated using straight EM, naive mean field (MF) and mean field with cavity fields (MFcav). \n\nAlthough datasets of sizes 100 and 500 were generated, only the results for N = 500 are presented here since both scenarios showed the same qualitative behaviour. 
Also, the estimation algorithms were started from different initial positions; this too had no effect on the final estimates of the parameters. A representative selection of results follows; Table 1 shows the results for both the 5 x 4 x 2 model and the 5 x 3 x 3 model. \n\nTable 1: Results for N = 500, averaged over 50 simulations, for 5 observables with 4 binary latent variables (5 x 4 x 2) and for 5 observables with 3 3-state latent variables (5 x 3 x 3). The figures in brackets give the standard deviation of the estimates, West, in units according to the final decimal place. RMS is the root mean squared error of the estimate compared to the true value. \n\nMethod  Wtr  Winit | 5x4x2: West  RMS | 5x3x3: West  RMS \nEM      0.1  0.05  | 0.09(1)  0.014  | 0.10(2)  0.024 \nMF      0.1  0.05  | 0.09(1)  0.014  | 0.10(2)  0.023 \nMFcav   0.1  0.05  | 0.09(1)  0.014  | 0.10(2)  0.023 \nEM      0.5  0.1   | 0.49(2)  0.016  | 0.50(2)  0.019 \nMF      0.5  0.1   | 0.47(2)  0.029  | 0.46(2)  0.038 \nMFcav   0.5  0.1   | 0.48(2)  0.026  | 0.47(2)  0.032 \nEM      1.0  0.1   | 0.99(2)  0.016  | 1.00(2)  0.018 \nMF      1.0  0.1   | 0.96(2)  0.040  | 0.98(2)  0.032 \nMFcav   1.0  0.1   | 0.99(2)  0.018  | 1.00(2)  0.018 \nEM      2.0  0.2   | 1.99(1)  0.014  | 1.99(1)  0.015 \nMF      2.0  0.2   | 1.98(2)  0.021  | 1.98(2)  0.023 \nMFcav   2.0  0.2   | 1.97(2)  0.027  | 1.97(2)  0.031 \nEM      5.0  0.1   | 4.99(1)  0.013  | 5.00(1)  0.013 \nMF      5.0  0.1   | 4.99(1)  0.016  | 4.96(2)  0.047 \nMFcav   5.0  0.1   | 4.97(2)  0.032  | 4.88(3)  0.114 \n\nThe results show that, when the true value, Wtr, of the parameter was small, there is little difference among the three methods. This is due to the fact that at these small values there is little separation among the mixtures that are used to generate the data, and hence the methods are all equally good (or bad) at estimating the parameter. 
As the true parameter increases and becomes close to one, the cavity field method performs significantly better than naive mean field; in fact it performs as well as EM for Wtr = 1. \n\nFor values of Wtr greater than one, the cavity field method performs less well than naive mean field. This suggests that the Taylor expansions (3) and (5) no longer provide reliable approximations. Since, from (2), \n\nΔε_ij = -{W_i^T Σ_{k≠i} W_k (y_k - ⟨y_k⟩)}_j, (8) \n\nit is easy to see that if the elements in W_i are much less than one then the corrections to the Taylor expansion are small, and hence the cavity fields are small, so the approximations hold. If W is much larger than one, then the mean field estimates become closer to zero and one, since the energies ε_ij (equation (2)) become more extreme. Hence if the mean fields correctly estimate the latent variables the corrections are indeed small, but if the mean fields incorrectly estimate the latent variables the error term is substantial, leading to a reduction in performance. \n\nAnother simulation, similar to that presented in Ghahramani (1995), was also studied to compare the modification to both the 'correct' EM and the naive mean field. The model has two latent variables that correspond to either horizontal or vertical lines in one of four positions. These are combined, and zero-mean Gaussian noise is added, to produce a 4 x 4 example image. A data set is created from many of these examples with the latent variables chosen at random. From the data set the weight matrices connecting the observables to the latent variables are estimated and compared to the true weights, which consist of zeros and ones. Typical results for a sample size of 160 and Gaussian noise of variance 0.2 are presented in Figure 1. The numbers of iterations needed to converge were similar for all three methods. 
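For comparison with the results above, the naive mean field E step can be sketched as a softmax fixed-point iteration on equations (1) and (2). This is an illustrative sketch with hypothetical names, omitting the cavity-field correction and the M step.

```python
# Illustrative sketch (hypothetical names): naive mean field E step for one
# observation x, iterating m_i = softmax(eps_i) with
# eps_ij = {W_i^T (x - sum_{k != i} W_k m_k)}_j - 0.5 * (W_i^T W_i)_{jj}.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift by the max for numerical stability
    return e / e.sum()

def mean_field_e_step(x, Ws, n_iter=50):
    d = len(Ws)
    m = [np.ones(W.shape[1]) / W.shape[1] for W in Ws]  # start from uniform
    for _ in range(n_iter):
        for i, Wi in enumerate(Ws):
            # Residual after subtracting the other latent variables' mean fields.
            resid = x - sum(Ws[k] @ m[k] for k in range(d) if k != i)
            eps = Wi.T @ resid - 0.5 * np.diag(Wi.T @ Wi)
            m[i] = softmax(eps)
    return m
```

Each sweep costs only a sum over the categories of each latent variable in turn, rather than a sum over the joint state space; the cavity correction would add the fields h_i of equation (4) inside the softmax.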
\n\nEM \n0.158 \n\nMF \n0.154 \n\nMFcav \n0.153 \n\nliD \n\nFigure 1: Estimated weights for a sample size of N = 160 and noise variance 0.2 \nadded. The three rows correspond to EM, naive MF and MF cay respectively. The \nnumber on the left hand end of the row is the mean squared error of the estimated \nweights compared with the true weights. The first four images are the estimates of \nthe first latent vector and the remaining four images are the estimates of second \nlatent vector. \n\nAs can be seen from Figure 1 there is very little difference between the estimates of \nthe weights and the mean squared errors are all very close. The mean field method \nconverged in approximately four iterations which means that for this simple model \nthe MF E step is taking approximately 32 steps as compared to 16 steps for the \nstraight EM. This is due to the simplicity of the latent structure for this model. For \na more complicated model the MF algorithm should take fewer iterations. Again \nthe results are encouraging for the MF method, but they do not show any obvious \nbenefit from using the cavity field correction terms. \n\n5 Conclusion \n\nThe cavity field method can be applied successfully to improve the performance \nof naive mean field estimates. However, care must be taken when the corrections \nbecome large and actually degrade the performance. Predicting the failure modes of \nthe algorithm may become harder for larger (more realistic) models. The message \nseems to be that where the mean field does well the cavity fields will improve the \nsituation, but where the mean field performs less well the cavity fields can degrade \nperformance. This suggests that the cavity fields could be used as a check on \nthe mean field method. Where the cavity fields are small we can be reasonably \nconfident that the mean field is producing sensible answers. 
However, where the cavity fields become large it is likely that the mean field is no longer producing accurate estimates. \n\nFurther work would consider larger simulations using more realistic models. It might no longer be feasible to compare these simulations with the 'correct' EM algorithm as the size of the model increases, though other techniques such as Gibbs sampling could be used instead. It would also be interesting to look at the next level of approximation where, instead of approximating the joint expectations by the product of the individual expectations in equation (6), the joint expectations are evaluated by summing over the joint state space (cf. equation (1)) and possibly evaluating the corresponding cavity fields (Dunmur & Titterington, 1996b). This would perhaps improve the quality of the approximation without introducing the exponential complexity associated with the full E step. \n\nAcknowledgements \n\nThis research was supported by a grant from the UK Engineering and Physical Sciences Research Council. \n\nReferences \n\nDEMPSTER, A. P., LAIRD, N. M. & RUBIN, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B 39, 1-38. \n\nDUNMUR, A. P. & TITTERINGTON, D. M. (1996a). Parameter estimation in latent structure models. Tech. Report 96-2, Dept. Statist., Univ. Glasgow. \n\nDUNMUR, A. P. & TITTERINGTON, D. M. (1996b). Higher order mean field approximations. In preparation. \n\nGHAHRAMANI, Z. (1995). Factorial learning and the EM algorithm. In Advances in Neural Information Processing Systems 7, Eds. G. Tesauro, D. S. Touretzky & T. K. Leen. Cambridge MA: MIT Press. \n\nGHAHRAMANI, Z. & JORDAN, M. I. (1995). Factorial hidden Markov models. Computational Cognitive Science Technical Report 9502, MIT. \n\nHENRY, N. W. (1983). Latent structure analysis. 
In Encyclopedia of Statistical Sciences, Volume 4, Eds. S. Kotz, N. L. Johnson & C. B. Read, pp. 497-504. New York: Wiley. \n\nHOFMANN, T. & BUHMANN, J. M. (1996). Pairwise data clustering by deterministic annealing. Tech. Report IAI-TR-95-7, Institut für Informatik III, Universität Bonn. \n\nMEZARD, M., PARISI, G. & VIRASORO, M. A. (1987). Spin Glass Theory and Beyond. Lecture Notes in Physics, 9. Singapore: World Scientific. \n\nPARISI, G. (1988). Statistical Field Theory. Redwood City CA: Addison-Wesley. \n\nSAUL, L. K., JAAKKOLA, T. & JORDAN, M. I. (1996). Mean field theory for sigmoid belief networks. J. Artificial Intelligence Research 4, 61-76. \n\nZHANG, J. (1992). The mean field theory in EM procedures for Markov random fields. IEEE Trans. Signal Processing 40, 2570-83. \n\nZHANG, J. (1993). The mean field theory in EM procedures for blind Markov random field image restoration. IEEE Trans. Image Processing 2, 27-40. \n", "award": [], "sourceid": 1218, "authors": [{"given_name": "A.", "family_name": "Dunmur", "institution": null}, {"given_name": "D.", "family_name": "Titterington", "institution": null}]}