{"title": "Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 859, "page_last": 865, "abstract": null, "full_text": "Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition

Ying Zhao, Richard Schwartz, Jason Sroka**, John Makhoul
BBN Systems and Technologies
70 Fawcett Street
Cambridge MA 02138

Abstract

In this paper, we incorporate the Hierarchical Mixtures of Experts (HME) method of probability estimation, developed by Jordan [1], into an HMM-based continuous speech recognition system. The resulting system can be thought of as a continuous-density HMM system, but instead of using Gaussian mixtures, the HME system employs a large set of hierarchically organized but relatively small neural networks to perform the probability density estimation. The hierarchical structure is reminiscent of a decision tree except for two important differences: each "expert" or neural net performs a "soft" decision rather than a hard decision, and, unlike ordinary decision trees, the parameters of all the neural nets in the HME are automatically trainable using the EM algorithm. We report results on the ARPA 5,000-word and 40,000-word Wall Street Journal corpora using HME models.

1 Introduction

Recent research has shown that a continuous-density HMM (CD-HMM) system can outperform a more constrained tied-mixture HMM system for large-vocabulary continuous speech recognition (CSR) when a large amount of training data is available [2]. In other work, the utility of decision trees has been demonstrated in classification problems by using the "divide and conquer" paradigm effectively, where a problem is divided into a hierarchical set of simpler problems.
We present here a new CD-HMM system which has similar properties and possesses the same advantages as decision trees, but has the additional important advantage of having automatically trainable "soft" decision boundaries.

**MIT, Cambridge MA 02139

2 Hierarchical Mixtures of Experts

The method of Hierarchical Mixtures of Experts (HME) developed recently by Jordan [1] breaks a large-scale task into many small ones by partitioning the input space into a nested set of regions, then building a simple but specific model (local expert) in each region. The idea behind this method follows the principle of divide-and-conquer, which has been utilized in certain approaches to classification problems, such as decision trees. In the decision tree approach, at each level of the tree, the data are divided explicitly into regions. In contrast, the HME model makes use of "soft" splits of the data, i.e., instead of the data being explicitly divided into regions, the data may lie simultaneously in multiple regions with certain probabilities. Therefore, the variance-increasing effect of lopping off distant data in the decision tree can be ameliorated. Furthermore, the "hard" boundaries in the decision tree are fixed once a decision is made, while the "soft" boundaries in the HME are parameterized with generalized sigmoidal functions, which can be adjusted automatically using the Expectation-Maximization (EM) algorithm during the splitting.

Now we describe how to apply the HME methodology to the CSR problem. For each state of a phonetic HMM, a separate HME is used to estimate the likelihood. The actual HME first computes a posterior probability P(l|z, s), the probability of phoneme class l, given the input feature vector z and state s. That probability is then divided by the a priori probability of the phoneme class l at state s.
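To make the computation just described concrete, the following is a minimal NumPy sketch of a one-level HME used as a state-conditional scorer: a softmax gating network over C regions, one softmax expert classifier per region over L phoneme classes, the mixture posterior, and the division by the class prior to obtain a scaled likelihood. The function names, weight shapes, and the uniform prior are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax (a "generalized sigmoidal" function).
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

def hme_scaled_likelihood(z, V, Theta, prior):
    """One-level HME for a single HMM state (illustrative sketch).

    z     : input feature vector, shape (d,)
    V     : gating weights, shape (C, d)    -> P(c_i | z)
    Theta : expert weights, shape (C, L, d) -> P(l | c_i, z)
    prior : a priori phoneme-class probabilities, shape (L,)
    """
    gate = softmax(V @ z)                                   # P(c_i | z), shape (C,)
    experts = np.array([softmax(Th @ z) for Th in Theta])   # P(l | c_i, z), shape (C, L)
    posterior = gate @ experts                              # sum_i P(l | c_i, z) P(c_i | z)
    return posterior / prior                                # scaled likelihood for the HMM

# Toy dimensions and random weights, purely for illustration.
rng = np.random.default_rng(0)
d, C, L = 8, 4, 5
z = rng.normal(size=d)
V = rng.normal(size=(C, d))
Theta = rng.normal(size=(C, L, d))
prior = np.full(L, 1.0 / L)

scores = hme_scaled_likelihood(z, V, Theta, prior)
print(scores.shape)  # one scaled-likelihood score per phoneme class
```

In a two-level HME, each expert in this sketch would itself be another gate-plus-experts block over C subregions, applied recursively.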
A one-level HME performs the following computation:

P(l|z, s) = Σ_{i=1}^{C} P(l|c_i, z, s) P(c_i|z, s)    (1)

where l = 1, ..., L indicates the phoneme class, c_i represents a local region in the input space, and C is the number of regions. P(c_i|z, s) can be viewed as a gating network, while P(l|c_i, z, s) can be viewed as a local expert classifier (expert network) in the region c_i [1]. In a two-level HME, each region c_i is divided in turn into C subregions. The term P(l|c_i, z, s) is then computed in a manner similar to equation (1), and so on. If in some of these subregions there are no data available, we back off to the parent network.

3 Technical Details

As in Jordan's paper, we use a generalized sigmoidal function to parameterize P(c_i|z) as follows:

P(c_i|z) = exp(v_i' z) / Σ_{k=1}^{C} exp(v_k' z)    (2)

where z can be the direct input (in a one-layer neural net) or the hidden-layer vector (in a two-layer neural net), and v_i, i = 1, ..., C are weights which need to be trained. Similarly, the local phoneme classifier in region c_i, P(l|c_i, z), can be parameterized with a generalized sigmoidal function also:

P(l|c_i, z) = exp(θ_{li}' z) / Σ_{j=1}^{L} exp(θ_{ji}' z)    (3)

where θ_{ji}, j = 1, ..., L are weights. The whole system consists of two sets of parameters: v_i, i = 1, ..., C, and θ_{ji}, j = 1, ..., L, i.e., Θ = {θ_{ji}, v_i}. All parameters are estimated by using the EM algorithm.

The EM is an iterative approach to maximum likelihood estimation. Each iteration of an EM algorithm is composed of two steps: an Expectation (E) step and a Maximization (M) step. The M step involves the maximization of a likelihood function that is redefined in each iteration by the E step. Using the parameterizations in (2) and (3), we obtain the following iterative procedure for computing parameters Θ = {v_i, θ_{ji}}:

1. Initialize v_i^(0) and θ_{ji}^(0) for i = 1, ..., C, j = 1, ..., L.

2. E-step: In each iteration n, for each data pair (z(t), l(t)), t = 1, ..., N, compute

z_i(t|n) = P(c_i|z(t), l(t), Θ^(n))
         = P(c_i|z(t), v_i^(n)) P(l(t)|c_i, z(t), θ^(n)_{l(t),i}) / Σ_{k=1}^{C} P(c_k|z(t), v_k^(n)) P(l(t)|c_k, z(t), θ^(n)_{l(t),k})    (4)

where i = 1, ..., C. z_i(t)