{"title": "Forward-Decoding Kernel-Based Phone Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1189, "page_last": 1196, "abstract": null, "full_text": "Forward-Decoding Kernel-Based \n\nPhone Sequence Recognition \n\nShantanu Chakrabartty and Gert Cauwenberghs \n\nCenter for Language and Speech Processing \n\nDepartment of Electrical and Computer Engineering \n\nJohns Hopkins University, Baltimore MD 21218 \n\n{shantanu,gert}@jhu.edu \n\nAbstract \n\nForward decoding kernel machines (FDKM) combine large-margin clas(cid:173)\nsifiers with hidden Markov models (HMM) for maximum a posteriori \n(MAP) adaptive sequence estimation. State transitions in the sequence \nare conditioned on observed data using a kernel-based probability model \ntrained with a recursive scheme that deals effectively with noisy and par(cid:173)\ntially labeled data. Training over very large data sets is accomplished us(cid:173)\ning a sparse probabilistic support vector machine (SVM) model based on \nquadratic entropy, and an on-line stochastic steepest descent algorithm. \nFor speaker-independent continuous phone recognition, FDKM trained \nover 177 ,080 samples of the TlMIT database achieves 80.6% recognition \naccuracy over the full test set, without use of a prior phonetic language \nmodel. \n\n1 Introduction \n\nSequence estimation is at the core of many problems in pattern recognition, most notably \nspeech and language processing. Recognizing dynamic patterns in sequential data requires \na set of tools very different from classifiers trained to recognize static patterns in data \nassumed i.i.d. distributed over time. \nThe speech recognition community has predominantly relied on hidden Markov models \n(HMMs) [1] to produce state-of-the-art results. HMMs are generative models that function \nby estimating probability densities and therefore require a large amount of data to estimate \nparameters reliably. If the aim is discrimination between classes, then it might be sufficient \nto model discrimination boundaries between classes which (in most affine cases) afford \nfewer parameters. \nRecurrent neural networks have been used to extend the dynamic modeling power of \nHMMs with the discriminant nature of neural networks [2], but learning long term depen(cid:173)\ndencies remains a challenging problem [3]. Typically, neural network training algorithms \nare prone to local optima, and while they work well in many situations, the quality and \nconsistency of the converged solution cannot be warranted. \nLarge margin classifiers, like support vector machines, have been the subject of intensive \nresearch in the neural network and artificial intelligence communities [4]. They are attrac(cid:173)\ntive because they generalize well even with relatively few data points in the training set, and \nbounds on the generalization error can be directly obtained from the training data. Under \ngeneral conditions, the training procedure finds a unique solution (decision or regression \nsurface) that provides an out-of-sample performance superior to many techniques. \nRecently, support vector machines (SVMs) [4] have been used for phoneme (or phone) \nrecognition [5] and have shown encouraging results. However, use of a standard SVM \n\n\fP(xI1) \n\nP(xIO) \n\nP(111 ) \n\nP(OIO) \n\nP(110) \n\n(a) \n\nP(110,x) \n(b) \n\nFigure 1: (a) Two state Markovian maximum-likehood (ML) model with static state transi(cid:173)\ntion probabilities and observation vectors xemittedfrom the states. 
However, use of a standard SVM classifier by itself implicitly assumes i.i.d. data, unlike the sequential nature of phones. To model inter-phonetic dependencies, maximum likelihood (ML) approaches assume a phonetic language model that is independent of the utterance data [6], as illustrated in Figure 1(a). In contrast, the maximum a posteriori (MAP) approach assumes transitions between states that are directly modulated by the observed data, as illustrated in Figure 1(b). The MAP approach lends itself naturally to hybrid HMM/connectionist approaches with performance comparable to state-of-the-art HMM systems [7].

FDKM [8] can be seen as a hybrid HMM/SVM MAP approach to sequence estimation. It thereby augments the ability of large-margin classifiers to infer sequential properties of the data. FDKMs have shown superior performance for channel equalization in digital communication, where the received symbol sequence is contaminated by intersymbol interference [8].

In the present paper, FDKM is applied to speaker-independent continuous phone recognition. To handle the vast amount of data in the TIMIT corpus, we present a sparse probabilistic model and an efficient implementation of the associated FDKM training procedure.

2 FDKM formulation

The problem of FDKM recognition is formulated in the framework of MAP (maximum a posteriori) estimation, combining Markovian dynamics with kernel machines. A Markovian model is assumed with symbols belonging to S classes, as illustrated in Figure 1(a) for S = 2. Transitions between the classes are modulated in probability by observation (data) vectors x over time.

2.1 Decoding Formulation

The MAP forward decoder receives the sequence X[n] = {x[n], x[n-1], ..., x[1]} and produces an estimate of the probability of the state variable q[n] over all classes i, α_i[n] = P(q[n] = i | X[n], w), where w denotes the set of parameters for the learning machine. Unlike hidden Markov models, the states directly encode the symbols, and the observations x modulate transition probabilities between states [7]. Estimates of the posterior probability α_i[n] are obtained from estimates of local transition probabilities using the forward-decoding procedure [7]

    α_i[n] = Σ_{j=0}^{S-1} P_ij[n] α_j[n-1]                                        (1)

where P_ij[n] = P(q[n] = i | q[n-1] = j, x[n], w) denotes the probability of making a transition from class j at time n-1 to class i at time n, given the current observation vector x[n]. The forward decoding (1) embeds sequential dependence of the data, wherein the probability estimate at time instant n depends on all the previous data. An on-line estimate of the symbol q[n] is thus obtained:

    q^est[n] = arg max_i α_i[n]                                                    (2)

The BCJR forward-backward algorithm [9] produces in principle a better estimate that accounts for future context, but requires a backward pass through the data, which is impractical in many applications requiring real-time decoding.

Accurate estimation of the transition probabilities P_ij[n] in (1) is crucial for the decoding (2) to provide good performance. In [8] we used kernel logistic regression [10], with regularized maximum cross-entropy, to model the conditional probabilities. A different probabilistic model that offers a sparser representation is introduced below.
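As an illustration of the forward recursion (1) and the on-line decision rule (2), the following Python sketch can be used; the transition_probs callable and the initial distribution alpha0 are placeholders for the kernel-based probability model introduced below, not quantities defined in the paper:

    import numpy as np

    def forward_decode(X, transition_probs, alpha0):
        """X: sequence of observation vectors x[1..N]; alpha0: initial class probabilities.
        transition_probs(x) must return a matrix P with P[i, j] = P_ij (columns sum to 1)."""
        alpha = np.asarray(alpha0, dtype=float)
        decisions = []
        for x in X:
            P = transition_probs(x)                  # data-modulated transition probabilities
            alpha = P @ alpha                        # eq. (1): alpha_i[n] = sum_j P_ij[n] alpha_j[n-1]
            alpha /= alpha.sum()                     # renormalize to absorb numerical round-off
            decisions.append(int(np.argmax(alpha)))  # eq. (2): on-line symbol estimate
        return decisions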
2.2 Training Formulation

For training the MAP forward decoder, we assume access to a training sequence with labels (class memberships). For instance, the TIMIT speech database comes labeled with phonemes. Continuous (soft) labels could be assigned rather than binary indicator labels, to signify uncertainty in the training data over the classes. Like probabilities, label assignments are normalized: Σ_{i=0}^{S-1} y_i[n] = 1, y_i[n] ≥ 0.

The objective of training is to maximize the cross-entropy of the estimated probabilities α_i[n] given by (1) with respect to the labels y_i[n] over all classes i and training data n

    H = Σ_{n=0}^{N-1} Σ_{i=0}^{S-1} y_i[n] log α_i[n]                              (3)

To provide capacity control we introduce a regularizer Ω(w) in the objective function [11]. The parameter space w can be partitioned into disjoint parameter vectors w_ij and b_ij for each pair of classes i, j = 0, ..., S-1, such that P_ij[n] depends only on w_ij and b_ij. (The parameter b_ij corresponds to the bias term in the standard SVM formulation.) The regularizer can then be chosen as the L2 norm of each disjoint parameter vector, and the objective function becomes

    H = C Σ_{n=0}^{N-1} Σ_{i=0}^{S-1} y_i[n] log α_i[n] - (1/2) Σ_{j=0}^{S-1} Σ_{i=0}^{S-1} |w_ij|^2     (4)

where the regularization parameter C controls complexity versus generalization as a bias-variance trade-off [11]. The objective function (4) is similar to the primal formulation of a large-margin classifier [4]. Unlike the convex (quadratic) cost function of SVMs, the formulation (4) does not have a unique solution, and direct optimization could lead to poor local optima. However, a lower bound of the objective function can be formulated so that maximizing this lower bound reduces to a set of convex optimization sub-problems with an elegant dual formulation in terms of support vectors and kernels. Applying the convexity of the -log(.) function to the convex sum in the forward estimation (1), we obtain directly

    H ≥ Σ_{j=0}^{S-1} H_j                                                          (5)

where

    H_j = Σ_{n=0}^{N-1} C_j[n] Σ_{i=0}^{S-1} y_i[n] log P_ij[n] - (1/2) Σ_{i=0}^{S-1} |w_ij|^2           (6)

with effective regularization sequence

    C_j[n] = C α_j[n-1] .                                                          (7)

Disregarding the intricate dependence of (7) on the results of (6), which we defer to the following section, the formulation (6) is equivalent to regression of conditional probabilities P_ij[n] from labeled data x[n] and y_i[n], for a given outgoing state j.

2.3 Kernel Logistic Probability Regression

Estimation of conditional probabilities Pr(i|x) from training data x[n] and labels y_i[n] can be obtained using a regularized form of kernel logistic regression [10]. For each outgoing state j, one such probabilistic model can be constructed for the incoming state i conditional on x[n]:

    P_ij[n] = exp(f_ij(x[n])) / Σ_{s=0}^{S-1} exp(f_sj(x[n]))                      (8)

As with SVMs, dot products in the expression for f_ij(x) in (8) convert into kernel expansions over the training data x[m] by transforming the data to feature space [12]

    f_ij(x) = w_ij · x + b_ij
            = Σ_m λ_ij^m x[m] · x + b_ij
            → Σ_m λ_ij^m K(x[m], x) + b_ij

with kernel K(x, y) = Φ(x) · Φ(y). The map Φ(·) need not be computed explicitly, as it only appears in inner-product form.
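A minimal Python sketch of the probability model (8) with the kernel expansion for f_ij; the Gaussian kernel choice and the array names Lam and b are illustrative assumptions, not specified at this point in the paper:

    import numpy as np

    def gaussian_kernel(x, xm, sigma=1.0):
        return np.exp(-np.sum((x - xm) ** 2) / (2.0 * sigma ** 2))

    def transition_probs_kernel(x, X_train, Lam, b, sigma=1.0):
        """Lam[m, i, j] plays the role of lambda_ij^m, b[i, j] of b_ij.
        Returns P with P[i, j] = P_ij (eq. (8)): softmax over i for each outgoing state j."""
        k = np.array([gaussian_kernel(x, xm, sigma) for xm in X_train])   # K(x[m], x), m = 1..M
        f = np.tensordot(k, Lam, axes=(0, 0)) + b                          # f[i, j] = sum_m Lam[m,i,j] k[m] + b[i,j]
        e = np.exp(f - f.max(axis=0))
        return e / e.sum(axis=0)

A callable of this form can serve as the transition_probs argument in the forward-decoding sketch above.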
2.4 GiniSVM formulation

The GiniSVM probabilistic model [15] provides a sparse alternative to logistic regression. A quadratic ('Gini' [16]) index replaces entropy in the dual formulation of logistic regression. The 'Gini' index provides a lower bound of the dual logistic functional, and its quadratic form produces sparse solutions as with support vector machines. The tightness of the bound provides an elegant trade-off between approximation and sparsity.

Jensen's inequality (log p ≤ p - 1) formulates the lower bound for the entropy term in (11) in the form of the multivariate Gini impurity index [16]:

    1 - Σ_{i=1}^{M} p_i^2  ≤  - Σ_{i=1}^{M} p_i log p_i                            (15)

where 0 ≤ p_i ≤ 1 for all i, and Σ_i p_i = 1. Both forms of entropy, -Σ_i p_i log p_i and 1 - Σ_i p_i^2, reach their maxima at the same values p_i = 1/M, corresponding to a uniform distribution. As in the binary case, the bound can be tightened by scaling the Gini index with a multiplicative factor γ ≥ 1, of which the particular value depends on M.² The GiniSVM dual cost function H_g is then given by

    H_g = Σ_{i=1}^{M} [ (1/2) Σ_{l=1}^{N} Σ_{m=1}^{N} λ_i^l Q_lm λ_i^m + γC ( Σ_{m=1}^{N} (y_i[m] - λ_i^m / C)^2 - 1 ) ]   (16)

The convex quadratic cost function (16) with the constraints in (11) can now be minimized directly using standard quadratic programming techniques. The primary advantage of the technique is that it yields sparse solutions and yet approximates the logistic regression solution very well [15].

2.5 Online GiniSVM Training

For very large data sets such as TIMIT, using a QP approach to train GiniSVM may still be prohibitive, even though the sparsity of the trained model drastically reduces the number of support vectors. An on-line estimation procedure is presented that computes each coefficient λ_i in turn from a single presentation of the data {x[n], y_i[n]}. A line search in the parameter λ_i and the bias b_i performs stochastic steepest descent of the dual objective function (16), yielding a thresholded update of each coefficient λ_i^n (17) together with the bias update

    b_i ← b_i + Σ_{l=1}^{n} λ_i^l ,                                                (18)

where [x]_+ denotes the positive part of x. The normalization factor z_n is determined by the equation

    Σ_{i=1}^{M} [ C y_i[n] (Q_nn + 2) + f_i[n] + 2 Σ_l λ_i^l - z_n ]_+ = C (Q_nn + 2) + 2γ ,   (19)

solved in at most M algorithmic iterations.
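A small numeric check of the bound (15) illustrates why the quadratic Gini index can stand in for the entropy (an illustrative sketch, not from the paper):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz))

    def gini(p):
        return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

    p = np.array([0.7, 0.2, 0.1])
    assert gini(p) <= entropy(p)          # eq. (15): the Gini index lower-bounds the entropy

    u = np.ones(3) / 3.0                  # both reach their maximum at the uniform distribution
    print(entropy(u), np.log(3))          # log M
    print(gini(u), 1 - 1.0 / 3)           # (M - 1)/M; a factor gamma >= 1 rescales the Gini index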
3 Recursive FDKM Training

The weights (7) in (6) are recursively estimated using an iterative procedure reminiscent of (but different from) expectation-maximization. The procedure involves computing new estimates of the sequence α_j[n-1] to train (6), based on estimates of P_ij using previous values of the parameters λ_ij. The training proceeds in a series of epochs, each refining the estimate of the sequence α_j[n-1] by increasing the size of the time window (decoding depth k) over which it is obtained by the forward algorithm (1).

Figure 2: Iterations involved in training FDKM on a trellis based on the Markov model of Figure 1. During the initial epoch, parameters of the probabilistic model of the state at time n, conditioned on the observed label for the outgoing state at time n-1, are trained from observed labels at time n. During subsequent epochs, probability estimates of the outgoing state at time n-1 over increasing forward decoding depth k = 1, ..., K determine the weights assigned to data n for training each of the probabilistic models conditioned on the outgoing state.

The training steps are illustrated in Figure 2 and summarized as follows:

1. To bootstrap the iteration for the first training epoch (k = 1), obtain initial values for α_j[n-1] from the labels of the outgoing state, α_j[n-1] = y_j[n-1]. This corresponds to taking the labels y_j[n-1] as true state probabilities, as in the standard procedure of using fragmented data to estimate transition probabilities.

2. Train logistic kernel machines, one for each outgoing class j, to estimate the parameters in P_ij[n], i, j = 1, ..., S, from the training data x[n] and labels y_i[n], weighted by the sequence α_j[n-1].

3. Re-estimate α_j[n-1] using the forward algorithm (1) over increasing decoding depth k, by initializing α_j[n-k] to y_j[n-k].

4. Re-train, increment the decoding depth k, and re-estimate α_j[n-1], until the final decoding depth is reached (k = K).

The performance of FDKM training depends on the final decoding depth K, although the observed variations in generalization performance for large values of K are relatively small. A suitable value can be chosen a priori to match the extent of temporal dependency in the data. For phoneme classification in speech, the decoding depth can be chosen according to the length of a typical syllable.

An efficient procedure to implement the above algorithm is discussed in [15].

4 Experiments and Results

The performance of FDKM was evaluated on the full TIMIT dataset [17], consisting of labeled continuous spoken utterances. The 60 phone classes presented in TIMIT were first collapsed onto 39 classes according to standard folding techniques [6]. The training set consisted of 6,300 sentences spoken by 630 speakers, resulting in 177,080 phone instances. The test set consisted of 192 sentences spoken by 24 speakers.

The speech signal was first processed by a pre-emphasis filter with transfer function 1 - 0.97 z^-1. Subsequently, a 25 ms Hamming window was applied over 10 ms shifts to extract a sequence of phonetic segments. Cepstral coefficients were extracted from the sequence and combined with their first and second order time differences into a 39-dimensional vector. Cepstral mean subtraction and speaker normalization were subsequently applied.

²Unlike the binary case (M = 2), the factor γ for general M cannot be chosen to match the two maxima at p_i = 1/M.

Table 1: Performance Evaluation of FDKM (K = 10) on TIMIT

Machine | Accuracy | Insertion | Substitution | Deletion | Errors
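A rough Python sketch of the front-end described above, using the parameters stated in the text (pre-emphasis 1 - 0.97 z^-1, 25 ms Hamming windows at 10 ms shifts, 39-dimensional vectors); the 16 kHz sampling rate, 13 cepstral coefficients, use of a simple real cepstrum instead of a mel filterbank, and omission of speaker normalization are assumptions and simplifications:

    import numpy as np
    from scipy.fft import dct

    def frontend(signal, fs=16000, n_cep=13):
        # Pre-emphasis filter 1 - 0.97 z^-1
        s = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # 25 ms Hamming windows at 10 ms shifts
        win, hop = int(0.025 * fs), int(0.010 * fs)
        window = np.hamming(win)
        frames = [s[i:i + win] * window
                  for i in range(0, len(s) - win + 1, hop)]
        # Cepstral coefficients per frame (real cepstrum via log power spectrum + DCT)
        ceps = []
        for fr in frames:
            logspec = np.log(np.abs(np.fft.rfft(fr)) ** 2 + 1e-10)
            ceps.append(dct(logspec, type=2, norm='ortho')[:n_cep])
        C = np.array(ceps)                            # (T, 13)
        # First and second order time differences, stacked to 39 dimensions
        d1 = np.gradient(C, axis=0)
        d2 = np.gradient(d1, axis=0)
        X = np.concatenate([C, d1, d2], axis=1)       # (T, 39)
        return X - X.mean(axis=0)                     # cepstral mean subtraction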