{"title": "New Approaches Towards Robust and Adaptive Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 751, "page_last": 757, "abstract": "", "full_text": "New Approaches Towards Robust and Adaptive Speech Recognition \n\nHerve Bourlard, Samy Bengio and Katrin Weber \n\nIDIAP \nP.O. Box 592, rue du Simplon 4 \n1920 Martigny, Switzerland \n{bourlard, bengio, weber}@idiap.ch \n\nAbstract \n\nIn this paper, we discuss some new research directions in automatic speech recognition (ASR) which deviate somewhat from the usual approaches. More specifically, we motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) \"experts\", each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state-specific feature-based HMMs responsible for merging the stream information and modeling their possible correlation. \n\n1 Multi-Channel Processing in ASR \n\nCurrent automatic speech recognition systems are based on (context-dependent or context-independent) phone models described in terms of a sequence of hidden Markov model (HMM) states, where each HMM state is assumed to be characterized by a stationary probability density function. 
Furthermore, time correlation, and consequently the dynamics of the signal, inside each HMM state is also usually disregarded (although the use of temporal delta and delta-delta features can capture some of this correlation). Consequently, only medium-term dependencies are captured via the topology of the HMM model, while short-term and long-term dependencies are usually very poorly modeled. Ideally, we want to design a particular HMM able to accommodate multiple time-scale characteristics so that we can capture phonetic properties, as well as syllable structures and (long-term) invariants that are more robust to noise. It is, however, clear that those different time-scale features will also exhibit different levels of stationarity and will require different HMM topologies to capture their dynamics. \n\nThere are many potential advantages to such a multi-stream approach, including: \n\n1. The definition of a principled way to merge different temporal knowledge sources such as acoustic and visual inputs, even if the temporal sequences are not synchronous and do not have the same data rate - see [13] for further discussion about this. \n\n2. The possibility to incorporate multiple time resolutions (as part of a structure with multiple unit lengths, such as phone and syllable). \n\n3. As a particular case of multi-stream processing, multi-band ASR [2, 5], involving the independent processing and combination of partial frequency bands, has many potential advantages, briefly discussed below. \n\nIn the following, we will not discuss the underlying algorithms (more or less \"complex\" variants of Viterbi decoding), nor detailed experimental results (see, e.g., [4] for recent results). Instead, we will mainly focus on the combination strategy and discuss different variants around the same formalism. 
\n\n2 Multiband-based ASR \n\n2.1 General Formalism \n\nAs a particular case of the multi-stream paradigm, we have been investigating an ASR approach based on independent processing and combination of frequency subbands. The general idea, as illustrated in Fig. 1, is to split the whole frequency band (represented in terms of critical bands) into a few subbands on which different recognizers are independently applied. The resulting probabilities are then combined for recognition later in the process at some segmental level. Starting from critical bands, acoustic processing is now performed independently for each frequency band, yielding K input streams, each being associated with a particular frequency band. \n\nFigure 1: Typical multiband-based ASR architecture. In multi-band speech recognition, the frequency range is split into several bands, and information in the bands is used for phonetic probability estimation by independent modules. These probabilities are then combined for recognition later in the process at some segmental level. \n\nIn this case, each of the K sub-recognizers (channels) is now using the information contained in a specific frequency band X^k = {x_1^k, x_2^k, ..., x_n^k, ..., x_N^k}, where each x_n^k represents the acoustic (spectral) vector at time n in the k-th stream. \n\nIn the case of hybrid HMM/ANN systems, HMM local emission (posterior) probabilities are estimated by an artificial neural network (ANN), estimating P(q_j|x_n), where q_j is an HMM state and x_n = (x_n^1, ..., x_n^k, ..., x_n^K)^t the feature vector at time n. 
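The splitting of a critical-band frame into K subband streams described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the 15-channel frame and the band boundaries are invented for the example.

```python
import numpy as np

def split_into_subbands(x_n, boundaries):
    """Split one spectral frame x_n into contiguous subband vectors x_n^k.

    boundaries: indices where each new subband starts (hypothetical values).
    Returns a list of K arrays, one per frequency stream.
    """
    streams = []
    start = 0
    for end in boundaries + [len(x_n)]:
        streams.append(x_n[start:end])
        start = end
    return streams

# One critical-band frame with 15 channels (assumed dimensionality).
x_n = np.arange(15.0)
# K = 3 streams, split at channels 5 and 10 (assumed boundaries).
subbands = split_into_subbands(x_n, [5, 10])
```

Each stream would then be fed to its own band-specific recognizer, as in Fig. 1.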
\n\nIn the case of multi-stream (or subband-based) HMM/ANN systems, different ANNs will compute state-specific stream posteriors P(q_j|x_n^k). Combination of these local posteriors can then be performed at different temporal levels, and in many ways, including [2]: untrained linear or trained linear (e.g., as a function of automatically estimated local SNR) functions, as well as trained nonlinear functions (e.g., by using a neural network). In the simplest case, this subband posterior recombination is performed at the HMM state level, which then amounts to performing a standard Viterbi decoding in which local (log) probabilities are obtained from a linear or nonlinear combination of the local subband probabilities. For example, in the initial subband-based ASR, local posteriors P(q_j|x_n) were estimated according to: \n\nP(q_j|x_n) = sum_{k=1}^{K} w_k P(q_j|x_n^k, Θ_k)   (1) \n\nwhere, in our case, each P(q_j|x_n^k, Θ_k) is computed with a band-specific ANN of parameters Θ_k and with x_n^k (possibly with temporal context) at its input. The weighting factors can be assigned a uniform distribution (already performing very well [2]) or be proportional to the estimated SNR. Over the last few years, several results were reported showing that such a simple approach was usually more robust to band-limited noise. \n\n2.2 Motivations and Drawbacks \n\nThe multi-band approach briefly discussed above has several potential advantages, summarized here. \n\nBetter robustness to band-limited noise - The signal may be impaired (e.g., by noise, channel characteristics, reverberation, ...) only in some specific frequency bands. When recognition is based on several independent decisions from different frequency subbands, the decoding of a linguistic message need not be severely impaired, as long as the remaining clean subbands supply sufficiently reliable information. This was confirmed by several experiments (see, e.g., [2]). 
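The state-level linear recombination of equation (1) can be sketched in a few lines. The per-band posterior values below are made up for illustration; in the paper each row would come from a band-specific ANN with parameters Θ_k.

```python
import numpy as np

def combine_state_posteriors(band_posteriors, weights):
    """Equation (1): P(q_j|x_n) = sum_k w_k P(q_j|x_n^k, Theta_k).

    band_posteriors: (K, J) array, one row of P(q_j|x_n^k) per band.
    weights: (K,) array summing to 1 (e.g., uniform or SNR-based).
    Returns the (J,) combined state posterior.
    """
    return weights @ band_posteriors

# K = 4 bands, J = 3 states; posterior values are illustrative only.
band_post = np.array([[0.7, 0.2, 0.1],
                      [0.5, 0.3, 0.2],
                      [0.6, 0.3, 0.1],
                      [0.4, 0.4, 0.2]])
w = np.full(4, 1.0 / 4)            # uniform weighting, as in [2]
p = combine_state_posteriors(band_post, w)
```

Since each row and the weights sum to one, the combined vector is again a valid posterior, and its log can be used directly as the local score in Viterbi decoding.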
Surprisingly, even when the combination is simply performed at the HMM state level, it is observed that the multi-band approach yields better performance and noise robustness than a regular full-band system. \n\nSimilar conclusions were also observed in the framework of the missing feature theory [7, 9]. In this case, it was shown that, if one knows the position of the noisy features, significantly better classification performance could be achieved by disregarding the noisy data (using marginal distributions) or by integrating over all possible values of the missing data conditionally on the clean features - see Section 3 for further discussion about this. \n\nBetter modeling - Subband modeling will usually be more robust. Indeed, since the dimension of each (subband) feature space is smaller, it is easier to estimate reliable statistics (resulting in a more robust parametrization). Moreover, the all-pole modeling usually used in ASR will be more robust if performed on subbands, i.e., in lower dimensional spaces, than on the full-band signal [12]. \n\nChannel asynchrony - Transitions between more stationary segments of speech do not necessarily occur at the same time across the different frequency bands [8], which makes the piecewise stationary assumption more fragile. The subband approach may have the potential of relaxing the synchrony constraint inherent in current HMM systems. \n\nChannel-specific processing and modeling - Different recognition strategies might ultimately be applied in different subbands. For example, different time/frequency resolution tradeoffs could be chosen (e.g., time resolution and width of analysis window depending on the frequency subband). Finally, some subbands may be inherently better for certain classes of speech sounds than others. 
\n\nMajor objections and drawbacks - One of the common objections [8] to this separate modeling of each frequency band has been that important information in the form of correlation between bands may be lost. Although this may be true, several studies [8], as well as the good recognition rates achieved on small frequency bands [3, 6], tend to show that most of the phonetic information is preserved in each frequency band (possibly provided that we have enough temporal information). This drawback will be addressed by the method presented next. \n\n3 Full Combination Subband ASR \n\nIf we know where the noise is, then, based on the results obtained with missing data [7, 9], impressive noise robustness can be achieved by using the marginal distribution, estimating the HMM emission probability based on the clean frequency bands only. In our subband approach, we do not assume that we know, or detect explicitly, where the noise is. Following the above developments and discussions, it thus seems reasonable to integrate over all possible positions of the noisy bands, and thus to simultaneously deal with all the L = 2^K possible subband combinations S_n^l (with l = 1, ..., L, and also including the empty set) extracted from the feature vector x_n. Introducing the hidden variable E_n^l, representing the statistical (mutually exclusive and exhaustive) event that the feature subset S_n^l is \"clean\" (reliable), and integrating over all its possible values, we can then rewrite the local posterior probability as: \n\nP(q_j|x_n, Θ) = sum_{l=1}^{L} P(q_j, E_n^l|x_n, Θ) = sum_{l=1}^{L} P(q_j|E_n^l, x_n, Θ_l) P(E_n^l|x_n) = sum_{l=1}^{L} P(q_j|S_n^l, Θ_l) P(E_n^l|x_n)   (2) \n\nwhere P(E_n^l|x_n) represents the relative reliability of a specific feature set. Θ represents the whole parameter space, while Θ_l denotes the set of (ANN) parameters used to compute the subband posteriors. 
\n\nTypically, training of the L neural nets would be done once and for all on clean data, and the recognizer would then be adapted online simply by adjusting the weights P(E_n^l|x_n) (still representing a limited set of L parameters) to increase the global posteriors. This adaptation can be performed by online estimation of the signal-to-noise ratio or by online, unsupervised, EM adaptation. \n\nWhile it is pretty easy to quickly estimate any subband likelihood or marginal distribution when working with Gaussian or multi-Gaussian densities [7], straightforward implementation of (2) is not always tractable since it requires the use (and training) of L neural networks to estimate all the posteriors P(q_j|S_n^l, Θ_l). However, it has the advantage of not requiring the subband independence assumption [3]. \n\nAn interesting approximation to this \"optimal\" solution, though, consists in simply using the neural nets that are available (K of them in the case of baseline subband ASR) and, re-introducing the independence assumption, approximating all the other subband combination probabilities in (2) as follows [3, 4]: \n\nP(q_j|S_n^l, Θ_l) = P(q_j) prod_{k in S^l} [P(q_j|x_n^k, Θ_k) / P(q_j)]   (3) \n\nExperimental results obtained from this Full Combination approach in different noisy conditions are reported in [3, 4], where the performance of this above approximation was also compared to the \"optimal\" estimators (2). Interestingly, it was shown that this independence assumption did not hurt much and that the resulting recognition performance was similar to the performance obtained by training and recombining all possible L nets (and significantly better than the original subband approach). In both cases, the recognition rate and the robustness to noise were greatly improved compared to the initial subband approach. 
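The full-combination approximation of equations (2)-(3) can be sketched as follows: for each of the L = 2^K subsets, a posterior is built from the K available band posteriors under the independence assumption, and the subsets are then weighted by their reliabilities P(E_n^l|x_n). All numeric values (band posteriors, uniform reliabilities) are illustrative assumptions, not the paper's data.

```python
import itertools
import numpy as np

def subset_posterior(band_post, prior, subset):
    """Equation (3): P(q_j) * prod_{k in S^l} P(q_j|x_n^k) / P(q_j).

    The empty subset falls back to the prior; the result is renormalized
    so that it remains a proper posterior over states.
    """
    p = prior.copy()
    for k in subset:
        p = p * band_post[k] / prior
    return p / p.sum()

def full_combination(band_post, prior, reliabilities):
    """Equation (2): reliability-weighted sum over all 2^K subsets."""
    K = band_post.shape[0]
    subsets = [s for r in range(K + 1)
               for s in itertools.combinations(range(K), r)]
    out = np.zeros_like(prior)
    for subset, rel in zip(subsets, reliabilities):
        out += rel * subset_posterior(band_post, prior, subset)
    return out

# K = 2 bands, J = 3 states; values invented for illustration.
band_post = np.array([[0.60, 0.30, 0.10],
                      [0.50, 0.25, 0.25]])
prior = np.array([1/3, 1/3, 1/3])
rel = np.full(2 ** 2, 1.0 / 2 ** 2)   # uniform reliabilities P(E_n^l|x_n)
posterior = full_combination(band_post, prior, rel)
```

In practice only the K band posteriors are needed, which is what makes this approximation so much cheaper than training all L nets.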
This further confirms that we do not seem to lose \"critically\" important information when neglecting the correlation between bands. \n\nIn the next section, we briefly introduce a further extension of this approach where the segmentation into subbands is no longer done explicitly, but is achieved dynamically over time, and where the integration over all possible frequency segmentations is part of the same formalism. \n\n4 HMM2: Mixture of HMMs \n\nHMM emission probabilities are typically modeled through Gaussian mixtures or neural networks. We propose here an alternative approach, referred to as HMM2, integrating standard HMMs (referred to as \"temporal HMMs\") with state-dependent feature-based HMMs (referred to as \"feature HMMs\") responsible for the estimation of the emission probabilities. In this case, each feature vector x_n at time n is considered as a fixed-length sequence, which has supposedly been generated by an HMM specific to the temporal HMM state, each state of which emits individual feature components that are modeled by, e.g., one-dimensional Gaussian mixtures. The feature HMM thus looks at all possible subband segmentations and automatically performs the combination of the likelihoods to yield a single emission probability. \n\nThe resulting architecture is illustrated in Figure 2. In this example, the HMM2 is composed of an HMM that handles sequences of features through time. This HMM is composed of 3 left-to-right connected states (q1, q2 and q3) and each state emits a vector of features at each time step. The particularity of an HMM2 is that each state uses an HMM to emit the feature vector, as if it were an ordered sequence (instead of a vector). In Figure 2, state q2 contains a feature HMM with 4 states connected top-down. 
Of course, while the temporal HMM usually has a left-to-right structure, the topology of the feature HMM can take many forms, which will then reflect the correlation being captured by the model. The feature HMM could even have more states than feature components, in which case \"high-order\" correlation information could be extracted. \n\nIn [1], an EM algorithm to jointly train all the parameters of such an HMM2 in order to maximize the data likelihood has been derived. This derivation was based on the fact that an HMM2 can be considered as a mixture of mixtures of distributions. \n\nWe believe that HMM2 (which includes the classical mixture-of-Gaussians HMM as a particular case) has several potential advantages, including: \n\n1. Better feature correlation modeling through the feature-based (frequency) HMM topology. Also, the complexity of this topology and the probability density function associated with each state easily control the number of parameters. \n\n2. Automatic non-linear spectral warping. In the same way the conventional HMM does time warping and time integration, the feature-based HMM performs frequency warping and frequency integration. \n\n3. Dynamic formant trajectory modelling. As further discussed below, the HMM2 structure has the potential to extract some relevant formant structure information, which is often considered as important to robust speech recognition. \n\nTo illustrate the last point and its relationship with dynamic multi-band ASR, the HMM2 model was used in [14] to extract formant-like information. All the parameters of the HMM2 models were trained according to the above EM algorithm on delta-frequency features (differences of two consecutive log Rasta PLP coefficients). The feature HMM had a simple top-down topology with 4 states. 
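The frequency-axis decoding performed by such a 4-state top-down feature HMM can be sketched as a small Viterbi pass over the components of a single feature vector, recording where the state changes (the segmentation positions). The Gaussian parameters, the uniform stay/move probability, and the 8-component feature vector are all invented for illustration; they are not the trained parameters of [14].

```python
import numpy as np

def viterbi_topdown(x, means, stds, stay=0.5):
    """Viterbi decoding of feature components x (length D) with S
    strictly top-down states (each state can only stay or move down)."""
    S, D = len(means), len(x)
    # Per-state Gaussian log-likelihood of each feature component.
    loglik = -0.5 * ((x[None, :] - means[:, None]) / stds[:, None]) ** 2 \
             - np.log(stds[:, None])
    delta = np.full((S, D), -np.inf)
    back = np.zeros((S, D), dtype=int)
    delta[0, 0] = loglik[0, 0]          # decoding starts in the top state
    for d in range(1, D):
        for s in range(S):
            cands = {s: delta[s, d - 1] + np.log(stay)}
            if s > 0:
                cands[s - 1] = delta[s - 1, d - 1] + np.log(1 - stay)
            prev = max(cands, key=cands.get)
            delta[s, d] = cands[prev] + loglik[s, d]
            back[s, d] = prev
    path = [S - 1]                      # decoding ends in the bottom state
    for d in range(D - 1, 0, -1):
        path.append(back[path[-1], d])
    return [int(s) for s in path[::-1]]

# 8 feature components clustering around 4 state means (assumed values).
x = np.array([0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 3.0, 3.2])
means = np.array([0.0, 1.0, 2.0, 3.0])
stds = np.ones(4)
states = viterbi_topdown(x, means, stds)
# Positions where the feature HMM changes state: formant-like markers.
changes = [d for d in range(1, len(states)) if states[d] != states[d - 1]]
```

With 4 states forced to traverse the whole vector, the decoding yields exactly 3 state-change positions per frame, which is the formant-like information exploited in [14].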
After training, Figure 3 shows (on unseen test data) the value of the features for the phoneme iy as well as the segmentation found by a Viterbi decoding along the delta-frequency axis (the thick black lines). At each time step, we kept the 3 positions where the delta-frequency HMM changed its state during decoding (for instance, at the first time frame, the HMM goes from state 1 to state 2 after the third feature). We believe they contain formant-like information. In [14], it has been shown that the use of that information could significantly enhance standard speech recognition systems. \n\nFigure 2: An HMM2: the emission distributions of the HMM are estimated by another HMM. \n\nFigure 3: Frequency deltas of log Rasta PLP and segmentation for an example of phoneme iy. \n\nAcknowledgments \n\nThe content and themes discussed in this paper largely benefited from the collaboration with our colleagues Andrew Morris, Astrid Hagen and Herve Glotin. This work was partly supported by the Swiss Federal Office for Education and Science (FOES) through the European SPHEAR (TMR, Training and Mobility of Researchers) and RESPITE (ESPRIT Long Term Research) projects. Additionally, Katrin Weber is supported by the Swiss National Science Foundation project MULTICHAN. \n\nReferences \n\n[1] Bengio, S., Bourlard, H., and Weber, K., \"An EM Algorithm for HMMs with Emission Distributions Represented by HMMs,\" IDIAP Research Report, IDIAP-RR00-11, Martigny, Switzerland, 2000. \n\n[2] Bourlard, H. and Dupont, S., \"A new ASR approach based on independent processing and combination of partial frequency bands,\" Proc. of Intl. Conf. on Spoken Language Processing (Philadelphia), pp. 422-425, October 1996. \n\n[3] Hagen, A., Morris, A., and Bourlard, H., \"Subband-based speech recognition in noisy conditions: The full combination approach,\" IDIAP Research Report no. IDIAP-RR-98-15, 1998. 
\n\n[4] Hagen, A., Morris, A., and Bourlard, H., \"Different weighting schemes in the full combination subbands approach for noise robust ASR,\" Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions (Tampere, Finland), May 25-26, 1999. \n\n[5] Hermansky, H., Pavel, M., and Tibrewala, S., \"Towards ASR using partially corrupted speech,\" Proc. of Intl. Conf. on Spoken Language Processing (Philadelphia), pp. 458-461, October 1996. \n\n[6] Hermansky, H. and Sharma, S., \"Temporal patterns (TRAPS) in ASR of noisy speech,\" Proc. of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Phoenix, AZ), pp. 289-292, March 1999. \n\n[7] Lippmann, R.P. and Carlson, B.A., \"Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise,\" Proc. Eurospeech '97 (Rhodes, Greece, September 1997), pp. KN37-40. \n\n[8] Mirghafori, N. and Morgan, N., \"Transmissions and transitions: A study of two common assumptions in multi-band ASR,\" Intl. IEEE Conf. on Acoustics, Speech, and Signal Processing (Seattle, WA, May 1997), pp. 713-716. \n\n[9] Morris, A.C., Cooke, M.P., and Green, P.D., \"Some solutions to the missing features problem in data classification, with application to noise robust ASR,\" Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 737-740, 1998. \n\n[10] Morris, A.C., Hagen, A., and Bourlard, H., \"The full combination subbands approach to noise robust HMM/ANN-based ASR,\" Proc. of Eurospeech '99 (Budapest, Sep. 99). \n\n[11] Okawa, S., Bocchieri, E., and Potamianos, A., \"Multi-band speech recognition in noisy environment,\" Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998. \n\n[12] Rao, S. and Pearlman, W.A., \"Analysis of linear prediction, coding, and spectral estimation from subbands,\" IEEE Trans. on Information Theory, vol. 42, pp. 
1160-1178, July 1996. \n\n[13] Tomlinson, M.J., Russell, M.J., Moore, R.K., Bucklan, A.P., and Fawley, M.A., \"Modelling asynchrony in speech using elementary single-signal decomposition,\" Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Munich), pp. 1247-1250, April 1997. \n\n[14] Weber, K., Bengio, S., and Bourlard, H., \"HMM2 - Extraction of Formant Features and their Use for Robust ASR,\" IDIAP Research Report, IDIAP-RR00-42, Martigny, Switzerland, 2000. ", "award": [], "sourceid": 1917, "authors": [{"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Katrin", "family_name": "Weber", "institution": null}]}