{"title": "Exploratory Feature Extraction in Speech Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 247, "abstract": null, "full_text": "Exploratory Feature Extraction in Speech Signals \n\nNathan Intrator \n\nCenter for Neural Science \n\nBrown University \n\nProvidence, RI 02912 \n\nAbstract \n\nA novel unsupervised neural network for dimensionality reduction which seeks directions emphasizing multimodality is presented, and its connection to exploratory projection pursuit methods is discussed. This leads to a new statistical insight into the synaptic modification equations governing learning in Bienenstock, Cooper, and Munro (BCM) neurons (1982). The importance of a dimensionality reduction principle based solely on distinguishing features is demonstrated using a linguistically motivated phoneme recognition experiment, and compared with feature extraction using a back-propagation network. \n\n1 Introduction \n\nDue to the curse of dimensionality (Bellman, 1961) it is desirable to extract features from a high-dimensional data space before attempting a classification. How to perform this feature extraction/dimensionality reduction is not that clear. A first simplification is to consider only features defined by linear (or semi-linear) projections of the high-dimensional data. This class of features is used in projection pursuit methods (see review in Huber, 1985). \n\nEven after this simplification, it is still difficult to characterize what interesting projections are, although it is easy to point at projections that are uninteresting. A statement that has recently been made precise by Diaconis and Freedman (1984) says that for most high-dimensional clouds, most low-dimensional projections are approximately normal. 
This finding suggests that the important information in the data is conveyed in those directions whose single-dimensional projected distribution is far from Gaussian, especially at the center of the distribution. Friedman (1987) argues that the most computationally attractive measures for deviation from normality (projection indices) are based on polynomial moments. However, they very heavily emphasize departure from normality in the tails of the distribution (Huber, 1985). Second-order polynomials (measuring the variance, i.e., principal components) are not sufficient to characterize the important features of a distribution (see the example in Duda & Hart (1973), p. 212); therefore higher-order polynomials are needed. We shall be using the observation that high-dimensional clusters translate to multimodal low-dimensional projections, so if we are after such structures, measuring multimodality defines an interesting projection. In some special cases, where the data is known in advance to be bimodal, it is relatively straightforward to define a good projection index (Hinton & Nowlan, 1990). When the structure is not known in advance, defining a general multimodal measure of the projected data is not straightforward, and will be discussed in this paper. \n\nThere are cases in which it is desirable to make the projection index invariant under certain transformations, and maybe even remove second-order structure (see Huber, 1985, for desirable invariant properties of projection indices). In such cases it is possible to make such transformations beforehand (Friedman, 1987), and then assume that the data already possesses these invariant properties. 
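The Diaconis-Freedman observation can be illustrated numerically. The following sketch is not part of the original paper; the two-cluster data in R^100 and the kurtosis-based check are illustrative assumptions. The projection onto the cluster-separating direction is strongly bimodal (sample kurtosis well below the Gaussian value of 3), while a random projection of the same cloud looks approximately normal:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 4000

# Two well-separated Gaussian clusters in R^d; by construction the
# separating direction is the first coordinate axis (a toy assumption).
u = np.zeros(d)
u[0] = 1.0
signs = rng.choice([-1.0, 1.0], size=n)
X = rng.standard_normal((n, d)) + 4.0 * signs[:, None] * u

def kurtosis(c):
    # Sample kurtosis: about 3 for a Gaussian, well below 3 for a
    # symmetric bimodal mixture.
    c = (c - c.mean()) / c.std()
    return float(np.mean(c ** 4))

v = rng.standard_normal(d)
v /= np.linalg.norm(v)          # a random projection direction
kurt_sep = kurtosis(X @ u)      # bimodal: far from Gaussian
kurt_rand = kurtosis(X @ v)     # approximately Gaussian
```

Kurtosis is only one crude moment-based index; the paper's point is precisely that a dedicated multimodality-seeking index is preferable, but the contrast between the two directions already shows why interesting structure hides from most projections.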
\n\n2 Feature Extraction using ANN \n\nIn this section, the intuitive idea presented above is used to form a statistically plausible objective function whose minimizers are those projections having a single-dimensional projected distribution that is far from Gaussian. This is done using a loss function whose expected value leads to the desired projection index. Mathematical details are given in Intrator (1990). \n\nBefore presenting this loss function, let us review some necessary notation and assumptions. Consider a neuron with input vector x = (x_1, ..., x_N), synaptic weight vector m = (m_1, ..., m_N), both in R^N, and activity (in the linear region) c = x·m. Define the threshold Θ_m = E[(x·m)²], and the functions φ(c, Θ_m) = c² - (4/3)cΘ_m, φ̂(c, Θ_m) = c² - (2/3)cΘ_m. The φ function has been suggested as a biologically plausible synaptic modification function that explains visual cortical plasticity (Bienenstock, Cooper and Munro, 1982). Note that at this point c represents the linear projection of x onto m, and we seek an optimal projection in some sense. \n\nWe want to base our projection index on polynomial moments of low order, and to use the fact that a bimodal distribution is already interesting, and any additional mode should make the distribution even more interesting. With this in mind, consider the following family of loss functions, which depend on the synaptic weight vector and on the input x: L_m(x) = -(μ/3)(x·m)²[(x·m) - Θ_m]. \n\nThe motivation for this loss function can be seen in the following graph, which represents the φ function and the associated loss function L_m(x). For simplicity, the loss for a fixed threshold Θ_m and synaptic vector m can be written as L_m(c) = -(μ/3)c²(c - Θ_m), where c = x·m. \n\nFigure 1: The function φ and the loss function for a fixed m and Θ_m. 
\n\nThe graph of the loss function shows that for any fixed m and Θ_m, the loss is small for a given input x when either (x·m) is close to zero or when (x·m) is larger than Θ_m. Moreover, the loss function remains negative for (x·m) > Θ_m; therefore, any kind of distribution on the right-hand side of Θ_m is possible, and the preferred ones are those which are concentrated further away from Θ_m. \n\nWe must still show why it is not possible that a minimizer of the average loss will be such that all the mass of the distribution is concentrated in one of the regions. Roughly speaking, this cannot happen because the threshold Θ_m is dynamic and depends on the projections in a nonlinear way, namely, Θ_m = E[(x·m)²]. This implies that Θ_m will always move itself to a stable point such that the distribution will not be concentrated on only one of its sides. It follows that the part of the distribution with c < Θ_m has a high loss, making those distributions in which the part with c < Θ_m has its mode at zero more plausible. \n\nThe risk (expected value of the loss) is given by: \n\nR_m = -(μ/3){E[(x·m)³] - E²[(x·m)²]}. \n\nSince the risk is continuously differentiable, its minimization can be achieved via a gradient descent method with respect to m, namely: \n\ndm_i/dt = -∂R_m/∂m_i = μE[φ(x·m, Θ_m)x_i]. \n\nThe resulting differential equations suggest a modified version of the law governing synaptic weight modification in the BCM theory for learning and memory (Bienenstock, Cooper and Munro, 1982). This theory was presented to account for various experimental results in visual cortical plasticity. The biological relevance of the theory has been extensively studied (Saul et al., 1986; Bear et al., 1987; Cooper et al., 1987; Bear et al., 1988), and it was shown that the theory is in agreement with the classical deprivation experiments (Clothiaux et al., 1990). 
\n\nThe fact that the distribution has part of its mass on both sides of Θ_m makes this loss a plausible projection index that seeks multimodality. However, we still need to reduce the sensitivity of the projection index to outliers, and, for full generality, allow any projected distribution to be shifted so that the part of the distribution that satisfies c < Θ_m will have its mode at zero. The over-sensitivity to outliers is addressed by considering a nonlinear neuron in which the neuron's activity is defined to be c = q(x·m), where q usually represents a smooth sigmoidal function. A more general definition, which allows symmetry breaking of the projected distributions, provides a solution to the second problem raised above, and is still consistent with the statistical formulation, is c = q(x·m - a), for an arbitrary threshold a which can be found by gradient descent as well. For the nonlinear neuron, Θ_m is defined to be Θ_m = E[q²(x·m)]. \n\nBased on this formulation, a network of Q identical nodes may be constructed. All the neurons in this network receive the same input and inhibit each other, so as to extract several features in parallel. A similar network has been studied in the context of mean field theory by Scofield and Cooper (1985). The activity of neuron k in the network is defined as c_k = q(x·m_k - a_k), where m_k is the synaptic weight vector of neuron k, and a_k is its threshold. The inhibited activity and threshold of the k'th neuron are given by c̃_k = c_k - η Σ_{j≠k} c_j, and Θ̃_m^k = E[c̃_k²]. We omit the derivation of the synaptic modification equations, which is similar to the one for a single neuron, and present only the resulting modification equations for a synaptic vector m_k in a lateral inhibition network of nonlinear neurons: \n\ndm_k/dt = μE{φ(c̃_k, Θ̃_m^k)(q'(x·m_k - a_k) - η Σ_{j≠k} q'(x·m_j - a_j))x}. 
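The network equations can be transcribed almost directly into a batch-averaged update (a sketch under stated assumptions: q = tanh, random toy data, small constant rates, and a crude gradient step for the thresholds a_k, since the paper only notes they follow gradient descent as well):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, Q = 500, 20, 3
eta_lr, eta_inh = 0.02, 0.2   # learning rate and inhibition strength (assumed)

X = rng.standard_normal((n, d))
X[:n // 2, 0] += 4.0                     # toy two-cluster input (assumed)
M = 0.05 * rng.standard_normal((Q, d))   # synaptic vectors m_k, one row per neuron
a = np.zeros(Q)                          # per-neuron shift thresholds a_k

def phi(c, theta):
    return c ** 2 - (4.0 / 3.0) * c * theta

for _ in range(200):
    u = X @ M.T - a                  # pre-activations x.m_k - a_k, shape (n, Q)
    c = np.tanh(u)                   # activities c_k = q(x.m_k - a_k), q = tanh
    qp = 1.0 - c ** 2                # derivative of tanh at u
    c_inh = c - eta_inh * (c.sum(axis=1, keepdims=True) - c)    # inhibited activities
    theta = np.mean(c_inh ** 2, axis=0)                         # one threshold per neuron
    gate = qp - eta_inh * (qp.sum(axis=1, keepdims=True) - qp)  # derivative term minus inhibition
    grad = phi(c_inh, theta) * gate
    M += eta_lr * (grad.T @ X) / n       # batch-averaged modification equation
    a -= eta_lr * np.mean(grad, axis=0)  # crude gradient step for a_k (an assumption)
```

The `keepdims` trick computes the sum over all neurons j ≠ k without an explicit loop; each neuron is thus pushed away from structure already captured by the others.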
\n\nThe lateral inhibition network performs a direct search for Q-dimensional projections simultaneously, and therefore may find a richer structure that a stepwise approach may miss (see, e.g., Example 14.1 in Huber, 1985). \n\n3 Comparison with other feature extraction methods \n\nWhen dealing with a classification problem, the interesting features are those that distinguish between classes. The network presented above has been shown to seek multimodality in the projected distributions, which translates to clusters in the original space, and therefore to find those directions that make a distinction between different sets in the training data. \n\nIn this section we compare the classification performance of a network that performs dimensionality reduction (before the classification) based upon multimodality, and a network that performs dimensionality reduction based upon minimization of misclassification error (using back-propagation with an MSE criterion). This is done using a phoneme classification experiment whose linguistic motivation is described below. In the latter case we regard the hidden-unit representation as a new, reduced feature representation of the input space. Classification on the new feature space was done using back-propagation.¹ \n\n¹See Intrator (1990) for a comparison with principal components feature extraction and with k-NN as a classifier. \n\nConsider the six stop consonants [p,k,t,b,g,d], which have been a subject of recent research in evaluating neural networks for phoneme recognition (see review in Lippmann, 1989). According to phonetic feature theory, these stops possess several common features, but only two distinguishing phonetic features, place of articulation and voicing (see Blumstein & Lieberman, 1984, for a review and related references on phonetic feature theory). 
This theory suggests an experiment in which features extracted from unvoiced stops can be used to distinguish place of articulation in voiced stops as well. It is of interest whether these features can be found from a single speaker, how sensitive they are to voicing, and whether they are speaker invariant. \n\nThe speech data consists of 20 consecutive time windows of 32 msec with 30 msec overlap, aligned to the beginning of the burst. In each time window, a set of 22 energy levels is computed. These energy levels correspond to Zwicker critical band filters (Zwicker, 1961). The consonant-vowel (CV) pairs were pronounced in isolation by native American speakers (two males, BSS and LTN, and one female, JES). Additional details on the biological motivation for the preprocessing, and on the linguistic motivation related to child language acquisition, can be found in Seebach (1990) and in Seebach and Intrator (1991). An average (over 25 tokens) of the six stop consonants followed by the vowel [a] is presented in Figure 2. All the images are smoothed using a moving average. One can see some similarities between the voiced and unvoiced stops, especially in the upper left corner of the image (high frequencies, beginning of the burst), and the radical difference between them in the low frequencies. \n\nFigure 2: An average of the six stop consonants followed by the vowel [a]. Their order from left to right: [pa] [ba] [ka] [ga] [ta] [da]. Time increases from the burst release on the X axis, and frequency increases on the Y axis. \n\nIn the experiments reported here, 5 features were extracted from the 440-dimensional original space. Although the dimensionality reduction methods were trained only with the unvoiced tokens of a single speaker, the classifier was trained on (5-dimensional) voiced and unvoiced data from the other speakers as well. 
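The preprocessing described above can be sketched as follows. This is illustrative, not the paper's code: the sampling rate, the FFT-based band energies, and the burst-aligned framing are assumptions; the band edges are the standard Bark-scale boundaries after Zwicker (1961). Twenty 32 ms windows with a 2 ms hop (30 ms overlap) and 22 critical-band energies give the 20 x 22 = 440-dimensional representation:

```python
import numpy as np

fs = 16000                                    # sampling rate (assumed; not stated in the paper)
win, hop = int(0.032 * fs), int(0.002 * fs)   # 32 ms windows, 30 ms overlap -> 2 ms hop

# First 23 critical-band (Bark) edges after Zwicker (1961), in Hz -> 22 bands.
edges = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
         1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500]

def band_energies(x):
    # Energy per critical band for each of 20 windows aligned to the burst.
    feats = []
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    for t in range(20):
        frame = x[t * hop: t * hop + win]
        spec = np.abs(np.fft.rfft(frame)) ** 2
        row = [spec[(freqs >= lo) & (freqs < hi)].sum()
               for lo, hi in zip(edges[:-1], edges[1:])]
        feats.append(row)
    return np.array(feats)

x = np.random.default_rng(2).standard_normal(20 * hop + win)  # stand-in for a CV token
F = band_energies(x)   # shape (20, 22): the 440-dimensional input
```

A real pipeline would align the first frame to the detected burst onset; here the stand-in signal only serves to show the shape of the representation.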
\n\nThe classification results, which are summarized in Table 1, show that the back-propagation network does well in finding structure useful for classification of the trained data, but this structure is more sensitive to voicing. Classification results using a BCM network suggest that, for this specific task, structure that is less sensitive to voicing can be extracted, even though voicing has significant effects on the speech signal itself. The results also suggest that these features are more speaker invariant. \n\nPlace of articulation classification (percent correct): \n\nBSS /p,k,t/: B-P 100, BCM 100 \nBSS /b,g,d/: B-P 83.4, BCM 94.7 \nLTN /p,k,t/: B-P 95.6, BCM 97.7 \nLTN /b,g,d/: B-P 78.3, BCM 93.2 \nJES (Both): B-P 88.0, BCM 99.4 \n\nTable 1: Percentage of correct classification of place of articulation in voiced and unvoiced stops. \n\nFigure 3: Synaptic weight images of the 5 hidden units of back-propagation (top), and of the 5 BCM neurons (bottom). \n\nThe difference in performance between the two feature extractors may be partially explained by looking at the synaptic weight vectors (images) extracted by both methods. For the back-propagation feature extraction it can be seen that although 5 units were used, fewer features were extracted. One of the main distinctions between the unvoiced stops in the training set is the high-frequency burst at the beginning of the consonant (the upper left corner). The back-propagation method concentrated mainly on this feature, probably because it is sufficient to base the recognition of the training set on this feature, and because training stops when the misclassification error falls to zero. 
On the other hand, the BCM method does not try to reduce the misclassification error, and is able to find a richer, linguistically meaningful structure, containing burst locations and formant tracking of the three different stops, which allowed a better generalization to other speakers and to voiced stops. \n\nThe network and its training paradigm present a different approach to speaker-independent speech recognition. In this approach the speaker variability problem is addressed by training a network that concentrates mainly on the distinguishing features of a single speaker, as opposed to training a network, on multi-speaker data, that concentrates on both the distinguishing and the common features. \n\nAcknowledgements \n\nI wish to thank Leon N Cooper for suggesting the problem and for providing many helpful hints and insights. Geoff Hinton made invaluable comments. The application of BCM to speech is discussed in more detail in Seebach (1990) and in a forthcoming article (Seebach and Intrator, 1991). Research was supported by the National Science Foundation, the Army Research Office, and the Office of Naval Research. \n\nReferences \n\nBellman, R. E. (1961) Adaptive Control Processes. Princeton, NJ: Princeton University Press. \n\nBienenstock, E. L., L. N Cooper, and P. W. Munro (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2:32-48. \n\nBear, M. F., L. N Cooper, and F. F. Ebner (1987) A Physiological Basis for a Theory of Synapse Modification. Science 237:42-48. \n\nDiaconis, P. and D. Freedman (1984) Asymptotics of Graphical Projection Pursuit. The Annals of Statistics 12:793-815. \n\nFriedman, J. H. (1987) Exploratory Projection Pursuit. Journal of the American Statistical Association 82:249-266. \n\nHinton, G. E. and S. J. 
Nowlan (1990) The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Computation. \n\nHuber, P. J. (1985) Projection Pursuit. The Annals of Statistics 13:435-475. \n\nIntrator, N. (1990) A Neural Network For Feature Extraction. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann. \n\nLippmann, R. P. (1989) Review of Neural Networks for Speech Recognition. Neural Computation 1:1-38. \n\nReilly, D. L., C. L. Scofield, L. N Cooper and C. Elbaum (1988) GENSEP: a multiple neural network with modifiable network topology. INNS Conference on Neural Networks. \n\nSaul, A. and E. E. Clothiaux (1986) Modeling and Simulation II: Simulation of a Model for Development of Visual Cortical Specificity. J. of Electrophysiological Techniques 13:279-306. \n\nScofield, C. L. and L. N Cooper (1985) Development and properties of neural networks. Contemp. Phys. 26:125-145. \n\nSeebach, B. S. (1990) Evidence for the Development of Phonetic Property Detectors in a Neural Net without Innate Knowledge of Linguistic Structure. Ph.D. Dissertation, Brown University. \n\nDuda, R. O. and P. E. Hart (1973) Pattern Classification and Scene Analysis. New York: John Wiley. \n\nZwicker, E. (1961) Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Journal of the Acoustical Society of America 33:248. \n", "award": [], "sourceid": 320, "authors": [{"given_name": "Nathan", "family_name": "Intrator", "institution": null}]}