{"title": "Speaker Recognition Using Neural Tree Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1035, "page_last": 1042, "abstract": null, "full_text": "SPEAKER RECOGNITION USING NEURAL TREE NETWORKS \n\nKevin R. Farrell and Richard J. Mammone \n\nCAIP Center, Rutgers University \nCore Building, Frelinghuysen Road \nPiscataway, New Jersey 08855 \n\nAbstract \n\nA new classifier is presented for text-independent speaker recognition. The \nnew classifier is called the modified neural tree network (MNTN). The \nNTN is a hierarchical classifier that combines the properties of decision \ntrees and feed-forward neural networks. The MNTN differs from the standard \nNTN in that a new learning rule based on discriminant learning \nis used, which minimizes the classification error as opposed to a norm \nof the approximation error. The MNTN also uses leaf probability measures \nin addition to the class labels. The MNTN is evaluated for several \nspeaker identification experiments and is compared to multilayer perceptrons \n(MLPs), decision trees, and vector quantization (VQ) classifiers. The \nVQ classifier and MNTN demonstrate comparable performance and perform \nsignificantly better than the other classifiers for this task. Additionally, \nthe MNTN provides a logarithmic saving in retrieval time over that \nof the VQ classifier. The MNTN and VQ classifiers are also compared \nfor several speaker verification experiments where the MNTN is found to \noutperform the VQ classifier. \n\n1 INTRODUCTION \n\nAutomatic speaker recognition consists of having a machine recognize a person based \non his or her voice. Automatic speaker recognition comprises two categories: \nspeaker identification and speaker verification. The objective of speaker identification \nis to identify a person within a fixed population based on a test utterance \nfrom that person. 
This contrasts with speaker verification, where the objective is \nto verify a person's claimed identity based on the test utterance. \n\nSpeaker recognition systems can be either text dependent or text independent. \nText-dependent speaker recognition systems require that the speaker utter a specific \nphrase or a given password. Text-independent speaker identification systems identify \nthe speaker regardless of the utterance. This paper focuses on text-independent \nspeaker identification and speaker verification tasks. \n\nA new classifier is introduced and evaluated for speaker recognition. The new classifier \nis the modified neural tree network (MNTN). The MNTN incorporates modifications \nto the learning rule of the original NTN [1] and also uses leaf probability \nmeasures in addition to the class labels. Also, vector quantization (VQ) classifiers, \nmultilayer perceptrons (MLPs), and decision trees are evaluated for comparison. \n\nThis paper is organized as follows. Section 2 reviews the neural tree network and \ndiscusses the modifications. Section 3 discusses the feature extraction and classification \nphases used here for text-independent speaker recognition. Section 4 describes \nthe database used and provides the experimental results. The summary and conclusions \nof the paper are given in Section 5. \n\n2 THE MODIFIED NEURAL TREE NETWORK \n\nThe NTN [1] is a hierarchical classifier that uses a tree architecture to implement \na sequential linear decision strategy. Each node at every level of the NTN divides \nthe input training vectors into a number of exclusive subsets of this data. The \nleaf nodes of the NTN partition the feature space into homogeneous subsets, i.e., \na single class at each leaf node. The NTN is recursively trained as follows. 
Given \na set of training data at a particular node, if all data within that node belongs to \nthe same class, the node becomes a leaf. Otherwise, the data is split into several \nsubsets, which become the children of this node. This procedure is repeated until \nall the data is completely uniform at the leaf nodes. \n\nFor each node the NTN computes the inner product of a weight vector $w$ and an \ninput feature vector $x$, which should be approximately equal to the output \nlabel $y \\in \\{0,1\\}$. Traditional learning algorithms minimize a norm of the error \n$\\epsilon = (y - \\langle w, x \\rangle)$, such as the $L_2$ or $L_1$ norm. The splitting algorithm of the \nmodified NTN is based on discriminant learning [2]. Discriminant learning uses a \ncost function that minimizes the classification error. \n\nFor an $M$ class NTN, the discriminant learning approach first defines a misclassification \nmeasure $d(x)$ as [2]: \n\n$$d(x) = -\\langle w_i, x \\rangle + \\left\\{ \\frac{1}{M-1} \\sum_{j \\neq i} \\langle w_j, x \\rangle^{n} \\right\\}^{1/n}, \\qquad (1)$$ \n\nwhere $n$ is a predetermined smoothing constant. If $x$ belongs to class $i$, then $d(x)$ \nwill be negative, and if $x$ does not belong to class $i$, $d(x)$ will be positive. The \nmisclassification measure $d(x)$ is then applied to a sigmoid to yield: \n\n$$g[d(x)] = \\frac{1}{1 + e^{-d(x)}}. \\qquad (2)$$ \n\nFigure 1: Forward Pruning and Confidence Measures (leaf labels 0, 1, 0, 1 with confidences 1.0, 0.6, 0.8, 0.7) \n\nThe cost function in equation (2) is approximately zero for correct classifications \nand one for misclassifications. Hence, minimizing this cost function will tend to \nminimize the classification error. 
It is noted that for binary NTNs, the weight \nupdates obtained by the discriminant learning approach and the $L_1$ norm of the \nerror are equivalent [3]. \n\nThe NTN training algorithm described above constructs a tree having 100% performance \non the training set. However, an NTN trained to this level may not have \noptimal generalization due to overtraining. The generalization can be improved \nby reducing the number of nodes in a tree, which is known as pruning. A technique \nknown as backward pruning was recently proposed [1] for the NTN. Given a \nfully grown NTN, i.e., 100% performance on the training set, the backward pruning \nmethod uses a Lagrangian cost function to minimize the classification error and \nthe number of leaves in the tree. The method used here prunes the tree during its \ngrowth, hence it is called forward pruning. \n\nThe forward pruning algorithm consists of simply truncating the growth of the tree \nbeyond a certain level. For the leaves at the truncated level, a vote is taken and \nthe leaf is assigned the label of the majority. In addition to a label, the leaf is \nalso assigned a confidence. The confidence is computed as the ratio of the number \nof elements for the vote winner to the total number of elements. The confidence \nprovides a measure of confusion for the different regions of feature space. The \nconcept of forward pruning is illustrated in Figure 1. \n\n3 FEATURE EXTRACTION AND CLASSIFICATION \n\nThe process of feature extraction consists of obtaining characteristic parameters of \na signal to be used to classify the signal. The extraction of salient features is a \nkey step in solving any pattern recognition problem. For speaker recognition, the \nfeatures extracted from a speech signal should be invariant with regard to the desired \nspeaker while exhibiting a large distance to any imposter. Cepstral coefficients are \ncommonly used for speaker recognition [4] and shall be considered here to evaluate \nthe classifiers. \n\nFigure 2: Classifier Structure for Speaker Recognition (the feature vector $x_i$ is applied to the models for Speaker 1 through Speaker N, each an NTN or VQ codebook; their outputs $y_i^1, \\ldots, y_i^N$ feed a decision stage that returns the speaker identity or authenticity) \n\nThe classification stage of text-independent speaker recognition is typically implemented \nby modeling each speaker with an individual classifier. The classifier structure \nfor speaker recognition is illustrated in Figure 2. Given a specific feature vector, \neach speaker model associates a number corresponding to the degree of match with \nthat speaker. The stream of numbers obtained for a set of feature vectors can be \nused to obtain a likelihood score for each speaker model. For speaker identification, \nthe feature vectors for the test utterance are applied to all speaker models and the \ncorresponding likelihood scores are computed. The speaker is selected as having \nthe largest score. For speaker verification, the feature vectors are applied only to \nthe speaker model for the speaker to be verified. If the likelihood score exceeds a \nthreshold, the speaker is verified; otherwise, the speaker is rejected. \n\nThe classifiers for the individual speaker models are trained using either supervised \nor unsupervised training methods. For supervised training methods the classifier for \neach speaker model is presented with the data for all speakers. Here, the extracted \nfeature vectors for that speaker are labeled as \"one\" and the extracted feature vectors \nfor everyone else are labeled as \"zero\". The supervised classifiers considered \nhere are the multilayer perceptron (MLP), decision trees, and modified neural tree \nnetwork (MNTN). 
For unsupervised training methods each speaker model is presented \nwith only the extracted feature vectors for that speaker. This data can \nthen be clustered to determine a set of centroids that are representative of that \nspeaker. The unsupervised classifiers evaluated here are the full-search and tree-structured \nvector quantization classifiers, henceforth denoted as FSVQ and TSVQ. \nSpeaker models based on supervised training capture the differences of that speaker \nto other speakers, whereas models based on unsupervised training use a similarity \nmeasure. \n\nSpecifically, a trained NTN can be applied to speaker recognition as follows. Given \na sequence of feature vectors $x$ from the test utterance and a trained NTN for \nspeaker $S_i$, the corresponding speaker score is found as the \"hit\" ratio: \n\n$$P_{NTN}(x \\mid S_i) = \\frac{M}{N + M}. \\qquad (3)$$ \n\nHere, $M$ is the number of vectors classified as \"one\" and $N$ is the number of vectors \nclassified as \"zero\". The modified NTN computes a hit ratio weighted by the \nconfidence scores: \n\n$$P_{MNTN}(x \\mid S_i) = \\frac{\\sum_{j=1}^{M} c_j^1}{\\sum_{j=1}^{N} c_j^0 + \\sum_{j=1}^{M} c_j^1}, \\qquad (4)$$ \n\nwhere $c^1$ and $c^0$ are the confidence scores for the speaker and antispeaker, respectively. \nThese scores can be used for decisions regarding identification or verification. \n\n4 EXPERIMENTAL RESULTS \n\n4.1 Database \n\nThe database considered for the speaker identification and verification experiments \nis a subset of the DARPA TIMIT database. This set represents 38 speakers of the \nsame (New England) dialect. The preprocessing of the TIMIT speech data consists \nof several steps. First, the speech is downsampled from 16 kHz to 8 kHz sampling \nfrequency. The downsampling is performed to obtain a toll quality signal. The \nspeech data is processed by a silence removing algorithm followed by the application \nof a pre-emphasis filter $H(z) = 1 - 0.95 z^{-1}$. 
A 30 ms Hamming window is applied to \nthe speech every 10 ms. A twelfth order linear predictive (LP) analysis is performed \nfor each speech frame. The features consist of the twelve cepstral coefficients derived \nfrom this LP polynomial. \n\nThere are 10 utterances for each speaker in the selected set. Five of the utterances \nare concatenated and used for training. The remaining five sentences are used \nindividually for testing. The duration of the training data ranges from 7 to 13 \nseconds per speaker and the duration of each test utterance ranges from 0.7 to 3.2 \nseconds. \n\n4.2 Speaker Identification \n\nThe first experiment is for closed set speaker identification using 10 and 20 speakers \nfrom the TIMIT New England dialect. The identification is closed set in that the \nspeaker is assumed to be one of the 10 or 20 speakers, i.e., no \"none of the above\" \noption. The NTN, MLP [5], and VQ [4] classifiers are each evaluated on this data in \naddition to the ID3 [6], C4 [7], CART [8], and Bayesian [9] decision trees. The VQ \nclassifier is trained using a K-means algorithm and tested for codebook sizes varying \nfrom 16 to 128 centroids. The MNTN used here is pruned at levels ranging from the \nfourth through seventh. The MLP is trained using the backpropagation algorithm \n[10] for architectures having 16, 32, and 64 hidden nodes (within one hidden layer). \nThe results are summarized in Table 1. The * denotes that the CART tree could \nnot be grown for the 20 speaker experiment due to memory limitations. \n\nTable 1: Speaker Identification Experiments (classifier rows include ID3, CART, and C4) \n\n4.3 Speaker Verification \n\nThe FSVQ classifier and MNTN are evaluated next for speaker verification. The \nfirst speaker verification experiment consists of 10 speakers and 10 imposters (i.e., \npeople not used in the training set). The second speaker verification experiment \nuses 20 enrolled speakers and 18 imposters. 
The MNTN is pruned at the seventh \nlevel (128 leaves) and the FSVQ classifier has a codebook size of 128 entries. \n\nSpeaker verification performance can be enhanced by using a technique known as \ncohort normalization [11]. Traditional verification systems accept a speaker if: \n\n$$p(X \\mid I) > T(I), \\qquad (5)$$ \n\nwhere $p(X \\mid I)$ is the likelihood that the sequence of feature vectors $X$ was generated \nby speaker $I$ and $T(I)$ is the corresponding likelihood threshold. Instead of using \nthe fixed threshold criterion in equation (5), an adaptive threshold can be used via \nthe likelihood measure: \n\n$$\\frac{p(X \\mid I)}{p(X \\mid \\bar{I})} > T(I). \\qquad (6)$$ \n\nHere, the speaker score is first normalized by the probability that the feature vectors \n$X$ were generated by a speaker other than $I$. The likelihood $p(X \\mid \\bar{I})$ can be estimated \nwith the scores of the speaker models that are closest to $I$, denoted as $I$'s cohorts \n[11]. This estimate can consist of a maximum, minimum, average, etc., depending \non the classifier used. \n\nThe thresholds for the VQ and MNTN likelihood scores are varied from the point \nof 0% false acceptance to 0% false rejection to yield the operating curves shown \nin Figures 3 and 4 for the 10 and 20 speaker populations, respectively. Note that \nall operating curves presented in this section for speaker verification represent the \nposterior performance of the classifiers, given the speaker and imposter scores. Here \nit can be seen that the MNTN and VQ classifiers are both improved by the cohort \nnormalized scores. The equal error rates for the MNTN and VQ classifier are \nsummarized in Table 2. \n\nFor both experiments (10 and 20 speakers), the MNTN provides better performance \nthan the VQ classifier, both with and without cohort normalization, for most of the \noperating curve. 
\n\nTable 2: Equal error rates for speaker verification (rows: 10 Speakers, 20 Speakers) \n\nFigure 3: Speaker Verification (10 Speakers). Horizontal axis: P(False Accept); legend entries include VQ and VQ with cohort. \n\nFigure 4: Speaker Verification (20 Speakers). Horizontal axis: P(False Accept); legend entries: VQ, VQ with cohort, MNTN, MNTN with cohort. \n\n5 CONCLUSION \n\nA new classifier called the modified NTN is examined for text-independent speaker \nrecognition. The performance of the MNTN is evaluated for several speaker recognition \nexperiments using various sized speaker sets from a 38 speaker corpus. The \nfeatures used to evaluate the classifiers are the LP-derived cepstrum. The MNTN is \ncompared to full-search and tree-structured VQ classifiers, multilayer perceptrons, \nand decision trees. The FSVQ and MNTN classifiers both demonstrate equivalent \nperformance for the speaker identification experiments and outperform the other \nclassifiers. For speaker verification, the MNTN consistently outperforms the FSVQ \nclassifier. In addition to performance advantages for speaker verification, the MNTN \nalso demonstrates a logarithmic saving in retrieval time over that of the FSVQ classifier. \nThis computational advantage can also be obtained by using TSVQ, although \nTSVQ will reduce the performance with respect to FSVQ. \n\n6 ACKNOWLEDGEMENTS \n\nThe authors gratefully acknowledge the support of Rome Laboratories, Contract \nNo. F30602-91-C-OI20. The decision tree simulations utilized the IND package \ndeveloped by W. Buntine of NASA. \n\nReferences \n\n[1] A. Sankar and R.J. Mammone. Growing and pruning neural tree networks. \nIEEE Transactions on Computers, C-42:221-229, March 1993. \n\n[2] S. Katagiri, B.H. Juang, and A. Biem. Discriminative feature extraction. 
In \nArtificial Neural Networks for Speech and Vision Processing, edited by R.J. \nMammone. Chapman and Hall, 1993. \n\n[3] K.R. Farrell. Speaker Recognition Using the Modified Neural Tree Network. \nPhD thesis, Rutgers University, Oct. 1993. \n\n[4] F.K. Soong, A.E. Rosenberg, L.R. Rabiner, and B.H. Juang. A vector quantization \napproach to speaker recognition. In Proceedings ICASSP, 1985. \n\n[5] J. Oglesby and J.S. Mason. Optimization of neural models for speaker identification. \nIn Proceedings ICASSP, pages 261-264, 1990. \n\n[6] J. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. \n\n[7] J. Quinlan. Simplifying decision trees in Knowledge Acquisition for Knowledge-Based \nSystems, by G. Gaines and J. Boose. Academic Press, London, 1988. \n\n[8] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and \nRegression Trees. Wadsworth International Group, Belmont, CA, 1984. \n\n[9] W. Buntine. Learning classification trees. Statistics and Computing, 2:63-73, \n1992. \n\n[10] D.E. Rumelhart and J.L. McClelland. Parallel Distributed Processing. MIT \nCambridge Press, Cambridge, MA, 1986. \n\n[11] A.E. Rosenberg, J. Delong, C.H. Lee, B.H. Juang, and F.K. Soong. The use of \ncohort normalized scores for speaker recognition. In Proc. ICSLP, Oct. 1992. \n", "award": [], "sourceid": 861, "authors": [{"given_name": "Kevin", "family_name": "Farrell", "institution": null}, {"given_name": "Richard", "family_name": "Mammone", "institution": null}]}