{"title": "Phonetic Speaker Recognition with Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1384, "abstract": "", "full_text": "Phonetic Speaker Recognition with Support\n\nVector Machines\n\nW. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek\n\nMIT Lincoln Laboratory\nLexington, MA 02420\n\nwcampbell,jpc,dar,daj,tleek@ll.mit.edu\n\nAbstract\n\nA recent area of signi\ufb01cant progress in speaker recognition is the use\nof high level features\u2014idiolect, phonetic relations, prosody, discourse\nstructure, etc. A speaker not only has a distinctive acoustic sound but\nuses language in a characteristic manner. Large corpora of speech data\navailable in recent years allow experimentation with long term statistics\nof phone patterns, word patterns, etc. of an individual. We propose the\nuse of support vector machines and term frequency analysis of phone se-\nquences to model a given speaker. To this end, we explore techniques\nfor text categorization applied to the problem. We derive a new kernel\nbased upon a linearization of likelihood ratio scoring. We introduce a\nnew phone-based SVM speaker recognition approach that halves the er-\nror rate of conventional phone-based approaches.\n\n1 Introduction\n\nWe consider the problem of text-independent speaker veri\ufb01cation. That is, given a claim of\nidentity and a voice sample (whose text content is a priori unknown), determine if the claim\nis correct or incorrect. Traditional speaker recognition systems use features based upon the\nspectral content (e.g., cepstral coef\ufb01cients) of the speech. Implicitly, these systems model\nthe vocal tract and its associated dynamics over a short time period. These approaches have\nbeen quite successful, see [1, 2].\n\nTraditional systems have several drawbacks. 
First, robustness is an issue because channel effects can dramatically change the measured acoustics of a particular individual. For instance, a system relying only on acoustics might have difficulty confirming that an individual speaking on a land-line telephone is the same as an individual speaking on a cell phone [3]. Second, traditional systems also rely upon seemingly different methods than human listeners [4]. Human listeners are aware of prosody, word choice, pronunciation, accent, and other speech habits (laughs, etc.) when recognizing speakers. Potentially because of this use of higher-level cues, human listeners are less affected by variation in channel and noise types than automatic algorithms.

An exciting area of recent development pioneered by Doddington [5] is the use of "high level" features for speaker recognition. In Doddington's idiolect work, word N-grams from conversations were used to characterize a particular speaker. More recent systems have used a variety of approaches involving phone sequences [6], pronunciation modeling [7], and prosody [8]. For this paper, we concentrate on the use of phone sequences [6]. The processing for this type of system uses acoustic information to obtain sequences of phones for a given conversation and then discards the acoustic waveform. Thus, processing is done at the level of terms (symbols) consisting of, for example, phones or phone N-grams.

This paper is organized as follows. In Section 2, we discuss the NIST extended data speaker recognition task. In Section 3.1, we present a method for obtaining a phone stream. Section 3.2 shows the structure of the SVM phonetic speaker recognition system. Section 4 discusses how we construct a kernel for speaker recognition using term-weighting techniques for document classification. We derive a new kernel based upon a linearization of a likelihood ratio. 
Finally, Section 5 shows the application of our methods and illustrates the dramatic improvement in performance possible over standard phone-based N-gram speaker recognition methods.

2 The NIST extended data task

Experiments for phone-based speaker recognition were performed based upon the NIST 2003 extended data task [9]. The corpus used was a combination of phases II and III of the Switchboard-2 corpora [10].

Each potential training utterance in the NIST extended data task consisted of a conversation side that was nominally of length 5 minutes recorded over a land-line telephone. Each conversation side consisted of a speaker having a conversation on a topic selected by an automatic operator; conversations were typically between unfamiliar individuals.

For training and testing, a jackknife approach was used to increase the number of tests. The data was divided into 10 splits. For training, a given split contains speakers to be recognized (target speakers) and impostor speakers; the remaining splits could be used to construct models describing the statistics of the general population—a "background" model. For example, when conducting tests on split 1, splits 2-10 could be used to construct a background.

Training a speaker model was performed by using statistics from 1, 2, 4, 8, or 16 conversation sides. This simulated a situation where the system could use longer-term statistics and become "familiar" with the individual; this longer-term training allows one to explore techniques which might mimic more closely what human listeners perceive about an individual's speech habits. A large number of speakers and tests were available; for instance, for 8-conversation training, 739 distinct target speakers were used, and 11,171 true trials and 17,736 false trials were performed. 
For additional information on the training/testing structure, we refer to the NIST extended data task description [9].

3 Speaker Recognition with Phone Sequences

3.1 Phone Sequence Extraction

Phone sequence extraction for the speaker recognition process is performed using the phone recognition system (PPRLM) designed by Zissman [11] for language identification. PPRLM uses a mel-frequency cepstral coefficient front end with delta coefficients. Each phone is modeled in a gender-dependent, context-independent (monophone) manner using a three-state hidden Markov model (HMM). Phone recognition is performed with a Viterbi search using a fully connected null-grammar network of monophones; note that no explicit language model is used in the decoding process.

The phone recognition system encompassed multiple languages—English (EG), German (GE), Japanese (JA), Mandarin (MA), and Spanish (SP). In earlier phonetic speaker recognition work [6], it was found that these multiple streams were useful for improving accuracy. The phone recognizers were trained using the OGI multilanguage corpus, which had been hand labeled by native speakers.

After a "raw" phone stream was obtained from PPRLM, additional processing was performed to increase robustness. First, speech activity detection marks were used to eliminate phone segments where no speech was present. Second, silence labels of duration greater than 0.5 seconds were replaced by "end start" pairs. The idea in this case is to capture some of the ways in which a speaker interacts with others—does the speaker pause frequently, etc. Third, extraneous silence was removed at the beginning and end of the resulting segments. Finally, all phones with short duration (less than 3 frames) were removed.

3.2 Phonetic SVM System

Our system for speaker recognition using phone sequences is shown in Figure 1. The scenario for its usage is as follows. 
An individual makes a claim of identity. The system then retrieves the SVM models of the claimed identity for each of the languages in the system. Speech from the individual is then collected (a test utterance). A phone sequence is derived using each of the language phone recognizers, and post-processing is then performed on the sequence as discussed in Section 3.1. After this step, the phone sequence is vectorized by computing frequencies of N-grams—this process will be discussed in Section 4. We call this term calculation, since in this step we compute term types (unigram, bigram, etc.), term probabilities, and weightings [12]. This vector is then introduced into an SVM using the speaker's model in the appropriate language, and a score per language is produced. Note that we do not threshold the output of the SVM. These scores are then fused using a linear weighting to produce a final score for the test utterance. The final score is compared to a threshold, and a reject or accept decision is made based upon whether the score was below or above the threshold, respectively.

An interesting aspect of the system in Figure 1 is that it uses multiple streams of phones in different languages. There are several reasons for this strategy. First, the system can be used without modification for speakers in multiple languages. Second, although not obvious, from experimentation we show that phone streams different from the language being spoken provide complementary information for speaker recognition. That is, accuracy improves with these additional systems. A third point is that the system may also work in other languages not represented in the phone streams. 
It is known that in the case of language identification, language characterization can be performed even if a phone recognizer is not available in that particular language [11].

[Figure: parallel per-language chains (phone recognizer, phone post-processing, term calculation, speaker model SVM) for EG, GE, ..., SP, with the per-language scores fused into a single score.]

Figure 1: Phonetic speaker recognition using support vector machines

Training for the system in Figure 1 is based upon the structure of the NIST extended data corpus (see Section 2). We treat each conversation side in the corpus as a "document." From each of these conversation sides we derive a single (sparse) vector of weighted probabilities. To train a model for a given speaker, we use a one-versus-all strategy. The speaker's conversations are trained to an SVM target value of +1. The conversation sides not in the current split (see Section 2) are used as a background; that is, all conversation sides not in the current split are used as the class for SVM target value of -1. Note that this strategy ensures that speakers that are used as impostors are "unseen" in the training data.

4 Kernel Construction

Possibly the most important aspect of the process of phonetic speaker recognition is the selection of the kernel for the SVM. Of particular interest is a kernel which will preserve the identity cues a particular individual might present in their phone sequence. We describe two steps for kernel construction.

Our first step of kernel construction is the selection of probabilities to describe the phone stream. We follow the work of [5, 6]. 
The basic idea is to use a "bag of N-grams" approach. For a phone sequence, we produce N-grams by the standard transformation of the stream; e.g., for bigrams (2-grams) the sequence of phones, t_1, t_2, ..., t_n, is transformed to the sequence of bigrams of phones t_1 t_2, t_2 t_3, ..., t_{n-1} t_n. We then find probabilities of N-grams with N fixed. That is, suppose we are considering unigrams and bigrams of phones, and the unique unigrams and bigrams are designated d_1, ..., d_M and d_1 d_1, ..., d_M d_M, respectively; then we calculate probabilities and joint probabilities

p(d_i) = #(t_k = d_i) / \sum_k #(t_k = d_k),    p(d_i d_j) = #(t_k t_l = d_i d_j) / \sum_{k,l} #(t_k t_l = d_k d_l)    (1)

where #(t_k = d_i) indicates the number of phones in the conversation side equal to d_i, and an analogous definition is used for bigrams. These probabilities then become entries in a vector v describing the conversation side

v = [p(d_1) ... p(d_M) p(d_1 d_1) ... p(d_M d_M)]^t.    (2)

In general, the vector v will be sparse since the conversation side will not contain all potential unigrams, bigrams, etc.

A second step of kernel construction is the selection of the "document component" of term weighting for the entries of the vector v in (2) and the normalization of the resulting vector. By term weighting we mean that for each entry, v_i, of the vector v, we multiply by a "collection" (or background) component, w_i, for that entry. We tried two distinct approaches for term weighting.

TFIDF weighting. The first is based upon the standard TFIDF approach [12, 13]. From the background section of the corpus we compute the frequency of a particular N-gram using conversation sides as the item analogous to a document. 
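As a concrete illustration of the term calculation in (1) and (2), the sketch below computes unigram and bigram probabilities for a single conversation side as a sparse mapping from N-grams to probabilities. The function name and dictionary representation are our own illustration, not from the paper.

```python
from collections import Counter

def term_probabilities(phones):
    """Unigram and bigram probabilities for one conversation side,
    following (1); only observed N-grams appear (a sparse vector)."""
    unigrams = Counter(phones)
    bigrams = Counter(zip(phones, phones[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    # Unigrams are keyed as 1-tuples, bigrams as 2-tuples.
    v = {(p,): c / n_uni for p, c in unigrams.items()}
    v.update({b: c / n_bi for b, c in bigrams.items()})
    return v
```

For a stream such as ["aa", "t", "aa"], this yields p(aa) = 2/3 and p(aa t) = 1/2, matching the counts-over-totals form of (1).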
I.e., if we let DF(t_i) be the number of conversation sides where a particular N-gram, t_i, is observed, then our resulting term-weighted vector has entries

v_i log( (# of conversation sides in background) / DF(t_i) ).    (3)

We follow the weighting in (3) by a normalization of the vector to unit length, x -> x/||x||_2.

Log-likelihood ratio weighting. An alternate method of term weighting may be derived using the following strategy. Suppose that we have two conversation sides from speakers spk1 and spk2. Further suppose that the sequences of N-grams (for fixed N) in the two conversation sides are t_1, t_2, ..., t_n and u_1, u_2, ..., u_m, respectively. We denote the unique set of N-grams as d_1, ..., d_M. We can build a "model" based upon the conversation sides for each speaker consisting of the probability of N-grams, p(d_i | spk_j). We then compute the likelihood ratio of the first conversation side as is standard in verification problems [1]; a linearization of the likelihood ratio computation will serve as the kernel. Proceeding,

p(t_1, t_2, ..., t_n | spk2) / p(t_1, ..., t_n | background) = \prod_{i=1}^{n} p(t_i | spk2) / p(t_i | background)    (4)

where we have made the assumption that the probabilities are independent. 
We then consider the log of the likelihood ratio normalized by the number of observations,

score = (1/n) \sum_{i=1}^{n} log( p(t_i | spk2) / p(t_i | background) )
      = \sum_{j=1}^{M} ( #(t_i = d_j) / n ) log( p(d_j | spk2) / p(d_j | background) )
      = \sum_{j=1}^{M} p(d_j | spk1) log( p(d_j | spk2) / p(d_j | background) ).    (5)

If we now "linearize" the log function in (5) by using log(x) ≈ x - 1, we get

score ≈ \sum_{j=1}^{M} p(d_j | spk1) p(d_j | spk2) / p(d_j | background) - \sum_{j=1}^{M} p(d_j | spk1)
      = \sum_{j=1}^{M} p(d_j | spk1) p(d_j | spk2) / p(d_j | background) - 1
      = \sum_{j=1}^{M} [ p(d_j | spk1) / sqrt(p(d_j | background)) ] [ p(d_j | spk2) / sqrt(p(d_j | background)) ] - 1,    (6)

where the second equality uses \sum_j p(d_j | spk1) = 1. Thus, (6) suggests we use a term weighting given by 1/sqrt(p(d_j | background)). Note that the strategy used for constructing a kernel is part of a general process of finding kernels based upon training on one instance and testing upon another instance [2].

5 Experiments

Experiments were performed using the NIST extended data task "v1" lists (which encompass the entire Switchboard-2 phase II and III corpora). Tests were performed for 1, 2, 4, 8, and 16 training conversations. Scoring was performed using the SVM system shown in Figure 1. Five language phone recognizers were used—English (EG), German (GE), Japanese (JA), Mandarin (MA), and Spanish (SP). The resulting phone sequences were vectorized as unigram and bigram probabilities (2). Both the standard TFIDF term weighting (3) and the log-likelihood ratio (TFLLR) term weighting (6) were used. We note that when a term did not appear in the background, it was ignored in training and scoring. A linear kernel, x · y + 1, was used to compare the vectors of term weights. Training was performed using the SVMTorch package [14] with c = 1. 
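The TFLLR weighting of (6) and the linear kernel used above can be sketched on sparse term-probability vectors as follows; the function names and dictionary representation are our own, and this is an illustration rather than the authors' implementation.

```python
import math

def tfllr_weighted(term_probs, background_probs):
    """Scale each term probability by 1/sqrt(p(term | background)),
    the TFLLR weighting of (6).  Terms absent from the background
    are ignored, as in the experiments described in the text."""
    return {t: p / math.sqrt(background_probs[t])
            for t, p in term_probs.items() if t in background_probs}

def linear_kernel(u, v):
    """Sparse dot product plus one: x . y + 1."""
    return sum(u[t] * v[t] for t in set(u) & set(v)) + 1.0
```

With a background probability of 0.25 for a term, its weight is 1/sqrt(0.25) = 2, so a term probability of 0.5 becomes 1.0 in the weighted vector.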
Comparisons of performance for different strategies were typically done with 8-conversation training and English phone streams, since these were representative of overall performance.

Table 1: Comparison of different term weighting strategies, English-only scores, 8-conversation training

Term Weighting Method | EER
TFIDF                 | 7.4%
TFLLR                 | 5.2%

Results were compared via equal error rates (EERs)—the error at the threshold which produces equal miss and false alarm probabilities, P_miss = P_fa. Table 1 shows the results for the two different weightings, TFIDF (3) and TFLLR (6), using English phones only and 8 training conversations. The table illustrates that the new TFLLR weighting method is more effective. This may be due to the fact that the IDF is too "smooth"; e.g., for unigrams, the IDF is approximately 1 since a unigram almost always appears in a given 5-minute conversation. Also, alternate methods of calculating the TF component of TFIDF have not been explored and may yield gains compared to our formulation.

We next considered the effect on performance of the language of the phone stream for the 8-conversation training case. Figure 2 shows a DET plot (an ROC plot with a special scale [15]) with results corresponding to the 5 language phone streams. The best performing system in the figure is an equal fusion of all scores from the SVM outputs for each language and has an EER of 3.5%; other fusion weightings were not explored in detail. Note that the best performing language is English, as expected. 
Note, though, that as we indicated in Section 3.1, other languages do provide significant speaker recognition information.

Figure 2: DET plot for the 8-conversation training case with varying languages and TFLLR weighting. The plot shows, in order of increasing EER: fused scores, EG, MA, GE, JA, SP.

Table 2: Comparison of equal error rates (EERs) for different conversation training lengths using the TFLLR phonetic SVM and the standard log likelihood ratio (LLR) method

# Training Conversations | SVM EER | LLR EER | SVM EER Reduction
1                        | 13.4%   | 21.8%   | 38%
2                        | 8.6%    | 14.9%   | 42%
4                        | 5.3%    | 10.3%   | 49%
8                        | 3.5%    | 8.8%    | 60%
16                       | 2.5%    | 8.3%    | 70%

Figure 3: DET plot for 8-conversation training showing a comparison of the SVM approach (solid line) to the standard log likelihood ratio approach using bigrams (dash-dot line) and the standard log likelihood ratio approach using trigrams (dashed line)

Table 2 shows the effect of different training conversation lengths on the EER. As expected, more training data leads to lower error rates. We also see that even for 1 training conversation, the SVM system provides significant speaker characterization ability. 
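Since all results above are reported as EERs, a minimal sketch of how an EER can be estimated from lists of target and impostor scores may be useful; this is an illustration of the metric, not the NIST scoring tooling, and the function name is ours.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: sweep thresholds at the observed scores and
    return the operating point where the miss and false-alarm rates
    are closest (P_miss = P_fa at the EER threshold)."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2)
    return best[1]
```

For perfectly separated score lists the EER is 0; with overlap it rises toward the chance level, which is how the entries in Tables 1 and 2 should be read.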
Figure 3 shows DET plots comparing the performance of the standard log likelihood ratio method [6] to our new SVM method using the TFLLR weighting. We show log likelihood results based on both bigrams and trigrams; in addition, a slightly more complex model involving discounting of probabilities is used. One can see the dramatic reduction in error, especially apparent at low false alarm probabilities. The EERs of the standard system are 8.8% (trigrams, see Table 2) and 10.4% (bigrams), whereas our new SVM system produces an EER of 3.5%; thus, we have reduced the error rate by 60%.

6 Conclusions and future work

An exciting new application of SVMs to speaker recognition was shown. By computing frequencies of phones in conversations, speaker characterization was performed. A new kernel was introduced based on the standard method of log likelihood ratio scoring. The resulting SVM method reduced error rates dramatically over standard techniques.

Acknowledgements

This work was sponsored by the United States Government Technical Support Working Group under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

References

[1] Douglas A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[2] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. 161–164.

[3] T. F. Quatieri, D. A. Reynolds, and G. C. O'Leary, "Estimation of handset nonlinearity with application to speaker recognition," IEEE Trans. Speech and Audio Processing, vol. 8, no. 5, pp. 
567–584, 2000.

[4] Astrid Schmidt-Nielsen and Thomas H. Crystal, "Speaker verification by human listeners: Experiments comparing human and machine performance using the NIST 1998 speaker evaluation data," Digital Signal Processing, vol. 10, pp. 249–266, 2000.

[5] G. Doddington, "Speaker recognition based on idiolectal differences between speakers," in Proceedings of Eurospeech, 2001, pp. 2521–2524.

[6] Walter D. Andrews, Mary A. Kohler, Joseph P. Campbell, John J. Godfrey, and Jaime Hernández-Cordero, "Gender-dependent phonetic refraction for speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. I149–I153.

[7] David Klusáček, Jiří Navrátil, D. A. Reynolds, and J. P. Campbell, "Conditional pronunciation modeling in speaker detection," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2003, pp. IV-804–IV-807.

[8] Andre Adami, Radu Mihaescu, Douglas A. Reynolds, and John J. Godfrey, "Modeling prosodic dynamics for speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2003, pp. IV-788–IV-791.

[9] M. Przybocki and A. Martin, "The NIST year 2003 speaker recognition evaluation plan," http://www.nist.gov/speech/tests/spk/2003/index.htm, 2003.

[10] Linguistic Data Consortium, "Switchboard-2 corpora," http://www.ldc.upenn.edu.

[11] M. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. Speech and Audio Processing, vol. 4, no. 1, pp. 31–44, 1996.

[12] Thorsten Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers, 2002.

[13] G. Salton and C. 
Buckley, \u201cTerm weighting approaches in automatic text retrieval,\u201dInformation\n\nProcessing and Management, vol. 24, no. 5, pp. 513\u2013523, 1988.\n\n[14] Ronan Collobert and Samy Bengio, \u201cSVMTorch: Support vector machines for large-scale\n\nregression problems,\u201d Journal of Machine Learning Research, vol. 1, pp. 143\u2013160, 2001.\n\n[15] Alvin Martin, G. Doddington, T. Kamm, M. Ordowski, and Marc Przybocki, \u201cThe DET curve\nin assessment of detection task performance,\u201d inProceedings of Eurospeech, 1997, pp. 1895\u2013\n1898.\n\n\f", "award": [], "sourceid": 2523, "authors": [{"given_name": "William", "family_name": "Campbell", "institution": null}, {"given_name": "Joseph", "family_name": "Campbell", "institution": null}, {"given_name": "Douglas", "family_name": "Reynolds", "institution": null}, {"given_name": "Douglas", "family_name": "Jones", "institution": null}, {"given_name": "Timothy", "family_name": "Leek", "institution": null}]}