{"title": "A Sequence Kernel and its Application to Speaker Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1157, "page_last": 1163, "abstract": null, "full_text": "A Sequence Kernel and its Application to\n\nSpeaker Recognition\n\nWilliam M. Campbell\n\nMotorola Human Interface Lab\n\n7700 S. River Parkway\n\nTempe, AZ 85284\n\nBill.Campbell@motorola.com\n\nAbstract\n\nA novel approach for comparing sequences of observations using an\nexplicit-expansion kernel is demonstrated. The kernel is derived using\nthe assumption of the independence of the sequence of observations and\na mean-squared error training criterion. The use of an explicit expan-\nsion kernel reduces classi\ufb01er model size and computation dramatically,\nresulting in model sizes and computation one-hundred times smaller in\nour application. The explicit expansion also preserves the computational\nadvantages of an earlier architecture based on mean-squared error train-\ning. Training using standard support vector machine methodology gives\naccuracy that signi\ufb01cantly exceeds the performance of state-of-the-art\nmean-squared error training for a speaker recognition task.\n\n1 Introduction\n\nComparison of sequences of observations is a natural and necessary operation in speech\napplications. Several recent approaches using support vector machines (SVM\u2019s) have been\nproposed in the literature. The \ufb01rst set of approaches attempts to model emission proba-\nbilities for hidden Markov models [1, 2]. This approach has been moderately successful\nin reducing error rates, but suffers from several problems. First, large training sets result\nin long training times for support vector methods. Second, the emission probabilities must\nbe approximated [3], since the output of the support vector machine is not a probability.\nA more recent method for comparing sequences is based on the Fisher kernel proposed by\nJaakkola and Haussler [4]. 
This approach has been explored for speech recognition in [5].
The application to speaker recognition is detailed in [6]. We propose an alternative kernel
based upon polynomial classifiers and the associated mean-squared error (MSE) training
criterion [7]. The advantage of this kernel is that it preserves the structure of the classifier
in [7], which is both computationally and memory efficient.

We consider the application of text-independent speaker recognition; i.e., determining or
verifying the identity of an individual through voice characteristics. Text-independent
recognition implies that knowledge of the text of the speech data is not used. Traditional
methods for text-independent speaker recognition are vector quantization [8], Gaussian
mixture models [9], and artificial neural networks [8]. A state-of-the-art approach based
on polynomial classifiers was presented in [7]. The polynomial approach has several
advantages over traditional methods: 1) it is extremely computationally efficient for
identification, 2) the classifier is discriminative, which eliminates the need for a background
or cohort model [10], and 3) the method generates small classifier models.

In Section 2, we describe polynomial classifiers and the associated scoring process. In
Section 3, we review the process for mean-squared error training. Section 4 introduces the
new kernel. Section 5 compares the new kernel approach to the standard mean-squared
error training approach.

2 Polynomial classifiers for sequence data

We start by considering the problem of speaker verification, a two-class problem. In this
case, the goal is to determine the correctness of an identity claim (e.g., a user id was entered
in the system) from a voice input. If ω is the class, then the decision to be made is whether
the claim is valid, ω = spk, or an impostor is trying to break into the system, ω = imp.
We motivate the classification process from a probabilistic viewpoint.

For the verification application, a decision is made from a sequence of observations
x_1, ..., x_N extracted from the speech input.
We decide based on the output of a discriminant function using a polynomial classifier. A
polynomial classifier has the form

    f(x) = w^T p(x)                                                                 (1)

where w is the vector of classifier parameters (the model) and p(x) is an expansion of the
input space into the vector of monomials of degree K or less. For example, if K = 2 and
x = [x_1  x_2]^T, then

    p(x) = [1  x_1  x_2  x_1^2  x_1 x_2  x_2^2]^T.

Note that we do not use a nonlinear activation function as is common in higher-order neural
networks; this allows us to find a closed-form solution for training. Also, note that we use
a bold p for the expansion to avoid confusion with probabilities.

If the polynomial classifier is trained with a mean-squared error training criterion and target
values of 1 for ω = spk and 0 for ω = imp, then f(x) will approximate the a posteriori
probability p(spk | x) [11]. We can then find the probability of the entire sequence,
p(spk | x_1, ..., x_N), as follows. Assuming independence of the observations [12] gives

    p(x_1, ..., x_N | spk) = ∏_{i=1}^{N} p(x_i | spk).                              (2)

Applying Bayes' rule to each term gives p(x_i | spk) = p(spk | x_i) p(x_i) / p(spk). For the
purposes of classification, we can discard the class-independent factors p(x_i). We take the
logarithm of both sides to get the discriminant function

    d'(X) = Σ_{i=1}^{N} log [ p(spk | x_i) / p(spk) ]                               (3)

where we have used the shorthand X to denote the sequence x_1, ..., x_N. We use two
terms of the Taylor series, log z ≈ z − 1, to approximate the discriminant function and
also normalize by the number of frames to obtain the final discriminant function

    d(X) = (1/N) Σ_{i=1}^{N} p(spk | x_i) / p(spk).                                 (4)

Note that we have discarded the −1 in this discriminant function since this will not affect
the classification decision. The key reason for using the Taylor approximation is that it
reduces computation without significantly affecting classifier accuracy.

Now assume we have a polynomial function f(x) approximating the probability
p(spk | x); we call the vector w the speaker model.
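In (4), each posterior is approximated by the polynomial (1), so sequence scoring reduces
to averaging the monomial expansions over frames and taking a single inner product with
the model. A minimal numpy sketch of this computation (toy data; the function names and
the degree-2 expansion layout are ours, not the paper's implementation):

```python
import itertools
import numpy as np

def monomial_expand(x, degree):
    """All monomials of the entries of x up to the given total degree, led by the constant 1."""
    terms = [1.0]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(len(x)), d):
            terms.append(float(np.prod([x[i] for i in idx])))
    return np.array(terms)

def sequence_score(w, frames, degree=2):
    """Average the per-frame expansions, then take one inner product with the model w."""
    b = np.mean([monomial_expand(f, degree) for f in frames], axis=0)
    return float(w @ b)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 2))                          # toy sequence: 50 two-dimensional observations
w = rng.normal(size=monomial_expand(frames[0], 2).shape)   # toy speaker model (6 monomials for K = 2)
print(sequence_score(w, frames))
```

For two features and K = 2 the expansion has the six monomials listed above; for the
12-dimensional features and 3rd degree expansion used in Section 5, it has 455 terms.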
Substituting in the polynomial function f(x) gives

    d(X) = (1 / (N p(spk))) Σ_{i=1}^{N} w^T p(x_i) = (1 / p(spk)) w^T b_X          (5)

where we have defined the mapping from the sequence x_1, ..., x_N to the vector b_X as

    b_X = (1/N) Σ_{i=1}^{N} p(x_i).                                                 (6)

We summarize the scoring method. For a sequence of input vectors x_1, ..., x_N and a
speaker model w, we construct b_X using (6). We then score using the speaker model,
s = w^T b_X. Since we are performing verification, if s is above a threshold then we declare
the identity claim valid; otherwise, the claim is rejected as an impostor attempt. More
details on this probabilistic scoring method can be found in [13].

Extending the sequence scoring framework to the case of identification (i.e., identifying
the speaker from a list of speakers by voice) is straightforward. In this case, we construct
speaker models w_k for each speaker k and then choose the speaker k which maximizes
w_k^T b_X (assuming equal prior probability of each speaker). Note that identification has
low computational complexity, since we must only compute one inner product to determine
each speaker's score.

3 Mean-squared error training

We next review how to train the polynomial classifier to approximate the probability
p(spk | x); this process will help us set notation for the following sections. Let w be the
desired speaker model and y(ω) the ideal output; i.e., y(spk) = 1 and y(imp) = 0. The
resulting problem is

    w* = argmin_w E[ ( w^T p(x) − y(ω) )^2 ]                                        (7)

where E denotes expectation. This criterion can be approximated using the training set as

    w* = argmin_w [ Σ_{i=1}^{N_spk} ( w^T p(x_i) − 1 )^2
                    + Σ_{i=1}^{N_imp} ( w^T p(y_i) )^2 ].                           (8)

Here, the speaker's training data is x_1, ..., x_{N_spk}, and the anti-speaker data is
y_1, ..., y_{N_imp}.
(Anti-speakers are designed to have the same statistical characteristics as the impostor
set.)

The training method can be written in matrix form. First, define M_spk as the matrix whose
rows are the polynomial expansions of the speaker's data; i.e.,

    M_spk = [ p(x_1)  p(x_2)  ...  p(x_{N_spk}) ]^T.                                (9)

Define a similar matrix for the impostor data, M_imp, and define

    M = [ M_spk ]
        [ M_imp ].                                                                  (10)

The problem (8) then becomes

    w* = argmin_w || M w − o ||^2                                                   (11)

where o is the vector consisting of N_spk ones followed by N_imp zeros (i.e., the ideal
output). The problem (11) can be solved using the method of normal equations,

    M^T M w = M^T o.                                                                (12)

Since o is ones on the speaker rows and zeros elsewhere, we can rearrange (12) to

    M^T M w = M_spk^T 1                                                             (13)

where 1 is the vector of all ones.
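The matrix formulation above can be sketched directly: stack the expansions into M, solve
the normal equations, and the resulting model scores speaker frames near 1 and impostor
frames near 0. A hypothetical numpy illustration (synthetic data; names are ours):

```python
import itertools
import numpy as np

def monomial_expand(x, degree=2):
    """All monomials of the entries of x up to the given total degree, led by the constant 1."""
    terms = [1.0]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(len(x)), d):
            terms.append(float(np.prod([x[i] for i in idx])))
    return np.array(terms)

rng = np.random.default_rng(1)
spk = rng.normal(loc=1.0, size=(200, 2))    # toy speaker frames
imp = rng.normal(loc=-1.0, size=(800, 2))   # toy anti-speaker frames

M_spk = np.array([monomial_expand(x) for x in spk])
M_imp = np.array([monomial_expand(x) for x in imp])
M = np.vstack([M_spk, M_imp])

# Normal equations: (M^T M) w = M^T o, and M^T o = M_spk^T 1 since o is ones then zeros.
R = M.T @ M
w = np.linalg.solve(R, M_spk.T @ np.ones(len(spk)))

# The trained model approximates the posterior: higher on speaker data than on impostor data.
print(np.mean(M_spk @ w), np.mean(M_imp @ w))
```

Because the expansion contains a constant term, the least-squares residual is orthogonal to
the all-ones column, so the mean prediction over all frames equals the fraction of speaker
frames in the training set.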
If we define R = M^T M and b_spk = (1 / N_spk) M_spk^T 1, then (13) becomes

    w = N_spk R^{-1} b_spk.                                                         (14)

(Note that b_spk is exactly the mapping (6) applied to the speaker's training data.)

4 The naive a posteriori sequence kernel

We can now combine the methods from Sections 2 and 3 to obtain a novel sequence
comparison kernel in a straightforward manner. Combine the speaker model from (14) with
the scoring equation from (5) to obtain the classifier score

    s = (1 / p(spk)) w^T b_X = (N_spk / p(spk)) b_spk^T R^{-1} b_X.                 (15)

Now p(spk) ≈ N_spk / N_total (because of the large anti-speaker population), so that (15)
becomes

    s = N_total b_spk^T R^{-1} b_X.                                                 (16)

The scoring method in (16) is the basis of our sequence kernel. Given two sequences of
speech feature vectors, we compare them by mapping each sequence to its average
expansion, b_X and b_Y (note that this is exactly the same as mapping the training data
using (6)), and then computing

    K_NAPS(X, Y) = N_total b_X^T R^{-1} b_Y.                                        (17)

We call K_NAPS the naive a posteriori sequence kernel since scoring assumes independence
of observations and training approximates the a posteriori probabilities. The value
K_NAPS(X, Y) can be interpreted as scoring using a polynomial classifier on the sequence
Y, see (5), with the MSE model trained from the feature vectors of X (or vice versa,
because of symmetry).

Several observations should be made about the NAPS kernel. First, scoring complexity can
be reduced dramatically in training by the following trick. We factor R / N_total = C C^T
using the Cholesky decomposition. Then K_NAPS(X, Y) = (C^{-1} b_X)^T (C^{-1} b_Y); i.e.,
if we transform all the sequence data by b → C^{-1} b before training, the sequence kernel
is a simple inner product. For our application in Section 5, this reduces training time from
several hours per speaker down to minutes on a Sun Ultra workstation. Second, since the
NAPS kernel explicitly performs the expansion to "feature space," we can simplify the
output of the support vector machine. Suppose g(b) is the (soft) output of the SVM,

    g(b) = Σ_i α_i y_i K_NAPS(b_i, b) + d
         = ( N_total Σ_i α_i y_i R^{-1} b_i )^T b + d                               (18)

where the b_i are the support vectors, y_i their labels, and α_i the trained weights. We can
simplify this to

    g(b) = w̃^T b + d                                                               (19)

where w̃ is the quantity in parentheses in (18). That is, once we train the support vector
machine, we can collapse all the support vectors down into a single model w̃. Third,
although the NAPS kernel is reminiscent of the Mahalanobis distance, it is distinct.
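The kernel computation, the Cholesky-factor transformation, and the support-vector
collapse just described can be sketched as follows (synthetic expanded data; the weights
standing in for α_i y_i come from a random draw, not a trained SVM):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N_total = 6, 1000
B = rng.normal(size=(N_total, D))      # toy expanded training population (rows play the role of p(x))
R = B.T @ B                            # correlation matrix M^T M from Section 3

b_x = rng.normal(size=D)               # average expansion b for one sequence
b_y = rng.normal(size=D)               # average expansion b for another sequence

def naps(bx, by):
    """K_NAPS(X, Y) = N_total * b_X^T R^{-1} b_Y, as in (17)."""
    return N_total * bx @ np.linalg.solve(R, by)

# Cholesky trick: factor R / N_total = C C^T; after transforming b -> C^{-1} b,
# the kernel is a plain inner product.
C = np.linalg.cholesky(R / N_total)
tx = np.linalg.solve(C, b_x)
ty = np.linalg.solve(C, b_y)
assert np.isclose(naps(b_x, b_y), tx @ ty)

# Support-vector collapse: sum_i a_i K(b_i, b) equals w_tilde^T b for one model w_tilde.
sv = rng.normal(size=(10, D))          # stand-in support vectors
a = rng.normal(size=10)                # stand-ins for alpha_i * y_i
w_tilde = N_total * np.linalg.solve(R, sv.T @ a)
b_test = rng.normal(size=D)
assert np.isclose(sum(a[i] * naps(sv[i], b_test) for i in range(10)), w_tilde @ b_test)
```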
No assumption of equal covariance matrices for different classes (speakers) is made for the
new kernel; the kernel covariance matrix is a mixture of the individual class covariances.
Also, the kernel is not a distance measure; no subtraction of means occurs as in the
Mahalanobis distance.

5 Results

5.1 Setup

The NAPS kernel was tested on the standard speaker recognition database YOHO [14],
collected from 138 speakers. Utterances in the database consist of combination lock
phrases of fixed length; e.g., "23-45-56." Enrollment and verification sessions were
recorded at distinct times. (Enrollment is the process of collecting data for training and
generating a speaker model. Verification is the process of testing the system; i.e., the user
makes an identity claim and then this hypothesis is verified.) For each speaker, enrollment
consisted of four sessions each containing twenty-four utterances. Verification consisted of
ten separate sessions with four utterances per session (again per speaker). Thus, there are
40 tests of the speaker's identity and 40*137=5480 possible impostor attempts on a speaker.
For clarity, we emphasize that enrollment and verification session data is completely
separate.

To extract features for each of the utterances, we used standard speech processing. Each
utterance was broken up into frames of 30 ms each with a frame rate of 100 frames/sec.
The mean was removed from each frame, and the frame was preemphasized with the filter
1 − 0.97 z^{-1}. A Hamming window was applied and then 12 linear prediction
coefficients were found. The resulting coefficients were transformed to 12 cepstral
coefficients.
Endpointing was performed to eliminate non-speech frames. This typically resulted in a
few hundred observations per utterance.

For verification, we measure performance in terms of the pooled and average equal error
rates (EER). The average EER is found by averaging the individual EER for each speaker.
The individual EER is the error rate at the threshold where the false accept rate (FAR)
equals the false reject rate (FRR). The pooled EER is found by setting a constant threshold
across the entire population; when the FAR equals the FRR for the entire population, this
is termed the pooled EER. For identification, the misclassification error rate is used.

5.2 Experiments

To eliminate bias in verification, we trained the first 69 speakers against the first 69 and
the second 69 against the second 69 (as in [7]). We then performed verification using the
second 69 speakers as impostors to the first 69 speaker models and vice versa. This ensures
that the impostors are unknown. For identification, we trained all 138 speakers against
each other.

We trained support vector machines for each speaker using the software tool SVMTorch
[15] and the NAPS kernel (17). The 12 cepstral features were mapped to a dimension 455
vector using a 3rd degree polynomial classifier. Single utterances (i.e., "23-45-56") were
converted to single vectors using the mapping (6) and then transformed with the Cholesky
factor to reduce computation. We cross-validated using the first three enrollment sessions
as training and the fourth enrollment session as a test to determine the best tradeoff
between margin and error; the best performing value of this trade-off parameter was used
for the final SVMTorch training.
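The pooled EER measurement described in Section 5.1 can be sketched as follows
(synthetic scores; the helper name is ours). The average EER is the mean of this same
quantity computed per speaker:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Error rate at the threshold where false-accept rate meets false-reject rate (grid sweep)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = (1.0, 0.0)
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        frr = np.mean(target_scores < t)      # true speakers wrongly rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

rng = np.random.default_rng(3)
# Pooled EER: one threshold across all speakers' trials combined.
tgt = rng.normal(loc=2.0, size=400)   # synthetic target-trial scores
imp = rng.normal(loc=0.0, size=5000)  # synthetic impostor-trial scores
print(f"pooled EER ~ {100 * eer(tgt, imp):.2f}%")
```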
Using the identical set of features and the same methodology, classifier models were also
trained using the mean-squared error criterion via the method in [7]. For final testing, all
four enrollment sessions were used for training, and all verification sessions were used
for testing.

Results for verification and identification are shown in Table 1. The new kernel method
reduces error rates considerably: relative to MSE training, the average EER is reduced by
38%, the pooled EER by 47%, and the identification error rate by 42%. Storing the support
vectors directly resulted in a large model (in single-precision floating point); using the
model size reduction method in Section 4 collapsed each model to a single 455-dimensional
vector, over a hundred times reduction in size.

Table 1: Comparison of structural risk minimization and MSE training

                    MSE     NAPS SVM
    Average EER     1.63%   1.01%
    Pooled EER      2.76%   1.45%
    ID error rate   4.71%   2.72%

We also plotted scores for all speakers versus a threshold, see Figure 1. We normalized
the scores for the MSE and SVM approaches to the same range for comparison. One can
easily see the reduction in pooled EER from the graph. Note also the dramatic shifting of
the FRR curve to the right for the SVM training, resulting in substantially better error
rates than the MSE training.
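The relative reductions quoted above follow directly from the Table 1 values:

```python
# Table 1 values: MSE vs. NAPS SVM, in percent.
mse = {"average EER": 1.63, "pooled EER": 2.76, "ID error rate": 4.71}
svm = {"average EER": 1.01, "pooled EER": 1.45, "ID error rate": 2.72}
reduction = {k: round(100 * (mse[k] - svm[k]) / mse[k]) for k in mse}
print(reduction)
```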
For instance, at a fixed low FAR, the MSE training method gives a substantially higher
FRR than the SVM training method, a reduction in error by roughly a factor of 3.

Figure 1: FAR/FRR rates for the entire population versus a threshold for the SVM and
MSE training methods

6 Conclusions and future work

A novel kernel for comparing sequences in speech applications was derived, the NAPS
kernel. This data-dependent kernel was motivated by using a probabilistic scoring method
and mean-squared error training. Experiments showed that incorporating this kernel in
an SVM training architecture yielded performance superior to that of the MSE training
criterion. Reductions in error rates of up to 47% were observed while retaining the
efficiency of the original MSE classifier architecture.

The new kernel method is also applicable to more general situations. Potential applications
include using the approach with radial basis functions, application to automatic speech
recognition, and extending to an SVM/HMM architecture.

References

[1] Vincent Wan and William M. Campbell, "Support vector machines for verification and
identification," in Neural Networks for Signal Processing X, Proceedings of the 2000 IEEE
Signal Processing Workshop, 2000, pp. 775–784.

[2] Aravind Ganapathiraju and Joseph Picone, "Hybrid SVM/HMM architectures for speech
recognition," in Speech Transcription Workshop, 2000.

[3] John C. Platt, "Probabilities for SV machines," in Advances in Large Margin Classifiers,
Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, Eds., pp.
61–74.
The MIT Press, 2000.

[4] Tommi S. Jaakkola and David Haussler, "Exploiting generative models in discriminative
classifiers," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A.
Solla, and D. A. Cohn, Eds. 1998, pp. 487–493, The MIT Press.

[5] Nathan Smith, Mark Gales, and Mahesan Niranjan, "Data-dependent kernels in SVM
classification of speech patterns," Tech. Rep. CUED/F-INFENG/TR.387, Cambridge
University Engineering Department, 2001.

[6] Shai Fine, Jiří Navrátil, and Ramesh A. Gopinath, "A hybrid GMM/SVM approach to
speaker recognition," in Proceedings of the International Conference on Acoustics, Speech,
and Signal Processing, 2001.

[7] William M. Campbell and Khaled T. Assaleh, "Polynomial classifier techniques for speaker
verification," in Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing, 1999, pp. 321–324.

[8] Kevin R. Farrell, Richard J. Mammone, and Khaled T. Assaleh, "Speaker recognition using
neural networks and conventional classifiers," IEEE Trans. on Speech and Audio Processing,
vol. 2, no. 1, pp. 194–205, Jan. 1994.

[9] Douglas A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker
models," The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173–192, 1995.

[10] Michael J. Carey, Eluned S. Parris, and John S. Bridle, "A speaker verification system
using alpha-nets," in Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing, 1991, pp. 397–400.

[11] Jürgen Schürmann, Pattern Classification, John Wiley and Sons, Inc., 1996.

[12] Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition,
Prentice-Hall, 1993.

[13] William M. Campbell and C. C.
Broun, "A computationally scalable speaker recognition system," in Proceedings of
EUSIPCO, 2000, pp. 457–460.

[14] Joseph P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus,"
in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing,
1995, pp. 341–344.

[15] Ronan Collobert and Samy Bengio, "Support vector machines for large-scale regression
problems," Tech. Rep. IDIAP-RR 00-17, IDIAP, 2000.
", "award": [], "sourceid": 1951, "authors": [{"given_name": "William", "family_name": "Campbell", "institution": null}]}