{"title": "Discriminative Keyword Selection Using Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 216, "abstract": "Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive phrases, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method on a language recognition task shows that the technique produces interesting and significant qualitative and quantitative results.", "full_text": "Discriminative Keyword Selection Using Support\n\nVector Machines\n\nW. M. Campbell, F. S. Richardson\n\nMIT Lincoln Laboratory\nLexington, MA 02420\n\nwcampbell,frichard@ll.mit.edu\n\nAbstract\n\nMany tasks in speech processing involve classi\ufb01cation of long term characteristics\nof a speech segment such as language, speaker, dialect, or topic. A natural tech-\nnique for determining these characteristics is to \ufb01rst convert the input speech into\na sequence of tokens such as words, phones, etc. From these tokens, we can then\nlook for distinctive sequences, keywords, that characterize the speech. In many\napplications, a set of distinctive keywords may not be known a priori. 
In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.\n\n1 Introduction\n\nA common problem in speech processing is to identify properties of a speech segment such as the language, speaker, topic, or dialect. A typical solution to this problem is to apply a detection paradigm. A set of classifiers is applied to a speech segment to produce a decision. For instance, for language recognition, we might construct detectors for English, French, and Spanish. The language of the maximum scoring detector on a speech segment would be the predicted language.\n\nTwo basic categories of systems have been applied to the detection problem. A first approach uses short-term spectral characteristics of the speech and models these with Gaussian mixture models (GMMs) or support vector machines (SVMs), directly producing a decision. Although quite accurate, this type of system produces only a classification decision with no qualitative interpretation. A second approach uses high level features of the speech such as phones and words to detect the properties. An advantage of this approach is that, in some instances, we can explain why we made a decision. For example, a particular phone or word sequence might indicate the topic. We adopt this latter approach for our paper.\n\nSVMs have become a common method of extracting high-level properties of sequences of speech tokens [1, 2, 3, 4]. 
Sequence kernels are constructed by viewing a speech segment as a document of tokens. The SVM feature space in this case is a scaling of co-occurrence probabilities of tokens in an utterance. This technique is analogous to methods for applying SVMs to text classification [5].\n\nSVMs have been applied at many linguistic levels of tokens as detectors. Our focus in this paper is at the acoustic phone level. Our goal is to automatically derive long sequences of phones, which we call keywords, that are characteristic of a given class.\n\n*This work was sponsored by the Department of Homeland Security under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.\n\nPrior work, for example, in language recognition [6], has shown that certain words are a significant predictor of a language. For instance, the presence of the phrase “you know” in a conversational speech segment is a strong indicator of English. A difficulty in using words as the indicator of the language is that we may not have a speech-to-text (STT) system available in all languages of interest. In this case, we would like to automatically construct keywords that are indicative of the language. Note that a similar problem can occur in other property extraction problems. For instance, in topic recognition, proper names not in our STT system dictionary may be a strong indicator of topic.\n\nOur basic approach is to view keyword construction as a feature selection problem. Keywords are composed of sequences of phones of length n, i.e. n-grams. We would like to find the set of n-grams that best discriminates between classes. Unfortunately, this problem is difficult to solve directly, since the number of unique n-grams grows exponentially with increasing n. 
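To make this growth concrete, here is a quick back-of-the-envelope computation; the 49-phone inventory figure is taken from the recognizer described later in Section 2.1, and the snippet itself is only an illustration, not part of the system.

```python
# Number of distinct n-grams over a 49-phone inventory grows as 49**n.
n_phones = 49
for n in range(1, 6):
    print(n, n_phones ** n)
# n = 5 already yields 282,475,249 candidate features.
```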
To alleviate this difficulty, we propose a method that starts with lower order n-grams and successively builds higher order n-grams.\n\nThe outline of the paper is as follows. In Section 2.1, we review the basic architecture that we use for phone recognition and how it is applied to the problem. In Section 2.2, we review the application of SVMs to determining properties. Section 3.1 describes a feature selection method for SVMs. Section 3.2 presents our method for constructing long context units of phones to automatically create keywords. We use a novel feature selection approach that attempts to find longer strings that discriminate well between classes. Finally, in Section 4, we show the application of our method to language and topic recognition problems. We show qualitatively that the method produces interesting keywords. Quantitatively, we show that the method produces keywords which are good discriminators between classes.\n\n2 Phonotactic Classification\n\n2.1 Phone Recognition\n\nThe high-level token extraction component of our system is a phone recognizer based upon the Brno University of Technology (BUT) design [7]. The basic architecture of this system is a monophone HMM system with a null grammar. Monophones are modeled by three states. This system uses two powerful components to achieve high accuracy. First, split temporal context (STC) features provide contextual cues for modeling monophones. Second, the BUT recognizer extensively uses discriminatively trained feedforward artificial neural networks (ANNs) to model HMM state posterior probabilities.\n\nWe developed a phone recognizer for English units using the BUT architecture and automatically generated STT transcripts on the Switchboard 2 Cell corpora [8]. Training data consisted of approximately 10 hours of speech. 
ANN training was accomplished using the ICSI QuickNet package [9]. The resulting system has 49 monophones including silence.\n\nThe BUT recognizer is used along with the HTK HMM toolkit [10] to produce lattices. Lattices encode multiple hypotheses with acoustic likelihoods. From a lattice, a 1-best (Viterbi) output can be produced. Alternatively, we use the lattice to produce expected counts of tokens and n-grams of tokens.\n\nExpected counts of n-grams can be easily understood as an extension of standard counts. Suppose we have a hypothesized string of tokens, W = w1, ..., wn. Then bigrams are created by grouping two tokens at a time to form W2 = w1_w2, w2_w3, ..., wn-1_wn. Higher order n-grams are formed from longer juxtapositions of tokens. The count function for a given bigram, di, count(di|W2), is the number of occurrences of di in the sequence W2. To extend counts to a lattice, L, we find the expected count over all possible hypotheses in the lattice,\n\ncount(di|L) = E_W[count(di|W)] = Σ_{W∈L} p(W|L) count(di|W). (1)\n\nThe expected counts can be computed efficiently by a forward-backward algorithm; more details can be found in Section 3.3 and [11].\n\nA useful application of expected counts is to find the probability of an n-gram in a lattice. For a lattice, L, the joint probability of an n-gram, di, is\n\np(di|L) = count(di|L) / Σ_j count(dj|L) (2)\n\nwhere the sum in (2) is performed over all unique n-grams in the utterance.\n\n2.2 Discriminative Language Modeling: SVMs\n\nWe focus on token-based language recognition with SVMs using the approach from [1, 4]. Similar to [1], a lattice of tokens, L, is modeled using a bag-of-n-grams approach. Joint probabilities of the unique n-grams, dj, on a per conversation basis are calculated, p(dj|L), see (2). Then, the probabilities are mapped to a sparse vector with entries\n\nDj p(dj|L). (3)\n\nThe selection of the weighting, Dj, in (3) is critical for good performance. A typical choice is of the form\n\nDj = min(Cj, gj(1 / p(dj|all))) (4)\n\nwhere gj(·) is a function which squashes the dynamic range, and Cj is a constant. The probability p(dj|all) in (4) is calculated from the observed probability across all classes. The squashing function should monotonically map the interval [1, ∞) to itself to suppress large inverse probabilities. Typical choices for gj are gj(x) = √x and gj(x) = log(x) + 1. In both cases, the squashing function gj normalizes out the typicality of a feature across all classes. The constant Cj limits the effect of any one feature on the kernel inner product. If we set Cj = 1, then this makes Dj = 1 for all j. For the experiments in this paper, we use gj(x) = √x, which is suited to high frequency token streams.\n\nThe general weighting of probabilities is then combined to form a kernel between two lattices; see [1] for more details. For two lattices, L1 and L2, the kernel is\n\nK(L1, L2) = Σ_j Dj^2 p(dj|L1) p(dj|L2). (5)\n\nIntuitively, the kernel in (5) says that if the same n-grams are present in two sequences and the normalized frequencies are similar, there will be a high degree of similarity (a large inner product). If n-grams are not present, then this will reduce similarity, since one of the probabilities in (5) will be zero. The normalization Dj ensures that n-grams with large probabilities do not dominate the kernel function. The kernel can alternatively be viewed as a linearization of the log-likelihood ratio [1].\n\nIncorporating the kernel (5) into an SVM system is straightforward. SVM training and scoring require only a method of kernel evaluation between two objects that produces positive definite kernel matrices (the Mercer condition). We use the package SVMTorch [12]. Training is performed with a one-versus-all strategy. 
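As an illustration of equations (2) through (5), the following sketch computes the bag-of-n-grams kernel on 1-best token strings; in the full system, lattice expected counts would replace the raw counts, and the function names and the default C here are our own, not part of the described system.

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Joint n-gram probabilities for one utterance, as in eq. (2) (1-best counts)."""
    grams = Counter("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def weights(background_probs, C=1000.0):
    """D_j = min(C_j, sqrt(1 / p(d_j|all))): squashed inverse background probability, eq. (4)."""
    return {g: min(C, (1.0 / p) ** 0.5) for g, p in background_probs.items()}

def kernel(p1, p2, D):
    """Linear sequence kernel K = sum_j D_j^2 p(d_j|L1) p(d_j|L2), eq. (5)."""
    shared = set(p1) & set(p2)
    return sum(D.get(g, 0.0) ** 2 * p1[g] * p2[g] for g in shared)
```

Only n-grams present in both utterances contribute, which is the source of the similarity intuition described for equation (5).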
For each target class, we group all remaining class data and then train with these two classes.\n\n3 Discriminative Keyword Selection\n\n3.1 SVM Feature Selection\n\nA first step towards an algorithm for automatic keyword generation using phones is to examine feature selection methods. Ideally, we would like to select, over all possible n-grams where n is varying, the most discriminative sequences for determining a property of a speech segment. The number of features in this case is prohibitive, since it grows exponentially with n. Therefore, we have to consider alternate methods.\n\nAs a first step, we examine feature selection for fixed n and look for keywords with n or fewer phones. Suppose that we have a set of candidate keywords. Since we are already using an SVM, a natural algorithm for discriminative feature selection in this case is to use a wrapper method [13].\n\nSuppose that the optimized SVM solution is\n\nf(X) = Σ_i αi K(X, Xi) + c (6)\n\nand\n\nw = Σ_i αi b(Xi) (7)\n\nwhere b(Xi) is the vector of weighted n-gram probabilities in (3). We note that the kernel presented in (5) is linear. Also, the n-gram probabilities have been normalized in (3) by their probability across the entire data set. Intuitively, because of this normalization and since f(X) = w^t b(X) + c, large magnitude entries in w correspond to significant features.\n\nA confirmation of this intuitive idea is the algorithm of Guyon et al. [14]. Guyon proposes an iterative wrapper method for feature selection for SVMs which has these basic steps:\n\n• For a set of features, S, find the SVM solution with model w.\n• Rank the features by their corresponding model entries wi^2, where wi is the ith entry of w in (7). 
• Eliminate low ranking features using a threshold.\n\nThe algorithm may be iterated multiple times.\n\nGuyon's algorithm for feature selection can be used for picking significant n-grams as keywords. We can create a kernel which is the sum of kernels as in (5) up to the desired n. We then train an SVM and rank n-grams according to the magnitude of the entries in the SVM model vector, w. As an example, we have looked at this feature selection method for a language recognition task with trigrams (to be described in Section 4). Figure 1 provides a motivation for the applicability of Guyon's feature selection method. The figure shows two functions. First, the cumulative density function (CDF) of the SVM model values, |wi|, is shown. The CDF has an S-curve shape; i.e., only a small set of model weights has large magnitudes. The second curve shows the equal error rate (EER) of the task as a function of applying one iteration of the Guyon algorithm and retraining the SVM. EER is defined as the value where the miss and false alarm rates are equal. All features with |wi| below the value on the x-axis are discarded in the first iteration. From the figure, we see that only a small fraction (< 5%) of the features are needed to obtain good error rates. This interesting result provides motivation that a small subset of keywords is significant to the task.\n\n[Figure 1: Feature selection for a trigram language recognition task using Guyon's method. The plot shows the CDF of |wi| and the EER as functions of the pruning threshold.]\n\n3.2 Keywords via an alternating wrapper/filter method\n\nThe algorithm in Section 3.1 gives a method for n-gram selection for fixed n. 
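One ranking-and-pruning iteration of the Section 3.1 wrapper can be sketched compactly as follows; the subgradient trainer here is a simple stand-in for SVMTorch, and `keep_frac` and the helper names are our own illustrative choices.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Pegasos-style subgradient descent on the regularized hinge loss; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)
        viol = y * (X @ w) < 1                      # margin violators
        grad = lam * w
        if viol.any():
            grad = grad - (X[viol] * y[viol, None]).mean(axis=0)
        w = w - eta * grad
    return w

def select_features(X, y, keep_frac=0.05):
    """One Guyon-style iteration: train, rank features by w_i**2, keep the top fraction."""
    w = train_linear_svm(X, y)
    k = max(1, int(keep_frac * X.shape[1]))
    return np.sort(np.argsort(w ** 2)[-k:])         # indices of retained features
```

Iterating this on the surviving columns reproduces the multi-pass pruning described above.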
Now, suppose we want to find keywords for arbitrary n. One possible hypothesis for keyword selection is that since higher order n-grams are discriminative, lower order n-grams in the keywords will also be discriminative. Therefore, it makes sense to find distinguishing lower order n-grams and then construct longer units from these. On the basis of this idea, we propose the following algorithm for keyword construction:\n\nKeyword Building Algorithm\n\n• Start with an initial value of n = ns. Initialize the set, S′_n, to all possible n-grams of phones including lower order grams. By default, let S1 be the set of all phones.\n• (Wrapper Step) General n. Apply the feature selection algorithm in Section 3.1 to produce a subset of distinguishing n-grams, Sn ⊂ S′_n.\n• (Filter Step) Construct a new set of (n + 1)-grams by juxtaposing elements from Sn with phones. Nominally, we take this step to be juxtaposition on the right and left, S′_{n+1} = {dp, qd | d ∈ Sn, p ∈ S1, q ∈ S1}.\n• Iterate to the wrapper step.\n• Output: Sn at some stopping n.\n\nA few items should be noted about the proposed keyword building algorithm. First, we call the second feature selection process a filter step, since induction has not been applied to the (n + 1)-gram features. Second, note that the purpose of the filter step is to provide a candidate set of possible (n + 1)-grams which can then be more systematically reduced. Third, several potential algorithms exist for the filter step. In our experiments and in the algorithm description, we nominally append one phone to the beginning and end of an n-gram. Another possibility is to try to combine overlapping n-grams. For instance, suppose the keyword is some_people which has phone transcript s_ah_m_p_iy_p_l. 
Then, if we are looking at 4-grams, we might see as top features s_ah_m_p and p_iy_p_l and combine these to produce a new keyword.\n\n3.3 Keyword Implementation\n\nThe expected n-gram counts were computed from lattices using the forward-backward algorithm. Equation (8) gives the posterior probability of a connected sequence of arcs in the lattice, where src_nd(a) and dst_nd(a) are the source and destination node of arc a, ℓ(a) is the likelihood associated with arc a, α(n) and β(n) are the forward and backward probabilities of reaching node n from the beginning or end of the lattice L respectively, and ℓ(L) is the total likelihood of the lattice (the α(·) of the final node or β(·) of the initial node of the lattice):\n\np(aj, ..., aj+n) = α(src_nd(aj)) ℓ(aj) · · · ℓ(aj+n) β(dst_nd(aj+n)) / ℓ(L). (8)\n\nNow, if we define the posterior probability of a node n as p(n) = α(n)β(n)/ℓ(L), then equation (8) becomes\n\np(aj, ..., aj+n) = [p(aj) · · · p(aj+n)] / [p(src_nd(aj+1)) · · · p(src_nd(aj+n))]. (9)\n\nEquation (9) is attractive because it provides a way of computing the path posteriors locally using only the individual arc and node posteriors along the path. We use this computation along with a trie structure [15] to compute the posteriors of our keywords.\n\n4 Experiments\n\n4.1 Language Recognition Experimental Setup\n\nThe phone recognizer described in Section 2.1 was used to generate lattices across a train and an evaluation data set. The training data set consists of more than 360 hours of telephone speech spanning 13 different languages and coming from a variety of different sources including Callhome, Callfriend and Fisher. 
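The local computation of equation (9) in Section 3.3 can be sketched as follows; this is a minimal illustration with hand-set arc and node posteriors, and the dictionary-based lattice encoding is our own, not the system's trie implementation.

```python
def path_posterior(arcs, arc_post, node_post, src_nd):
    """p(a_j, ..., a_{j+n}) per eq. (9): product of arc posteriors divided by
    the posteriors of the interior source nodes along the path."""
    num = 1.0
    for a in arcs:
        num *= arc_post[a]
    den = 1.0
    for a in arcs[1:]:                  # interior nodes only
        den *= node_post[src_nd[a]]
    return num / den
```

For example, a two-arc path through an interior node n1 with arc posteriors 0.5 and node posterior 0.5 gives 0.25 / 0.5 = 0.5, using only quantities local to the path.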
The evaluation data set is the NIST 2005 Language Recognition Evaluation data, consisting of roughly 20,000 utterances (with durations of 30, 10 or 3 seconds depending on the task) coming from three collection sources: Callfriend, Mixer and OHSU. We evaluated our system for the 30 and 10 second tasks under the NIST 2005 closed condition, which limits the evaluation data to 7 languages (English, Hindi, Japanese, Korean, Mandarin, Spanish and Tamil) coming only from the OHSU data source.\n\nThe training and evaluation data were segmented using an automatic speech activity detector, and segments smaller than 0.5 seconds were thrown out. We also sub-segmented long audio files in the training data to keep the duration of each utterance to around 5 minutes (a shorter duration would have created too many training instances). Lattice arcs with posterior probabilities lower than 10^-6 were removed, and lattice expected counts smaller than 10^-3 were ignored. The top and bottom 600 ranking keywords for each language were selected after each training iteration. The support vector machine was trained using a kernel formulation which requires pre-computing all of the kernel distances between the data points and using an alternate kernel which simply indexes into the resulting distance matrix (this approach becomes difficult when the number of data points is too large).\n\n4.2 Language Recognition Results (Qualitative and Quantitative)\n\nTo get a sense of how well our keyword building algorithm was working, we looked at the top ranking keywords from the English model only (since our phone recognizer is trained using the English phone set). Table 1 summarizes a few of the more compelling phone 5-grams, and a possible keyword that corresponds to each one. 
Not surprisingly, we noticed that in the list of top-ranking n-grams there were many variations or partial n-gram matches to the same keyword, as well as n-grams that didn't correspond to any apparent keyword.\n\nThe equal error rates for our system on the NIST 2005 language recognition evaluation are summarized in Table 2. The 4-gram system gave a relative improvement of 12% on the 10 second task and 9% on the 30 second task, but despite the compelling keywords produced by the 5-gram system, the performance actually degraded significantly compared to the 3-gram and 4-gram systems.\n\nTable 1: Top ranking keywords for 5-gram SVM for English language recognition model\n\nRank | Phones | Keyword\n1 | SIL_Y_UW_N_OW | you know\n3 | !NULL_SIL_Y_EH_AX | <s> yeah\n4 | !NULL_SIL_IY_M_TH | <s> ???\n6 | P_IY_P_AX_L | people\n7 | R_IY_L_IY_SIL | really\n8 | Y_UW_N_OW_OW | you know (var)\n17 | T_L_AY_K_SIL | ? like\n23 | L_AY_K_K_SIL | like (var)\n27 | R_AY_T_SIL_!NULL | right </s>\n29 | HH_AE_V_AX_N | have an\n37 | !NULL_SIL_W_EH_L | <s> well\n\nTable 2: %EER for 10 and 30 second NIST language recognition tasks\n\nN | 1 | 2 | 3 | 4 | 5\n10sec | 25.3 | 16.5 | 11.3 | 10.0 | 13.6\n30sec | 18.3 | 7.4 | 4.3 | 3.9 | 5.6\n\n4.3 Topic Recognition Experimental Setup\n\nTopic recognition was performed using a subset of the phase I Fisher corpus (English) from the LDC. This corpus consists of 5,851 telephone conversations. Participants were given instructions to discuss a topic for 10 minutes from 40 different possible topics. Topics included “Education”, “Hobbies”, “Foreign Relations”, etc. Prompts were used to elicit discussion on the topics. An example prompt is:\n\nMovies: Do each of you enjoy going to the movies in a theater, or would you rather rent a movie and stay home? What was the last movie that you saw? Was it good or bad and why?\n\nFor our experiments, we used 2750 conversation sides for training. 
We also constructed development and test sets of 1372 conversation sides each. The training set was used to find keywords and models for topic detection.\n\n4.4 Topic Recognition Results\n\nWe first looked at top ranking keywords for several topics; some results are shown in Table 3. We can see that many keywords show a strong correspondence with the topic. Also, there are partial keywords which correspond to what appear to be longer keywords, e.g. “eh_t_s_ih_k” corresponds to get sick.\n\nAs in the language recognition task, we used EER as the performance measure. Results in Table 4 show the performance for several n-gram orders. Performance improves going from 3-grams to 4-grams. But, as with the language recognition task, we see a degradation in performance for 5-grams.\n\n5 Conclusions and future work\n\nWe presented a method for automatic construction of keywords given a discriminative speech classification task. Our method was based upon successively building longer span keywords from shorter span keywords using phones as a fundamental unit. The problem was cast as a feature selection problem, and an alternating filter and wrapper algorithm was proposed. 
Results showed that reasonable keywords and improved performance could be achieved using this methodology.\n\nTable 3: Top keyword for 5-gram SVM in Topic Recognition\n\nTopic | Phones | Keyword\nProfessional Sports on TV | S_P_AO_R_T | sport\nHypothetical: Time Travel | G_OW_B_AE_K | go back\nAffirmative Action | AX_V_AE_K_CH | [affirmat]ive act[ion]\nUS Public Schools | S_K_UW_L_Z | schools\nMovies | IY_V_IY_D_IY | DVD\nHobbies | HH_OH_B_IY_Z | hobbies\nSeptember 11 | HH_AE_P_AX_N | happen\nIssues in the Middle East | IH_Z_R_IY_L | Israel\nIllness | EH_T_S_IH_K | [g]et sick\nHypothetical: One Million Dollars to leave the US | Y_UW_M_AY_Y | you may\n\nTable 4: Performance of Topic Detection for different n-gram orders\n\nn-gram order | 3 | 4 | 5\nEER (%) | 10.22 | 8.95 | 9.40\n\nNumerous possibilities exist for future work on this task. First, extension and experimentation on other tasks such as dialect and speaker recognition would be interesting. The method has the potential for discovery of new interesting characteristics. Second, comparison of this method with other feature selection methods may be appropriate [16]. A third area for extension is various technical improvements. For instance, we might want to consider more general keyword models where skips are allowed (or more general finite state transducers [17]). Also, alternate methods for the filter used to construct higher order n-grams are a good area for exploration.\n\nReferences\n\n[1] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek, “Phonetic speaker recognition with support vector machines,” in Advances in Neural Information Processing Systems 16, Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, Eds., MIT Press, Cambridge, MA, 2003.\n\n[2] W. M. Campbell, T. Gleason, J. Navratil, D. Reynolds, W. Shen, E. Singer, and P. 
Torres-Carrasquillo, “Advanced language recognition using cepstra and phonotactics: MITLL system performance on the NIST 2005 language recognition evaluation,” in Proc. IEEE Odyssey, 2006.\n\n[3] Bin Ma and Haizhou Li, “A phonotactic-semantic paradigm for automatic spoken document classification,” in The 28th Annual International ACM SIGIR Conference, Brazil, 2005.\n\n[4] Lu-Feng Zhai, Man-hung Siu, Xi Yang, and Herbert Gish, “Discriminatively trained language models using support vector machines for language identification,” in Proc. IEEE Odyssey: The Speaker and Language Recognition Workshop, 2006.\n\n[5] T. Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers, 2002.\n\n[6] W. M. Campbell, F. Richardson, and D. A. Reynolds, “Language recognition with word lattices and support vector machines,” in Proceedings of ICASSP, 2007, pp. IV-989–IV-992.\n\n[7] Petr Schwarz, Pavel Matejka, and Jan Cernocky, “Hierarchical structures of neural networks for phoneme recognition,” in Proceedings of ICASSP, 2006, pp. 325–328.\n\n[8] Linguistic Data Consortium, “Switchboard-2 corpora,” http://www.ldc.upenn.edu.\n\n[9] “ICSI QuickNet,” http://www.icsi.berkeley.edu/Speech/qn.html.\n\n[10] S. Young, Gunnar Evermann, Thomas Hain, D. Kershaw, Gareth Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Entropic, Ltd., Cambridge, UK, 2002.\n\n[11] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.\n\n[12] Ronan Collobert and Samy Bengio, “SVMTorch: Support vector machines for large-scale regression problems,” Journal of Machine Learning Research, vol. 1, pp. 143–160, 2001.\n\n[13] Avrim L. Blum and Pat Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, vol. 97, no. 1-2, pp. 245–271, Dec. 
1997.\n\n[14] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.\n\n[15] Konrad Rieck and Pavel Laskov, “Language models for detection of unknown attacks in network traffic,” Journal of Computer Virology, vol. 2, no. 4, pp. 243–256, 2007.\n\n[16] Takaaki Hori, I. Lee Hetherington, Timothy J. Hazen, and James R. Glass, “Open-vocabulary spoken utterance retrieval using confusion networks,” in Proceedings of ICASSP, 2007.\n\n[17] C. Cortes, P. Haffner, and M. Mohri, “Rational kernels,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds., Cambridge, MA, 2003, pp. 601–608, MIT Press.\n", "award": [], "sourceid": 703, "authors": [{"given_name": "Fred", "family_name": "Richardson", "institution": null}, {"given_name": "William", "family_name": "Campbell", "institution": null}]}