{"title": "Using Voice Transformations to Create Additional Training Talkers for Word Spotting", "book": "Advances in Neural Information Processing Systems", "page_first": 875, "page_last": 882, "abstract": null, "full_text": "Using Voice Transformations to Create \n\nAdditional Training Talkers for Word Spotting \n\nEric I. Chang and Richard P.  Lippmann \n\nMIT Lincoln Laboratory \n\nLexington, MA 02173-0073, USA \n\neichang@sst.ll.mit.edu and rpl@sst.ll.mit.edu \n\nAbstract \n\nSpeech  recognizers  provide  good  performance  for  most  users  but  the \nerror rate often  increases dramatically  for a small  percentage of talkers \nwho are \"different\" from  those talkers used for training.  One expensive \nsolution to  this problem is  to  gather more training data in an attempt to \nsample these outlier users. A second solution, explored in this paper,  is \nto artificially enlarge the number of training talkers by transforming the \nspeech of existing training talkers. This approach is similar to enlarging \nthe training set for  OCR digit recognition  by  warping the training digit \nimages,  but  is  more  difficult  because  continuous  speech  has  a  much \nlarger number of dimensions  (e.g.  linguistic,  phonetic,  style,  temporal, \nspectral) that differ across talkers. We explored the use of simple linear \nspectral warping to enlarge a 48-talker training data base used for  word \nspotting.  The  average  detection  rate  overall  was  increased  by  2.9 \npercentage  points  (from  68.3%  to  71.2%)  for  male  speakers  and  2.5 \npercentage  points  (from  64.8%  to  67.3%)  for  female  speakers.  This \nincrease is  small but similar to  that obtained by  doubling the amount of \ntraining data. \n\n1  INTRODUCTION \nSpeech  recognizers,  optical  character  recognizers,  and  other  types  of pattern  classifiers \nused for human interface applications often provide good performance for most users. Per(cid:173)\nformance is often, however, low and unacceptable for a small percentage of \"outlier\" users \nwho are presumably  not represented  in  the training  data.  One expensive  solution  to  this \nproblem is  to  obtain more training data in  the hope of including users from  these outlier \n\n\f876 \n\nEric  I.  Chang,  Richard P.  Lippmann \n\nclasses. Other approaches already used for speech recognition are to use input features and \ndistance metrics  that are  relatively  invariant to  linguistically  unimportant differences  be(cid:173)\ntween talkers and to adapt a recognizer for individual talkers. Talker adaptation is difficult \nfor word spotting and with poor outlier users because the recognition error rate is high and \ntalkers often can not be prompted to recite standard phrases that can be used for adaptation. \nAn alternative approach, that has not been fully explored for speech recognition, is to arti(cid:173)\nficially expand the number of training talkers using voice transformations. \n\nTransforming the speech of one talker to make it sound like that of another is difficult be(cid:173)\ncause speech varies across many difficult-to-measure dimensions including linguistic, pho(cid:173)\nnetic, duration, spectra, style, and accent. The transformation task is thus more difficult than \nin optical character recognition where a small set of warping functions can be successfully \napplied to character images to enlarge the number of training images (Drucker,  1993). 
This paper demonstrates how creating more training data by warping the spectra of training talkers can improve the performance of a whole-word word spotter on a large spontaneous-speech data base. \n\n
2  BASELINE WORD SPOTTER \nA hybrid radial basis function (RBF) - hidden Markov model (HMM) keyword spotter has been developed over the past few years that provides state-of-the-art performance for whole-word word spotting on the large spontaneous-speech credit-card speech corpus. This system spots 20 target keywords, includes one general filler class, and uses a Viterbi decoding backtrace as described in (Chang, 1994) to backpropagate errors over a sequence of input speech frames. This neural network word spotter is trained on target and background classes, normalizes target outputs using the background output, and thresholds the resulting score to generate putative hits, as shown in Figure 1. Putative hits in this figure are input patterns which generate normalized scores above a threshold. The performance of this, and other spotting systems, is analyzed by plotting a detection versus false alarm rate curve. This curve is generated by adjusting the classifier output threshold to allow few or many putative hits. The figure of merit (FOM) is defined as the average keyword detection rate when the false alarm rate ranges from 1 to 10 false alarms per keyword per hour. The previous best FOM for this word spotter is 67.8% when trained using 24 male talkers and tested on 11 male talkers, and 65.9% when trained using 24 female talkers and tested on 11 female talkers. The overall FOM for all talkers is 66.3%. \n\n
Figure 1: Block diagram of the neural network word spotter (continuous speech input -> neural network word spotter -> putative hits). \n\n
3  TALKER VARIABILITY \nFOM scores of test talkers vary over a wide range. When training on 48 talkers and then testing on 22 talkers from the 70 conversations in the NIST Switchboard credit card database, the FOM of the test talkers varies from 16.7% to 100%. Most talkers perform well above 50%, but there are two female talkers with FOMs of 16.7% and 21.4%. The low FOM for these individual speakers indicates a lack of training data with voice qualities that are similar to these test speakers. \n\n
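For concreteness, the FOM used throughout this paper can be computed from a detection versus false alarm rate curve as in the short sketch below. This sketch is illustrative only and is not the original scoring software; the function names, the interpolation step, and the example operating curve are assumptions. \n\n
import numpy as np \n\n
def figure_of_merit(detection_rates, false_alarm_rates): \n
    # Average keyword detection rate as the false alarm rate ranges over \n
    # 1 to 10 false alarms per keyword per hour, as defined in Section 2. \n
    # The operating curve comes from sweeping the wordspotter output threshold. \n
    targets = np.arange(1.0, 11.0)  # 1, 2, ..., 10 false alarms/keyword/hour \n
    rates = np.interp(targets, false_alarm_rates, detection_rates) \n
    return rates.mean() \n\n
# Hypothetical operating curve (false alarm rates must be increasing). \n
fa = np.array([0.5, 2.0, 4.0, 8.0, 12.0]) \n
det = np.array([0.40, 0.55, 0.63, 0.70, 0.74]) \n
print(figure_of_merit(det, fa)) \n\n
With these made-up values the sketch returns roughly 0.63, i.e. a 63% average detection rate; the FOM values quoted in this paper are the same quantity expressed as a percentage. \n\n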
4  CREATING MORE TRAINING DATA USING VOICE TRANSFORMATIONS \n
Talker adaptation is difficult for word spotting because error rates are high and talkers often cannot be prompted to verify adaptation phrases. Our approach to increasing performance across talkers uses voice transformation techniques to generate more varied training examples of keywords, as shown in Figure 2. Other researchers have used talker transformation techniques to produce more natural synthesized speech (Iwahashi, 1994; Mizuno, 1994), but using talker transformation techniques to generate more training data is novel. \n\n
Figure 2: Generating more training data by artificially transforming original speech training data (original speech -> voice transformation system -> transformed speech). \n\n
We have implemented a new voice transformation technique which utilizes the Sinusoidal Transform Analysis/Synthesis System (STS) described in (Quatieri, 1992). This technique attempts to transform one talker's speech pattern into that of a different talker. The STS generates a 512-point spectral envelope of the input speech 100 times a second and also separates pitch and voicing information. Separation of vocal tract characteristics and pitch information allowed the implementation of pitch and time transformations in previous work (Quatieri, 1992). The system has been modified to generate and accept a spectral envelope file from an input speech sample. We informally explored different techniques for transforming the spectral envelope to generate more varied training examples by listening to the transformed speech. This resulted in the following algorithm, which transforms a talker's voice by scaling the spectral envelope of training talkers. \n\n
1. Training conversations are upsampled from 8000 Hz to 10,000 Hz to be compatible with existing STS coding software. \n
2. The STS system processes the upsampled files and generates a 512-point spectral envelope of the input speech waveform at a frame rate of 100 frames a second and with a window length of approximately 2.5 times the length of each pitch period. \n
3. A new spectral envelope is generated by linearly expanding or compressing the spectral axis. Each spectral point is identified by its index, ranging from 0 to 511. To transform a spectral profile by a ratio of 2, the new spectral value at frequency f is generated by averaging the spectral values around the original spectral profile at a frequency of 0.5f. The transformation process is illustrated in Figure 3. In this figure, an original spectral envelope is being expanded by two. The spectral value at index 150 is thus transformed to spectral index 300 in the new envelope, and the original spectral information at high frequencies is lost. \n
4. The transformed spectral envelope is used to resynthesize a speech waveform using the vocal tract excitation information extracted from the original file. \n\n
Voice transformation with the STS coder allows listening to transformed speech but requires lengthy computation. We simplified our approach to one of modifying the spectral scale directly in the spectral domain within a mel-scale filterbank analysis program. The incoming speech sample is processed with an FFT to calculate spectral magnitudes. Then the spectral magnitudes are linearly transformed. Lastly, mel-scale filtering is performed with 10 linearly spaced filters up to 1000 Hz and logarithmically spaced filters from 1000 Hz up. A cosine transform is then used to generate mel-scaled cepstral values that are used by the wordspotter. Much faster processing can be achieved by applying the spectral transformation as part of the filterbank analysis. For example, while performing the spectral transformation using the STS algorithm takes up to approximately 10 times real time, spectral transformation within the mel-scale filterbank program can be accomplished within 1/10 real time on a Sparc 10 workstation. The rapid processing rate allows on-line spectral transformation. \n\n
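The fast path described above, in which the warping is applied to the FFT magnitudes inside the filterbank analysis, can be sketched as follows. This is an illustrative reconstruction rather than the authors' filterbank code: the triangular mel filterbank below is a generic one rather than the 10-linear-plus-logarithmic layout used in the paper, simple linear interpolation stands in for the local averaging of step 3, and all function names and parameter values are assumptions. \n\n
import numpy as np \n\n
def warp_spectrum(mag, ratio): \n
    # Linearly expand (ratio > 1) or compress (ratio < 1) the frequency axis. \n
    # The value at new bin k is read from the original spectrum at bin k / ratio, \n
    # so expansion pushes original high-frequency content off the end (Figure 3). \n
    bins = np.arange(len(mag), dtype=float) \n
    return np.interp(bins / ratio, bins, mag, right=0.0) \n\n
def mel_filterbank(n_filters, n_bins, sample_rate): \n
    # Generic triangular filters on the mel scale (not the paper's exact layout). \n
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0) \n
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0) \n
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)) \n
    edges = edges / (sample_rate / 2.0) * (n_bins - 1)  # Hz -> FFT bin index \n
    fbank = np.zeros((n_filters, n_bins)) \n
    k = np.arange(n_bins, dtype=float) \n
    for i in range(n_filters): \n
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2] \n
        fbank[i] = np.clip(np.minimum((k - lo) / (mid - lo), (hi - k) / (hi - mid)), 0.0, None) \n
    return fbank \n\n
def warped_mel_cepstra(frame, ratio, sample_rate=8000, n_fft=512, n_ceps=12): \n
    # One frame of mel-scale cepstra with the spectral transformation applied \n
    # as part of the filterbank analysis. \n
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) \n
    mag = warp_spectrum(mag, ratio) \n
    fbank = mel_filterbank(24, len(mag), sample_rate) \n
    log_energies = np.log(fbank @ mag + 1e-10) \n
    n = np.arange(len(log_energies)) \n
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / len(n)) \n
    return dct @ log_energies \n\n
In use, the ratio passed to warped_mel_cepstra would be the per-conversation ratio described in Section 5, drawn once per conversation per training epoch. \n\n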
5  WORD SPOTTING EXPERIMENTS \n
Linear warping in the spectral domain, which is used in the above algorithm, is correct when the vocal tract is modelled as a series of lossless acoustic tubes and the excitation source is at one end of the vocal tract (Wakita, 1977). Wakita showed that if the vocal tract is modelled as a series of equal-length, lossless, and concatenated acoustic tubes, then the ratio of the areas between the tubes determines the relative resonant frequencies of the vocal tract, while the overall length of the vocal tract linearly scales formant frequencies. Preliminary research was conducted using linear scaling with spectral ratios ranging from 0.6 to 1.8 to alter test utterances. After listening to the STS-transformed speech and also observing spectrograms of the transformed speech, it was found that speech transformed using ratios between 0.9 and 1.1 is reasonably natural and can represent speech without introducing artifacts. \n\n
Figure 3: An example of the spectral transformation algorithm where the original spectral envelope frequency scale is expanded by 2 (original and transformed spectral envelopes plotted against spectral value index, 0 to 511). \n\n
Using discriminative training techniques such as FOM training carries the risk of overtraining the wordspotter on the training set and obtaining results that are poor on the testing set. To delay the onset of overtraining, we artificially transformed each training set conversation during each epoch using a different random linear transformation ratio. \n\n
The transformation ratio used for each conversation is calculated using the following formula: ratio = a + N(0, 0.06), where a is the transformation ratio that matches each training speaker to the average of the training set speakers, and N is a normally distributed random variable with a mean of 0.0 and a standard deviation of 0.06. For each training conversation, the long-term averages of the frequencies of formants 1, 2, and 3 are calculated. A least squares estimation is then performed to match the formant frequencies of each training set conversation to the group average formant frequencies. The transformation equation is described below: \n\n
Figure 4: Average detection accuracy (FOM) for the male training and testing set versus the number of epochs of FOM training. \n\n
The transform ratio for each individual conversation is calculated to improve the naturalness of the transformed speech. In preliminary experiments, each conversation was transformed with fixed ratios of 0.9, 0.95, 1.05, and 1.1. However, for a speaker with already high formant frequencies, pushing the formant frequencies higher may make the transformed speech sound unnatural. By incorporating the individual formant matching ratio into the transformation ratio, speakers with high formant frequencies are not transformed to very high frequencies and speakers with low formant frequencies are not transformed to even lower formant frequency ranges. \n\n
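The per-conversation ratio computation can be sketched as follows. The exact least squares fit (the transformation equation referred to above) is not preserved in the source text, so the single-scalar form used here is an assumption, and the formant values in the example are invented for illustration. \n\n
import numpy as np \n\n
def formant_matching_ratio(speaker_formants, group_formants): \n
    # Scalar a that best maps a speaker's long-term average F1, F2, F3 onto \n
    # the group averages, assuming a least squares fit of the form \n
    # minimize sum_i (a * F_speaker_i - F_group_i)^2. \n
    s = np.asarray(speaker_formants, dtype=float) \n
    g = np.asarray(group_formants, dtype=float) \n
    return float(np.dot(s, g) / np.dot(s, s)) \n\n
def epoch_transform_ratio(speaker_formants, group_formants, rng, sigma=0.06): \n
    # ratio = a + N(0, 0.06), drawn afresh for each conversation in each epoch. \n
    a = formant_matching_ratio(speaker_formants, group_formants) \n
    return a + rng.normal(0.0, sigma) \n\n
# Example: a talker whose long-term formants sit above the group average \n
# receives a ratio somewhat below 1.0 (plus noise), compressing the spectrum. \n
rng = np.random.default_rng(0) \n
speaker = [560.0, 1650.0, 2750.0]  # invented long-term average F1, F2, F3 in Hz \n
group = [500.0, 1500.0, 2500.0] \n
print(epoch_transform_ratio(speaker, group, rng)) \n\n
Drawing a new ratio every epoch means the word spotter never sees exactly the same spectral version of a conversation twice, which is how the paper delays overtraining during FOM training. \n\n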
Male and female conversations from the NIST credit card database were used separately to train separate word spotters. Both the male and the female partitions of the data used 24 conversations for training and 11 conversations for testing. Keyword occurrences were extracted from each training conversation and used as the data for initialization of the neural network word spotter. Also, each training conversation was broken up into sentence-length segments to be used for embedded reestimation, in which the keyword models are joined with the filler models and the parameters of all the models are jointly estimated. After embedded reestimation, Figure of Merit training as described in (Lippmann, 1994) was performed for up to 10 epochs. During each epoch, each training conversation is transformed using a transform ratio randomly generated as described above. The performance of the word spotter after each iteration of training is evaluated on both the training set and the testing set. \n\n
6  WORD SPOTTING RESULTS \n
Training and testing set FOM scores for the male speakers and the female speakers are shown in Figure 4 and Figure 5 respectively. The x axis plots the number of epochs of FOM training, where each epoch represents presenting all 24 training conversations once. The FOM for word spotters trained with the normal training conversations and for word spotters trained with artificially expanded training conversations are shown in each plot. After the first epoch, the FOM improves significantly. With only the original training conversations (normal), the testing set FOM rapidly levels off while the training set FOM keeps on improving. \n\n
Figure 5: Average detection accuracy (FOM) for the female training and testing set versus the number of epochs of FOM training. \n\n
When the training conversations are artificially expanded, the training set FOM is below the training set FOM from the normal training set due to the more difficult training data. However, the testing set FOM continues to improve as more epochs of FOM training are performed. When comparing the FOM of wordspotters trained on the two sets of data after ten epochs of training, the FOM for the expanded set was 2.9 percentage points above the normal FOM for male speakers and 2.5 percentage points above the normal FOM for female speakers. For comparison, Carlson has reported that for a high performance word spotter on this database, doubling the amount of training data typically increases the FOM by 2 to 4 percentage points (Carlson, 1994). \n\n
7  SUMMARY \n
Lack of training data has always been a constraint in training speech recognizers. This research presents a voice transformation technique which increases the variety among training talkers. 
The resulting more varied training set provided up to 2.9 percentage points of improvement in the figure of merit (average detection rate) of a high performance word spotter. This improvement is similar to the increase in performance provided by doubling the amount of training data (Carlson, 1994). This technique can also be applied to other speech tasks such as continuous speech recognition, talker identification, and isolated word recognition. \n\n
ACKNOWLEDGEMENT \n
This work was sponsored by the Advanced Research Projects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. We wish to thank Tom Quatieri for providing his sinusoidal transform analysis/synthesis system. \n\n
BIBLIOGRAPHY \n
B. Carlson and D. Seward. (1994) Diagnostic Evaluation and Analysis of Insufficient and Task-Independent Training Data on Speech Recognition. In Proceedings Speech Research Symposium XIV, Johns Hopkins University. \n
E. Chang and R. Lippmann. (1994) Figure of Merit Training for Detection and Spotting. In Neural Information Processing Systems 6, G. Tesauro, J. Cohen, and J. Alspector (Eds.), Morgan Kaufmann: San Mateo, CA. \n
H. Drucker, R. Schapire, and P. Simard. (1993) Improving Performance in Neural Networks Using a Boosting Algorithm. In Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. L. Giles (Eds.), Morgan Kaufmann: San Mateo, CA. \n
N. Iwahashi and Y. Sagisaka. (1994) Speech Spectrum Transformation by Speaker Interpolation. In Proceedings International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 461-464. \n
R. Lippmann, E. Chang, and C. Jankowski. (1994) Wordspotter Training Using Figure-of-Merit Back Propagation. In Proceedings International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 389-392. \n
H. Mizuno and M. Abe. (1994) Voice Conversion Based on Piecewise Linear Conversion Rules of Formant Frequency and Spectrum Tilt. In Proceedings International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 469-472. \n
T. Quatieri and R. McAulay. (1992) Shape Invariant Time-Scale and Pitch Modification of Speech. In IEEE Trans. Signal Processing, Vol. 40, No. 3, pp. 497-510. \n
H. Wakita. (1977) Normalization of Vowels by Vocal-Tract Length and Its Application to Vowel Identification. In IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 2, pp. 183-192. \n", "award": [], "sourceid": 950, "authors": [{"given_name": "Eric", "family_name": "Chang", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}