{"title": "Optimizing Correlation Algorithms for Hardware-Based Transient Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 678, "page_last": 684, "abstract": null, "full_text": "Optimizing Correlation Algorithms for \nHardware-based Transient Classification \n\nR. Timothy Edwardsl , Gert Cauwenberghsl, and Fernando J. Pineda2 \n\n1 Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 \n\n2 Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20723 \n\ne-mail:  {tim, gert, fernando}@bach.ece. jhu. edu \n\nAbstract \n\nThe perfonnance of dedicated VLSI neural processing hardware depends \ncritically  on  the  design  of the  implemented  algorithms.  We  have  pre(cid:173)\nviously  proposed an  algorithm  for  acoustic  transient classification  [1]. \nHaving implemented and demonstrated this algorithm in a mixed-mode \narchitecture,  we  now  investigate  variants  on  the algorithm,  using  time \nand frequency channel differencing, input and output nonnalization, and \nschemes to binarize and train the template values, with the goal of achiev(cid:173)\ning optimal classification perfonnance for the chosen hardware. \n\n1  Introduction \n\nAt the NIPS conference in  1996 [1],  we  introduced an  algorithm  for classifying acoustic \ntransient signals using template correlation. While many pattern classification systems use \ntemplate correlation [2},  our system differs in directly addressing the issue of efficient im(cid:173)\nplementation in analog hardware, to overcome the area and power consumption drawbacks \nof equivalent digital systems.  In the intervening two years, we have developed analog cir(cid:173)\ncuits and built VLSI hardware implementing both the template correlation and the frontend \nacoustic processing necessary to map the transient signal into a time-frequency representa(cid:173)\ntion corresponding to the template [3, 4].  In the course of hardware development, we have \nbeen led  to  reevaluate the algorithm in  the light of the possibilities and  the limitations of \nthe chosen hardware. \n\nThe general architecture is depicted in Figure 1 (a), and excellent agreement between simu(cid:173)\nlations and experimental output from a prototype is illustrated in Figure 1 (b).  Issues of im(cid:173)\nplementation efficiency and circuit technology aside,  the this paper specifically addresses \nfurther  improvements in  classification  perfonnance achievable by  algorithmic  modifica(cid:173)\ntions, tailored to the constraints and strengths of the implementation medium. \n\n\fOptimizing Correlation Algorithms/or Transient Classification \n\n679 \n\nTemplate \n\nCorrelator NxM \n\nN \n\nShift and Accumulate \n\n70 \n\n60 \n\n50 \n'5 \n.9- 40 \n::I o \nc  30 \no \n~ 20 \n~ \n8 \n\n-20 \n\n-\n\nSimulated \n\n\u2022  Measured \n\ncJtl \n\n20 \n\n40 \n\n60 \n\n80 \nTime \n\n100 \n\n120 \n\n140 \n\n(a) \n\n(b) \n\nFigure  1:  (a) System architecture of the acoustic transient classifier (b) Demonstration of \naccurate computation in  the analog correlator on a transient classification task. \n\n2  The transient classification algorithm \n\nThe core of our architecture performs the  running correlation  between an  acoustic  input \nand  a  set  of templates  for  distiguishing  between  Z  distinct  classes.  A  simple  template \ncorrelation equation for the acoustic transient classification can be written: \n\nM \n\nN \n\ncz[tJ  =  Kz  L  L  x[t - n, mJ pz[n, mJ \n\n(1) \n\nm=l n=l \n\nwhere M  is the number of frequency channels of the input, N  is the maximum number of \ntime bins in the window, and x  is the array of input signals representing the energy content \nin each of the M  bandpass frequency channels. The inputs x  are normalized across chan(cid:173)\nnels using an L-l  normalization so that the correlation is less affected by volume changes \nin the input.  The matrix pz  contains the template pattern values for pattern z out of a total \nof Z  classes;  K z  is  a constant gain coefficient for class z,  and t is the current time.  This \nformula produces a running correlation Cz [tJ  of the input array with the template for class \nz.  A signal is classified as  belonging to class  z  when the output Cz  exceeds the output for \nall other classes at a point in time t determined by simple segmentation of the input. \nTo train and evaluate the system, we used a database of 22 recorded samples of 10 different \nclasses of \"everyday\" transients such as  the sounds made by  aluminum cans,  plastic tubs, \nhandclaps, and the like. \n\nEach example transient recording was processed through a thirty-two channel constant-Q \nanalog cochlear filter  with  output taps spaced  on  a  logarithmic frequency  scale [6].  For \nthe simulations, the frontend system outputs were sampled and saved to disk, then digitally \nrectified  and  smoothed  with  a  lowpass  filter  function  with  a  2 ms  time  constant.  These \nthirty-two channel outputs representing short-term average energy in each frequency band \nwere decimated to 500 Hz and normalized with the function \n\nx[t, mJ  =  y[t, mJ/ Ly[t, kJ, \n\nM+l \n\n(2) \n\nwhere y[t, M  + 1J  is a constant-valued input added to the system in order to supress noise \nin  the  normalized  outputs during  periods  of silence.  The  additional  output x[t, M  + 1J \n\nk=l \n\n\f680 \n\nR.  T.  Edwards,  G.  Cauwenberghs and F. J.  Pineda \n\nbecomes maximum  during the  periods of silence and  minimum  during presentation of a \ntransient event.  This extra output can be used to detect onsets of transients, but is  not used \nin the correlation computation of equation (1). \n\nTemplate values pz  are learned by automatically aligning all examples of the same class in \nthe training set using a threshold on the  normalization output x[t, M  + 1],  and averaging \nthe values together over N  samples. starting a few  samples before the point of alignment. \nClass outputs are normalized relative to one another by  mUltiplying each output by  a gain \nfactor K z , computed from the template values using the L-2 norm function \n\nKz  = \n\nN \n\nM \nL  LPz[n,m]2. \nm=l n=l \n\n(3) \n\nWe evaluated the accuracy of the system with a cross-validation loop in which we train the \nsystem on all of the database except one example of one class, then test on that remaining \nexample.  repeating  the  test  for  each  of the  220 examples in  the  database.  The  baseline \nalgorithm gives a classification accuracy of 96.4%. \n\n3  Single-bit template values \n\nA major consideration for hardware implementations (both digital and analog) is the mem(cid:173)\nory  storage required by  the  templates,  one of which  is  required  for each class.  Minimal \nstorage space in terms of bits per template is practical only if the algorithm can be proved \nto perform acceptably well under decreased levels of quantization of the template values. \n\nAt one bit per template location (i.e., M  x N  bits per template), the complexity of the hard(cid:173)\nware is greatly simplified, but it is no longer obvious what method is best to use for learn(cid:173)\ning the template values, or for calculating the per-class gains.  The choice of the method is \nguided by knowledge about the acoustic transients themselves, and simulation to evaluate \nits effect on the accuracy of a typical classification task. \n\n4  Simulations of different zero-mean representations \n\nOne bit per template value is a desirable goal, but realizing this goal requires reevaluating \nthe original correlation equation.  The input values to be correlated represent band-limited \nenergy spectra, and range from  zero to  some maximum determined by the L-l  normaliza(cid:173)\ntion.  To determine the value of a template bit, the averaged value over all examples of the \nclass in the training set must be compared to a threshold (which itself must be determined), \nor else the input itself must be transformed into a form  with zero average mean  value.  In \nthe latter method, the  template  value is  determined by  the sign  of the  transformed  input, \naveraged over all examples of the class in the training set. \n\nThe obvious transformations of the input which provide a vector of zero-mean signals to the \ncorrelator are the time derivative of each input channel, and the difference between neigh(cid:173)\nboring channels. Certain variations of these are possible, such as a center-surround compu(cid:173)\ntation of channel differences, and zero-mean combinations of time and channel differences. \nWhile there is evidence that center-surround mechanisms are common to neurobiological \nsignal  processing of various sensory  modalities in  the  brain,  including processing  in  the \nmammalian auditory cortex [5],  time derivatives of the input are also  plausible in  light of \nthe  short time  base of acoustic  transient events.  Indeed,  there  is  no  reason  to  assume a \npriori that channel differences are even meaningful on the time scale of transients. \n\nTable 1 shows simulation results, where classification accuracy on the cross-validation test \nis given for different combinations of continuous-valued and binary inputs and templates, \n\n\fOptimizing Correlation Algorithms for Transient Classification \n\n681 \n\nTable  I:  Simulation results with different architectures. \n\nMethod \n\nOne-to-One \n\nTime Difference \n\nChannel Difference \n\nCenter-Surround \n\nBoth \nCont. \n96.40% \n85.59% \n90.54% \n92.79% \n\nBinary \nInput \n-\n\nBoth \nBinary \n\n-\n\n65.32% \n59.46% \n53.60% \n95.05% \n53.60%  95.05% \n\nBinary (1, -1)  Binary (1,0) \n\nTemplate \n\n-\n\n82.43% \n94.59% \n92.34% \n\nTemplate \n\n-\n\n81.98% \n94.14% \n92.34% \n\nand different zero-mean transformations of the input.  There are several significant points \nto  the results of these classification  tasks.  The first  is  to  note that in  spite of the fact that \nacoustic transient events are short-term and the time steps between the bins in the template \nas low as 2 ms, using time differences between samples does not yield reliable classification \nwhen either the input or the template or both is reduced to binary form.  However, reliability \nremains high when the correlation is performed using channel differences. The implication \nis that even the shortest transient events have stable and reliable structure in the frequency \ndomain, a somewhat surprising conclusion given the impulsive nature of most transients. \n\nAnother interesting  point is  that we observe no  significant difference between  the use of \npairwise channel differences and the more complicated center-surround mechanism (twice \nthe channel value minus the value of the two neighboring channels). The slight decrease in \naccuracy for the center-surround in  some instances is most likely due only to the fact that \none less channel contributes information to the correlator than in  the pairwise channel dif(cid:173)\nference computation.  When accuracy is constant, a hardware implementation will  always \nprefer the simpler mechanism. \nVery little difference in accuracy is seen between the use of a binary (1, -1) representation \nand a binary (1,0) representation, in spite ofthe fact that all zero-valued template positions \ndo not contribute to the correlation output.  This lack of difference is a result of the choice \nof the L-l normalization across the input vector, which ensures that the part of the correla(cid:173)\ntion due to positive template values is roughly the same magnitude as that due to negative \ntemplate values,  leading to  a redundant representation which can  be removed without af(cid:173)\nfecting  classification  results.  In  analog  hardware,  particularly  current-mode circuits,  the \n(1,0) template representation is  much simpler to implement. \n\nTime differencing of the input can be efficiently realized in analog hardware by commuting \nthe time-difference calculation to the end of the correlation computation and implementing \nit with a simple switch-capacitor circuit. Taking differences between input channel values, \non the other hand, is no so easily reduced to a simple hardware form.  To find  a reasonable \nsolution,  we simulated  a  number of different combinations of channel  differencing  and \nbinarization.  Table 2 shows a few examples.  The first row is our standard implementation \nof channel  differences  using  binary  (1,0)  templates  and  continuous-valued input.  The \ndrawback of this method in analog hardware is the matching between negative and positive \nparts of the correlation sum.  We found two ways to get around this problem without greatly \ncompromising the system performance: The first,  shown in the second row of Table 2 is to \nadd to the correlation sum only if the channel difference is positive and the template value \nis  1 (one-quadrant multiplication). Another (shown in the last row) is to add the maximum \nof each pair of channels if the template value is  1,  which is  preferable in  that it uses  the \ninput  values directly  and does  not require computing a  difference at  all.  Unfortunately, \nit  also adds  a large  component to  the output which  is  related only to  the  total  energy of \nthe input and therefore is common to all class outputs, reducing the dynamic range of the \nsystem. \n\n\f682 \n\nR.  T Edwards.  G.  Cauwenberghs  and F. J  Pineda \n\nTable 2:  Simulation results for different methods of computing channel differences \n\nmethod \n\nchannel difference \n\none-quadrant multiply \n\nmaximum channel \n\naccuracy \n94.14% \n92.34% \n93.69% \n\n5  Optimization of the classifier using per-class gains \n\nThe per-class gain values Kz in equation (1) are optimal for the baseline algorithm when us(cid:173)\ning the L-2 normalization.  The same normalization applied to the binary templates (when \nthe  template  value  is  assumed  to  be  either  +1  or  -1) yields  the  same  K z  value  for  all \nclasses.  This unity gain on all class outputs is assumed in  all the simulations of the previ(cid:173)\nous section.  A careful evaluation of errors from  several runs indicated the possibility that \ndifferent gains on each channel could improve recognition rates,  and simple experiments \nwith values tweaked by hand proved this suspicion to be true. \n\nTo automate the process of gain optimization, we consider the templates, as determined by \naveraging together examples of each class in the training set, to be fixed.  Then we compute \nthe correlation between each template and the aligned, averaged inputs for each class which \nwere used to generate the templates.  The result is a Z  x  Z  matrix, which we denote C, of \nexpected values for the correlation between a typical example of a transient input and the \ntemplate for its own class (diagonal elements Cii ) and  the templates for all  other classes \n(off-diagonal elements Cij,  i  '=I  j).  Each column of C  is  like  the  correlator outputs on \nwhich we make a classification decision by choosing the maximum.  Therefore we wish to \nmaximize Cii  with respect to  all  other elements in  the  same column.  The only degree of \nfreedom for adjusting these values is to multiply the correlation output of each template z \nby  a constant coefficient K z .  This corresponds to multiplying each row of C  by  K z .  This \nper-class gain mechanism is easily transferred to the analog hardware domain. \n\nIn  the case of continuous-valued templates,  an optimal solution can  be directly evaluated \nand  yields  the  L-2  normalization.  However,  for  all  binary  forms  of the  template  and/or \nthe input,  direct evaluation is  impossible  and the  solution must  be found  by  choosing an \nerror function E to minimize or maximize.  The error function must assign a large error to \nany  off-diagonal element in a column that approaches or exceeds the diagonal element in \nthat column, but must not force the cross-correlations to arbitrarily low negative values.  A \nminimizing function that fits  this description is \n\nE =  L L exp (KjCji  - KiCii ). \n\n(4) \n\ni  #i \n\nThis function unfortunately has no closed-form solution for the coefficients Ki, which must \nbe determined numerically using Newton-Raphson or some other iterative method. \n\nImprovements in the recognition rates of the classification task using this optimization of \nper-class gains is shown in Table 3, where we have considered only the case of inputs and \ntemplates encoding channel differences.  Although the database is small, the gains of 2 to \n4% for the quantized cases are significant. For this particular simulation we used a different \ntype of frontend  section  to  verify  that  the  performance of the correlation  algorithm  was \nnot linked to  a specific  frontend  architecture.  To  generate these  performance values,  we \nused  sixteen channels with  the inputs digitally  processed through  a constant-Q bandpass \nfilter  having a  Q  of 5.0 and  with  center frequencies  spaced  on  a mel  scale from  100Hz \nto  4500 Hz.  The  bandpass  filtering  was  followed  by  rectification  and  smoothing  with  a \nlowpass filter function with a cutoff frequency scaled logarithmically across channels, from \n60 Hz to  600 Hz.  The channel output data were decimated to  a 500 Hz rate.  Half of the \n\n\fOptimizing Correlation Algorithms/or Transient Classification \n\n683 \n\ndatabase was used to train the system, and half used to test.  Performance is similar to that \nreported in  the previous section in  spite of the fact that the number of channels was cut in \nhalf, and the number of training examples was also cut in half.  Slight gains in performance \nare most likely due to the cleaner digital filtering of the recorded data. \n\nTable 3:  System accuracy with and without per-class normalization. \n\naccuracy, optimized \nbinarization \nnone \n100% \ntemplate only \n93% \ntemplate &  input  95% \n\naccuracy, non-optimized \n100% \n91% \n91% \n\n6  System Robustness \n\nWe  performed several  additional experiment in  addition  to those covered in  the previous \nsections.  One of these was an evaluation of recognition accuracy as a function of the tem(cid:173)\nplate length N  (number of time bins), to determine what is  a proper size for the templates. \nThe result is shown in Figure 2 (a). This curve reaches a reliable maximum at about 50 time \nbins,  from  which our chosen size  for the hardware implementation of 64 bins provides a \nsafe margin of error.  However, it is interesting to  note that recognition accuracy does not \ndrop to that of random chance until only two time bins are used (64 bits per template), and \naccuracy is nearly 50% with only 3 time bins (96 bits per template). \n\n100~--~~~====~======~ \n90 \n\n100r---~--~----~--~----~--~ \n\n~ 80 \n<fl. \n;: 70 \n[:6 \n5  60 \nu \nu \nc(  50 \nE \n~ 40 \n~ \n(J)  30 \n\n20 \n\n80 \n\n20 \n\n1%~--~20~--~40~---OO~--~8~0 ----~100 \n\nCorrelator length N \n\n-~O \n\n-20 \n\n- 10 \n\n0 \n\nSNR (dB) \n\n10 \n\n20 \n\n30 \n\n(a) \n\n(b) \n\nFigure 2:  (a) Effect of decreasing the number of time-bins.  (b) Effect of white noise added \nto the correlator inputs. \n\nWe  made  one evaluation  of the  robustness  of the  algorithm  in  the  presence of noise  by \nintroducing additional  white  noise at  the correlator inputs.  The graph of Figure 2 (right) \nshows that accuracy remains high until the signal-to-noise ratio is roughly OdB. \n\nAn interesting question to ask about the L-l normalization at the frontend is how the added \nconstant normalization channel (y[t, M  + 1]) affects the classification performance. If this \nchannel  is  omitted,  then  the total  instantaneous value of all  outputs must equal  the  same \nvalue, even during periods of silence, in which low-level noise gets amplified. The nominal \nvalue of this channel was chosen to  match the levels of noise in  the  transient recordings. \nFor one of the cases of Table  1 (real input, binary (1,0) template, channel differencing at \n\n\f684 \n\nR.  T.  Edwards,  G.  Cauwenberghs and F. 1.  Pineda \n\nthe input), we tried two other tests,  one with the normalization constant doubled, and one \nwith it omitted (zero).  Doubling the normalization constant had no effect on the error rate, \nwhile omitting it caused the accuracy to drop only from  94.1 % to 92.3%.  The conclusion \nis  that  for  large  templates,  random  noise  has  a  low  probability  of producing a  spurious \npositive correlation that would  be classified as a transient.  The classification algorithm is \nnot largely dependent on input signal normalization. \n\n7  Conclusions \n\nStarting  from  a  template correlation  architecture for  acoustic  transient classification  tar(cid:173)\ngeted for high-density, low-power analog VLSI implementation, we have investigated sev(cid:173)\neral  variants  on  the  correlation  algorithms,  accounting  for  the  strengths  and  constraints \nof the VLSI  implementation medium  while maintaining acceptable classification  perfor(cid:173)\nmance. \n\nReduction of input and templates to binary form does not significantly affect performance, \nas  long  as  they  are transformed  to  encode the difference in  neighboring channels of the \noriginal filterbank  frontend outputs.  This suggests that acoustic transient classification  is \nnot only  amenable to  implementation in  simple analog  hardware,  but also  in  reasonably \nsimple digital hardware. \nIn  looking for zero-mean representations of the input compatible with a binary template, \nwe found that computing pairwise differences between channels gives a more robust rep(cid:173)\nresentation than a time-differential form, as was reported previously in [1].  We have found \nthat computing a  center-surround function  of the  inputs yields  virtually  the same results \nas  taking  pairwise channel differences.  Where hardware implementation is the goal,  the \npairwise difference function is preferred due to its greater simplicity. \n\nWe  have additionally shown that cross-correlations between aligned,  averaged inputs and \ntemplates can  be used  with an  iterative method to  solve for optimal gain  coefficients per \nclass output, which yield better classification performance.  This is a method which can be \napplied in general to all template correlation systems. \n\nReferences \n\n[1]  F.  J.  Pineda,  G.  Cauwenberghs,  R.  T.  Edwards,  \"Bangs,  Clicks,  Snaps,  Thuds,  and \nWhacks:  An  Architecture for  Acoustic  Transient  Processing,\"  Neural  Information \nProcessing Systems (NIPS), Denver,  1996. \n\n[2]  K.  P.  Unnikrishnan,  J.  J.  Hopfield,  and  D.  W.  Tank, \n\n\"Connected-Digit  Speaker(cid:173)\nDependent Speech Recognition Using a Neural Network with Time-Delayed Connec(cid:173)\ntions,\"  IEEE Transactions on Signal Processing, 39, pp. 698-713,1991. \n\n[3]  R.  T.  Edwards,  G.  Cauwenberghs,  and F.  J.  Pineda,  \"A  Mixed~Signal Correlator for \nAcoustic Transient Classification,\"  International Symposium on Circuits and Systems \n(ISCAS), Hong Kong, June 1997. \n\n[4]  R.  T.  Edwards and G.  Cauwenberghs,  \"A Second-Order Log-Domain Bandpass Filter \nfor Audio Frequency Applications,\" International Symposium on Circuits and Systems \n(ISCAS), Monterey, CA, June  1998. \n\n[5]  K. Wang and S. Shamma, \"Representation of Acoustic Signals in the Primary Auditory \n\nCortex,\"  IEEE Trans.  Audio and Speech Processing, 3(5), pp. 382-395, 1995. \n\n[6]  F. J. Pineda, K. Ryals, D. Steigerwald, and P. Furth, \"Acoustic Transient Processing us(cid:173)\ning the Hopkins Electronic Ear,\"  World Conference on Neural Networks, Washington, \nD.C.,  1995. \n\n\f", "award": [], "sourceid": 1621, "authors": [{"given_name": "R.", "family_name": "Edwards", "institution": null}, {"given_name": "Gert", "family_name": "Cauwenberghs", "institution": null}, {"given_name": "Fernando", "family_name": "Pineda", "institution": null}]}