{"title": "A Comparison of Image Processing Techniques for Visual Speech Recognition Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 939, "page_last": 945, "abstract": null, "full_text": "A Comparison of Image Processing Techniques for Visual Speech Recognition Applications\n\nMichael S. Gray\nComputational Neurobiology Laboratory\nThe Salk Institute\nSan Diego, CA 92186-5800\n\nTerrence J. Sejnowski\nComputational Neurobiology Laboratory\nThe Salk Institute\nSan Diego, CA 92186-5800\n\nJavier R. Movellan*\nDepartment of Cognitive Science\nInstitute for Neural Computation\nUniversity of California San Diego\n\nAbstract\n\nWe examine eight different techniques for developing visual representations in machine vision tasks. In particular, we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance.\n\n1  Introduction\n\nWe study the performance of eight different methods for developing image representations based on the statistical properties of the images at hand. These methods are compared on their performance on a visual speech recognition task. While the representations developed are specific to visual speech recognition, the methods themselves are general purpose and applicable to other tasks. 
Our focus is on low-level data-driven methods based on the statistical properties of relatively untouched images, as opposed to approaches that work with contours or highly processed versions of the image. Padgett [8] and Bartlett [1] systematically studied statistical methods for developing representations on expression recognition tasks. They found that local wavelet-like representations consistently outperformed global representations, like eigenfaces. In this paper we also compare local versus global representations. The main differences between our work and that in [8] and [1] are: (1) We use image sequences while they used static images; (2) Our work involves images of the mouth region while their work involves images of the entire face; (3) Our recognition engine is a bank of hidden Markov models while theirs is a backpropagation network [8] and a nearest neighbor classifier [1]. In addition to the comparison of local and global representations, we propose an unsupervised method for automatically selecting regions and variables of interest.\n\n* To whom correspondence should be addressed.\n\nFigure 1: The normalization procedure. In each panel, the \"+\" indicates the center of the lips, and the \"0\" indicates the center of the image. The location of the lips was automatically determined using Luettin et al.'s point density model for lip tracking: (1) Original image; (2) The center of the lips was translated to the center of the image; (3) The image was rotated in the plane to horizontal; (4) The lips were scaled to a constant reference width; (5) The image was symmetrized relative to the vertical midline; (6) The intensity was normalized using a logistic gain control procedure.
\n\n2  Preprocessing and Recognition Engine\n\nThe task was recognition of the words \"one\", \"two\", \"three\" and \"four\" from the Tulips1 [7] database. The database consists of movies of 12 subjects, each uttering the digits in English twice. While the number of words is limited, the database is challenging due to differences in illumination conditions, ethnicity and gender of the subjects. Image preprocessing consisted of the following steps. First, the contour of the outer lips was tracked using point distribution models, a data-driven technique based on analysis of the gray-level statistics around lip contours [5]. The lip images were then normalized for translation and rotation. This was accomplished by first padding the image on all sides with 25 rows or columns of zeros, and modulating the images in the spatial frequency domain. The images were symmetrized with respect to the vertical axis going through the center of the lips. This makes the final representation more robust to horizontal changes in illumination. The images were cropped to 65 pixels vertically x 87 pixels horizontally (see Figure 1) and their intensity was normalized using logistic gain control [7]. Eight different techniques were used on the normalized database, each of which developed a different image basis. 
For each of these techniques the following steps were followed: (1) Projection: For each image in the database we compute the coordinates x(t) of the image with respect to the image bases developed using each of the eight techniques; (2) Temporal differentiation: For each time step we compute the vectors \u03b4(t) = x(t) - x(t - 1), where x(t) represents the coordinate vector of the image presented at time t; (3) Gain control: Each component of x(t) and \u03b4(t) is independently scaled using a logistic gain control function matched to the mean and variance of each component across an entire movie [7]. This results in a form of soft histogram equalization; (4) Recognition: The scaled x(t) and \u03b4(t) coefficients are fed to the HMM recognition engine.\n\nFigure 2: Global decompositions for the normalized image dataset. Row 1: Global kernels of principal component analysis ordered with first eigenimage on left. Row 2: Log magnitude spectrum of eigenimages. Row 3: Global pixel space independent component kernels ordered according to projected variance. Row 4: Log magnitude spectrum of global independent components.\n\n3  Global Methods\n\nWe first evaluated the performance of techniques based on the statistics of the entire lip images as opposed to portions of them. This global approach has been shown to provide good performance on face recognition [9], expression recognition [2], and gender recognition tasks [4]. In particular, we compared the performance of principal component analysis (PCA) and two different versions of independent component analysis (ICA).\n\n3.1  Global PCA:\n\nWe tried image bases that consisted of the first 50, 100 and 150 eigenvectors of the pixelwise covariance matrix. 
Best results were obtained with the first 50 principal components (which accounted for 94.6% of the variance) and are the only ones reported here. The top row of Figure 2 shows the first 5 eigenvectors displayed as images; their magnitude spectrum is shown in the second row. These eigenimages have most of their energy localized in low and horizontal spatial frequencies and are typically non-local in the spatial domain (i.e., have non-zero energy distributed over the whole image).\n\nFigure 3: Upper left: Lip patches (12 pixels x 12 pixels) from randomly chosen locations used to develop local PCA and local ICA kernels. Lower left: Four orthogonal images generated from a single local PCA kernel. Right: Top 10 local PCA and ICA kernels ordered according to projected variance (highest at top left). Note how the ICA vectors tend to be more local and consistent with the receptive fields found in V1.\n\n3.2  Global ICA:\n\nThe goal of Infomax ICA is to transform an input random vector such that the entropy of the output vector is maximized [3]. The main differences between ICA and PCA are: (1) ICA maximizes the joint entropy of the outputs, while PCA maximizes the sum of their variance; (2) PCA provides orthogonal basis vectors, while ICA basis vectors need not be orthogonal; (3) PCA outputs are always uncorrelated, but may not be statistically independent. ICA attempts to extract independent outputs, not just uncorrelated ones. We tried two different ICA approaches:\n\nICA I: This method results in a non-orthogonal transformation of the bases developed via PCA. While such transformations do not change the underlying space of the representation, they may facilitate the job of the recognition engine by decreasing the statistical dependency amongst the coordinates. 
First, each image in the database was projected onto the space spanned by the first 50 eigenvectors of the pixelwise covariance matrix. Then ICA was performed on the 50 PCA coordinate variables to obtain a new 50-dimensional non-orthogonal basis.\n\nICA II: A different approach to ICA was explored in [1] for face recognition tasks and by [6] for fMRI images. While in ICA I the goal is to develop independent image coordinates, in ICA II the goal is for the image bases themselves to be independent. Here independence of images is defined with respect to a probability space in which pixels are seen as outcomes and images as random vectors of such outcomes. The approach, which is described in detail in [6], resulted in a set of 50 images which were a non-orthogonal linear transformation of the first 50 eigenvectors of the pixelwise covariance matrix. The first 5 images (accounting for the largest amounts of projected variance) obtained via this approach to ICA are shown in the third row of Figure 2. The fourth row shows their magnitude spectrum. As reported in [1], the images obtained using this method are more local than those obtained via PCA.\n\n4  Local Methods\n\nPadgett et al. [8] reported surprisingly good results on an emotion recognition task using PCA on random patches of the face instead of the entire face. Recent theoretical work also places emphasis on spatially localized, wavelet-like image bases. One potential advantage of spatially localized image bases is that they provide explicit information about where things are happening, not just about what is happening. This facilitates the work of recognition engines on some tasks, but the theoretical reasons for this are unclear at this point. 
\n\nFigure 4: Kernel-location combinations chosen using unblocked variable selection. Panels: PCA Kernel 1, PCA Kernel 2, ICA Kernel 1, ICA Kernel 9. Top of each quadrant: Local ICA or PCA kernel. Bottom of each quadrant: Lip image convolved with corresponding local kernel, then downsampled. The numbers on the lip image indicate the order in which variables were chosen for the multiple regression procedure. There are no numbers on the right side of the lip images because only half of each lip image was used for the representation (since the images are symmetrized).\n\nLocal PCA and ICA kernels were developed based on a database of 18680 small patches (12 pixel x 12 pixel) chosen from random locations in the Tulips1 database. A sample of these random patches (superimposed on a lip image) is shown in the top panel of Figure 3. Hereafter we refer to the 12 pixel x 12 pixel images obtained via PCA or ICA as \"kernels\". Image bases were generated by centering a local PCA or ICA kernel onto different locations and padding the rest of the matrix with zeros, as displayed in Figure 3 (lower left panel). This results in basis images which are local in space (the energy is localized about a single patch) and shifted versions of each other. The process of obtaining image coordinates can be seen as a filtering operation followed by subsampling: First the images are filtered using a bank of filters whose impulse responses are the kernels obtained via PCA (or ICA). The relevant coordinates are obtained by subsampling at 300 uniformly distributed locations (15 locations vertically by 20 locations horizontally). 
We explored four different filtering approaches: (1) Single linear shift invariant filter (LSI); (2) Single linear shift variant filter (LSV); (3) Bank of LSI filters with blocked selection; (4) Bank of LSI filters combined with unblocked selection.\n\nFor the single-filter LSI approach, the images were convolved with a single local ICA kernel or a local PCA kernel. The top 5 local PCA and ICA kernels were each tested separately and the results obtained with the best of the 5 kernels were reported. For the single LSV-filtering approach, different local PCA kernels were derived for a total of 117 non-overlapping regions, each of which occupied 5 x 5 pixels. Each region of the 934 images was projected onto the first principal component corresponding to that location. This effectively resulted in an LSV filtering operation.\n\n4.1  Automatic Selection of Focal Points\n\nPadgett's [8] most successful method was based on outputs of local filters at manually selected focal regions. Their task was emotion recognition and the focal regions were the eyes and mouth. In visual speech recognition, once the lips are chosen it is unclear which regions would be most informative. Thus we developed a method for automatic selection of focal regions.\n\nTable 1: Best generalization performance (% correct) \u00b1 standard error of the mean for all image representations.\n\nImage Processing                Performance \u00b1 s.e.m.\nGlobal Methods\n  Global PCA                    79.2 \u00b1 4.7\n  Global ICA I                  61.5 \u00b1 4.5\n  Global ICA II                 74.0 \u00b1 5.4\nLocal Methods\n  Single-Filter LSI PCA         90.6 \u00b1 3.1\n  Single-Filter LSI ICA         89.6 \u00b1 3.0\n  Blocked Filter Bank PCA       85.4 \u00b1 3.7\n  Blocked Filter Bank ICA       85.4 \u00b1 3.0\n  Unblocked Filter Bank PCA     91.7 \u00b1 2.8\n  Unblocked Filter Bank ICA     91.7 \u00b1 3.2
\n\nFirst, 10 filters were developed via local ICA (or PCA). Each image was filtered using the 10-filter bank and the outputs were subsampled at 150 locations for a 1500-dimensional representation (10 filters x 150 locations) of each of the images in the dataset. Regions and variables of interest were then selected using a stepwise forward multiple regression procedure. First we choose the variable that, when averaging across the entire database, best reconstructed the original images. Here best reconstruction is defined in terms of least squares using a multiple regression model. Once a variable is selected, it is \"tenured\" and we search for the variable which, in combination with the tenured ones, best reconstructs the image database. The procedure is stopped when the number of tenured variables reaches a criterion point. We compared performance using 50, 100, and 150 tenured variables and report results with the best of those three numbers. We tested two different selection procedures, one blocked by location and one in which location was not blocked. In the first method the selection was done in blocks of 10 variables, where each block contained the outputs of all the filters at a specific location. If a location was chosen, the outputs of the 10 filters in that location were automatically included in the final image representation. In the second method selection of variables was not blocked by location.\n\nFigure 4 shows, for 2 local PCA and 2 local ICA kernels, the first 10 variables chosen for each particular kernel using the forward selection multiple regression procedure. The numbers on the lip images in this figure indicate the order in which particular kernel/location variables were chosen using the sequential regression procedure: \"1\" indicates the first variable chosen, \"2\" the second, etc. 
\n\n5  Results and Conclusions\n\nTable 1 shows the best generalization performance (out of the 9 HMM architectures tested) for each of the eight image representation methods. The local decompositions significantly outperformed the global ones (t(106) = 4.10, p < 0.001). The improved performance of local representations is consistent with current ideas on the importance of localized wavelet-like representations. However, it is unclear why local decompositions work better. One possibility is that these results apply only to this particular recognition engine and the problem at hand (i.e., hidden Markov models for speechreading). Yet similar results with local representations were reported in [8] on an emotion classification task with a 3 layer backpropagation network and in [1] on an expression classification task with a nearest neighbor classifier. Another possible explanation for the advantage of local representations is that global unsupervised decompositions emphasize subject identity while local decompositions tend to hide it. We found some evidence consistent with this idea by testing global and local representations on a subject identification task (i.e., recognizing which person the lip images belong to). For this task the global representations outperformed the local ones. However, this result is inconsistent with [8], which found local representations were better on emotion classification and on subject identification tasks. Another possibility is that local representations make more explicit information about where things are happening, not just what is happening, and such information turns out to be important for the task at hand.\n\nThe image representations obtained using the bank of filters methods with unblocked selection yielded the best results. 
The stepwise regression technique used to select kernels and regions of interest led to substantial gains in recognition performance. In fact, the highest generalization performance reported here (91.7% with the bank of filters using unblocked variable selection) surpassed the best published performance on this dataset [5].\n\nReferences\n\n[1] M.S. Bartlett. Face Image Analysis by Unsupervised Learning and Redundancy Reduction. PhD thesis, University of California, San Diego, 1998.\n\n[2] M.S. Bartlett, P.A. Viola, T.J. Sejnowski, J. Larsen, J. Hager, and P. Ekman. Classifying facial action. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 823-829. Morgan Kaufmann, San Mateo, CA, 1996.\n\n[3] A.J. Bell and T.J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.\n\n[4] G. Cottrell and J. Metcalfe. Face, gender and emotion recognition using holons. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 3, pages 564-571, San Mateo, CA, 1991. Morgan Kaufmann.\n\n[5] Juergen Luettin. Visual Speech and Speaker Recognition. PhD thesis, University of Sheffield, 1997.\n\n[6] M.J. McKeown, S. Makeig, G.G. Brown, T-P. Jung, S.S. Kindermann, A.J. Bell, and T.J. Sejnowski. Analysis of fMRI data by decomposition into independent components. Proc. Nat. Acad. Sci., in press.\n\n[7] J.R. Movellan. Visual speech recognition with stochastic networks. In G. Tesauro, D.S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 851-858. MIT Press, Cambridge, MA, 1995.\n\n[8] C. Padgett and G. Cottrell. 
Representing face images for emotion classification. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, Cambridge, MA, 1997. MIT Press.\n\n[9] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.\n", "award": [], "sourceid": 1877, "authors": [{"given_name": "Michael", "family_name": "Gray", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}]}