{"title": "Surface Learning with Applications to Lipreading", "book": "Advances in Neural Information Processing Systems", "page_first": 43, "page_last": 50, "abstract": null, "full_text": "Surface Learning with Applications to Lipreading\n\nChristoph Bregler *,**\n\n*Computer Science Division\nUniversity of California\nBerkeley, CA 94720\n\nStephen M. Omohundro **\n\n**Int. Computer Science Institute\n1947 Center Street, Suite 600\nBerkeley, CA 94704\n\nAbstract\n\nMost connectionist research has focused on learning mappings from one space to another (e.g. classification and regression). This paper introduces the more general task of learning constraint surfaces. It describes a simple but powerful architecture for learning and manipulating nonlinear surfaces from data. We demonstrate the technique on low-dimensional synthetic surfaces and compare it to nearest-neighbor approaches. We then show its utility in learning the space of lip images in a system for improving speech recognition by lipreading. This learned surface is used to improve the visual tracking performance during recognition.\n\n1  Surface Learning\n\nMappings are an appropriate representation for systems whose variables naturally decompose into \"inputs\" and \"outputs\". To use a learned mapping, the input variables must be known and error-free and a single output value must be estimated for each input. Many tasks in vision, robotics, and control must maintain relationships between variables which don't naturally decompose in this way. Instead, there is a nonlinear constraint surface on which the values of the variables are jointly restricted to lie. We propose a representation for such surfaces which supports a wide range of queries and which can be naturally learned from data.\n\nThe simplest queries are \"completion queries\". 
In these queries, the values of certain variables are specified and the values (or constraints on the values) of the remaining variables are to be determined. This reduces to a conventional mapping query if the \"input\" variables are specified and the system reports the values of the corresponding \"output\" variables. Such queries can also be used to invert mappings, however, by specifying the \"output\" variables in the query. Figure 1 shows a generalization in which the variables are known to lie within certain ranges and the constraint surface is used to further restrict these ranges.\n\nFigure 1: Using a constraint surface to reduce uncertainty in two variables.\n\nFor recognition tasks, \"nearest point\" queries, in which the system must return the surface point which is closest to a specified sample point, are important (Figure 2). For example, symmetry-invariant classification can be performed by taking the surface to be generated by applying all symmetry operations to class prototypes (e.g. translations, rotations, and scalings of exemplar characters in an OCR system). In our representation we are able to efficiently find the globally nearest surface point in this kind of query.\n\nFigure 2: Finding the closest point in a surface to a given point.\n\nOther important classes of queries are \"interpolation queries\" and \"prediction queries\". For these, two or more points on a curve are specified and the goal is to interpolate between them or extrapolate beyond them. Knowledge of the constraint surface can dramatically improve performance over \"knowledge-free\" approaches like linear or spline interpolation.\n\nIn addition to supporting these and other queries, one would like a representation which can be efficiently learned. 
The training data is a set of points randomly drawn from the surface. The system should generalize from these training points to form a representation of the surface (Figure 3). This task is more difficult than mapping learning for several reasons: 1) The system must discover the dimension of the surface, 2) The surface may be topologically complex (e.g. a torus or a sphere) and may not support a single set of coordinates, 3) The broader range of queries discussed above must be supported.\n\nFigure 3: Surface Learning\n\nOur approach starts from the observation that if the data points were drawn from a linear surface, then a principal components analysis could be used to discover the dimension of the linear space and to find the best-fit linear space of that dimension. The largest principal vectors would span the space and there would be a precipitous drop in the principal values at the dimension of the surface. A principal components analysis will no longer work, however, when the surface is nonlinear because even a 1-dimensional curve could be embedded so as to span all the dimensions of the space.\n\nIf a nonlinear surface is smooth, however, then each local piece looks more and more linear under magnification. If we consider only those data points which lie within a local region, then to a good approximation they come from a linear surface patch. The principal values can be used to determine the most likely dimension of the surface, and that number of the largest principal components span its tangent space (Omohundro, 1988). 
The key idea behind our representations is to \"glue\" these local patches together using a partition of unity.\n\nWe are exploring several implementations, but all the results reported here come from a representation based on the \"nearest point\" query. The surface is represented as a mapping from the embedding space to itself which takes each point to the nearest surface point. K-means clustering is used to determine an initial set of \"prototype centers\" from the data points. A principal components analysis is performed on a specified number of the nearest neighbors of each prototype. These \"local PCA\" results are used to estimate the dimension of the surface and to find the best linear projection in the neighborhood of prototype i. The influence of these local models is determined by Gaussians centered on the prototype location with a variance determined by the local sample density. The projection onto the surface is determined by forming a partition of unity from these Gaussians and using it to form a convex linear combination of the local linear projections:\n\nP(x) = [ sum_i G_i(x) P_i(x) ] / [ sum_j G_j(x) ]    (1)\n\nwhere P_i denotes the local linear projection at prototype i and G_i its Gaussian influence function.\n\nThis initial model is then refined to minimize the mean squared error between the training samples and the nearest surface point using EM optimization and gradient descent.\n\nFigure 4: Learning a 1-dimensional surface. a) The surface to learn b) The local patches and the range of their influence functions c) The learned surface\n\n2  Synthetic Examples\n\nTo see how this approach works, consider 200 samples drawn from a 1-dimensional curve in a two-dimensional space (Figure 4a). 16 prototype centers are chosen by k-means clustering. At each center, a local principal components analysis is performed on the closest 20 training samples. 
Figure 4b shows the prototype centers and the two local principal components as straight lines. In this case, the larger principal value is several times larger than the smaller one. The system therefore attempts to construct a one-dimensional learned surface. The circles in Figure 4b show the extent of the Gaussian influence functions for each prototype. Figure 4c shows the resulting learned surface. It was generated by randomly selecting 2000 points in the neighborhood of the surface and projecting them according to the learned model.\n\nFigure 5 shows the same process applied to learning a two-dimensional surface embedded in three dimensions.\n\nTo quantify the performance of this learning algorithm, we studied the effect of the different parameters on learning a two-dimensional sphere in three dimensions. It is easy to compare the learned results with the correct ones in this case. Figure 6a shows how the empirical error in the nearest point query decreases as a function of the number of training samples. We compare it against the error made by a nearest-neighbor algorithm. With 50 training samples our approach produces an error which is one-fourth as large. Figure 6b shows how the average size of the local principal values depends on the number of nearest neighbors included. Because this is a two-dimensional surface, the two largest values are well-separated from the third largest. The rate of growth of the principal values is useful for determining the dimension of the surface in the presence of noise.
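
As a rough illustration of the construction used in these experiments, the following Python sketch fits local PCA patches to noisy samples from a unit circle and blends the local projections with normalized Gaussian weights. Several choices are assumptions made for the sketch and do not come from the paper: prototype centers are taken at regular intervals through the angle-ordered samples in place of k-means, the Gaussian variance is set to the mean squared distance to the neighbors, the EM and gradient descent refinement is omitted, and all function names are hypothetical.

```python
import math
import random


def make_circle_samples(n=200, noise=0.01, seed=0):
    '''Noisy samples from the unit circle, ordered by angle (toy data).'''
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        angle = 2.0 * math.pi * i / n
        radius = 1.0 + rng.gauss(0.0, noise)
        samples.append((radius * math.cos(angle), radius * math.sin(angle)))
    return samples


def local_pca_2d(points):
    '''Return mean, unit principal direction, and (large, small) principal values.'''
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    half_gap = math.hypot(0.5 * (sxx - syy), sxy)
    mid = 0.5 * (sxx + syy)
    direction = (math.cos(theta), math.sin(theta))
    return (mx, my), direction, (mid + half_gap, mid - half_gap)


def build_model(samples, n_centers=16, n_neighbors=20):
    '''Fit one local linear patch per prototype center.'''
    step = len(samples) // n_centers
    patches = []
    # Assumption: evenly spaced samples stand in for k-means prototype centers.
    for center in samples[::step][:n_centers]:
        nearest = sorted(samples, key=lambda p: math.dist(p, center))[:n_neighbors]
        mean, direction, values = local_pca_2d(nearest)
        # Assumption: Gaussian variance = mean squared distance to the neighbors.
        var = sum(math.dist(p, center) ** 2 for p in nearest) / n_neighbors
        patches.append((center, mean, direction, values, var))
    return patches


def project(x, patches):
    '''Blend the local 1-D projections of x with a Gaussian partition of unity.'''
    weights, images = [], []
    for center, mean, d, _, var in patches:
        # Project x onto the line through the local mean along the principal direction.
        t = (x[0] - mean[0]) * d[0] + (x[1] - mean[1]) * d[1]
        images.append((mean[0] + t * d[0], mean[1] + t * d[1]))
        weights.append(math.exp(-math.dist(x, center) ** 2 / (2.0 * var)))
    total = sum(weights)
    px = sum(w * p[0] for w, p in zip(weights, images)) / total
    py = sum(w * p[1] for w, p in zip(weights, images)) / total
    return (px, py)


samples = make_circle_samples()
patches = build_model(samples)
print(project((1.15, 0.0), patches))
```

A query far from every prototype would make all the weights underflow to zero, so a practical version would need a fallback such as using the nearest patch alone.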
\n\nFigure 5: Learning a two-dimensional surface in three dimensions. a) 1000 random samples on the surface b) The two largest local principal components at each of 100 prototype centers based on 25 nearest neighbors.\n\nFigure 6: Quantitative performance on learning a two-dimensional sphere in three dimensions. a) Mean squared error of closest point queries as a function of the number of samples for the learned surface vs. nearest training point b) The mean square root of the three principal values as a function of the number of neighbors included in each local PCA.\n\n3  Modelling the space of lips\n\nFigure 7: Snakes for finding the lip contours. a) A correctly placed snake b) A snake which has gotten stuck in a local minimum of the simple energy function.\n\nWe are using this technique as a part of a system to do \"lipreading\". To provide features for \"viseme classification\" (visemes are the visual analog of phonemes), we would like the system to reliably track the shape of a speaker's lips in video images. It should be able to identify the corners of the lips and to estimate the bounding curves robustly under a variety of imaging and lighting conditions. 
Two approaches to this kind of tracking task are \"snakes\" (Kass et al., 1987) and \"deformable templates\" (Yuille, 1991). Both of these approaches minimize an \"energy function\" which is a sum of an internal model energy and an energy measuring the match to external image features.\n\nFor example, to use the \"snake\" approach for lip tracking, we form the internal energy from the first and second derivatives of the coordinates along the snake, preferring smoother snakes to less smooth ones. The external energy is formed from an estimate of the negative image gradient along the snake. Figure 7a shows a snake which has correctly relaxed onto a lip contour. This energy function is not very specific to lips, however. For example, the internal energy just causes the snake to be a controlled continuity spline. The \"lip-snakes\" sometimes relax onto undesirable local minima like that shown in Figure 7b. Models based on deformable templates allow a researcher to more strongly constrain the shape space (typically with hand-coded quadratic linking polynomials), but are difficult to use for representing fine-grained lip features.\n\nOur approach is to use surface learning as described here to build a model of the space of lips. We can then replace the internal energy described above by a quantity computed from the distance to the learned surface in lip feature space.\n\nOur training set consists of 4500 images of a speaker uttering random words [1]. The training images are initially \"labeled\" with the conventional snake algorithm. Incorrectly aligned snakes are removed from the database by hand. The contour shape is parameterized by the x and y coordinates of 40 evenly spaced points along the snake. All values are normalized to give a lip width of 1. 
Each lip contour is therefore a point in an 80-dimensional \"lip-space\". The lip configurations which actually occur lie on a lower-dimensional surface embedded in this space. Our experiments show that a 5-dimensional surface in the 80-dimensional lip space is sufficient to describe the contours with single-pixel accuracy in the image. Figure 8 shows some lip models along two of the principal axes in the local neighborhood of one of the patches. The lip recognition system uses this learned surface to improve the performance of tracking on new image sequences.\n\n[1] The data was collected for an earlier lipreading system described in (Bregler, Hild, Manke, Waibel 1993).\n\nFigure 8: Two principal axes in a local patch in lip space. a, b, and c are configurations along the first principal axis, while d, e, and f are along the third axis.\n\nFigure 9: a) Initial crude estimate of the contour b) An intermediate step in the relaxation c) The final contour.\n\nThe tracking algorithm starts with a crude initial estimate of the lip position and size. It chooses the closest model in the lip surface and maps the corresponding resized contour back onto the estimated image position (Figure 9a). The external image energy is taken to be the cumulative magnitude of graylevel gradient estimates along the current contour. This term has its maximum value when the curve is aligned exactly on the lip boundary. We perform gradient ascent in the contour space, but constrain the contour to lie in the learned lip surface. This is achieved by reprojecting the contour onto the lip surface after each gradient step. 
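
The reprojection step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the tracker itself: the unit circle stands in for the learned lip surface, a quadratic function stands in for the image energy, and the names project_to_surface, energy_gradient, and constrained_ascent are hypothetical.

```python
import math


def project_to_surface(x):
    '''Stand-in for the learned nearest-point map: nearest point on the unit circle.'''
    r = math.hypot(x[0], x[1])
    return (x[0] / r, x[1] / r)


def energy_gradient(x, target=(2.0, 1.0)):
    '''Gradient of the toy energy E(x) = -|x - target|^2 (stand-in for the image term).'''
    return (-2.0 * (x[0] - target[0]), -2.0 * (x[1] - target[1]))


def constrained_ascent(x0, step=0.1, n_steps=100):
    '''Gradient ascent on the energy, reprojecting onto the surface after every step.'''
    x = project_to_surface(x0)
    for _ in range(n_steps):
        g = energy_gradient(x)
        x = project_to_surface((x[0] + step * g[0], x[1] + step * g[1]))
    return x


# Starting from (0, 1), the iterate slides along the circle toward the point
# of the circle that lies closest to the target.
print(constrained_ascent((0.0, 1.0)))
```

In the tracker, project_to_surface would be replaced by the learned nearest-point map and energy_gradient by the gradient of the graylevel gradient term along the contour.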
The surface thereby acts as the analog of the internal energy in the snake and deformable template approaches. Figure 9b shows the result after a few steps and Figure 9c shows the final contour. The image gradient is estimated using an image filter whose width is gradually reduced as the search proceeds.\n\nThe lip contours in successive images in the video sequence are found by starting with the relaxed contour from the previous image and performing gradient ascent with the altered external image energies. Empirically, surface-based tracking is far more robust than the \"knowledge-free\" approaches. While we have described the approach in the context of contour finding, it is much more general and we are currently extending the system to model more complex aspects of the image.\n\nThe full lipreading system, which combines the described tracking algorithm and a hybrid connectionist speech recognizer (MLP/HMM), is described in (Bregler and Konig 1994). Additionally, we will use the lip surface to interpolate visual features to match them with the higher-rate auditory features.\n\n4  Conclusions\n\nWe have presented the task of learning surfaces from data and described several important queries that the learned surfaces should support: completion, nearest point, interpolation, and prediction. We have described an algorithm which is capable of efficiently performing these tasks and demonstrated it on both synthetic data and on a real-world lip-tracking problem. The approach can be made computationally efficient using the \"bumptree\" data structure described in (Omohundro, 1991). We are currently studying the use of \"model merging\" to improve the representation and are also applying it to robot control.
\n\nAcknowledgements\n\nThis research was funded in part by Advanced Research Project Agency contract #NOOOO  1493  C0249 and by the International Computer Science Institute. The database was collected with a grant from Land Baden Wuerttenberg (Landesschwerpunkt Neuroinformatik) at Alex Waibel's institute.\n\nReferences\n\nC. Bregler, H. Hild, S. Manke & A. Waibel. (1993) Improving Connected Letter Recognition by Lipreading. In Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing, Minneapolis.\n\nC. Bregler & Y. Konig. (1994) \"Eigenlips\" for Robust Speech Recognition. In Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing, Adelaide.\n\nM. Kass, A. Witkin & D. Terzopoulos. (1987) SNAKES: Active Contour Models. In Proc. of the First Int. Conf. on Computer Vision, London.\n\nS. Omohundro. (1988) Fundamentals of Geometric Learning. University of Illinois at Urbana-Champaign Technical Report UIUCDCS-R-88-1408.\n\nS. Omohundro. (1991) Bumptrees for Efficient Function, Constraint, and Classification Learning. In Lippmann, Moody, and Touretzky (eds.), Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.\n\nA. Yuille. (1991) Deformable Templates for Face Recognition. Journal of Cognitive Neuroscience, Volume 3, Number 1.\n", "award": [], "sourceid": 814, "authors": [{"given_name": "Christoph", "family_name": "Bregler", "institution": null}, {"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}