{"title": "Some Solutions to the Missing Feature Problem in Vision", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 400, "abstract": null, "full_text": "Some Solutions to the Missing Feature Problem \n\nin Vision \n\nSubutai Ahmad \n\nSiemens AG, \n\nVolker Tresp \nSiemens AG, \n\nCentral Research and Development \nZFE ST SN61, Otto-Hahn Ring 6 \n\n8000 Miinchen 83, Gennany. \n\nahmad@icsi.berkeley.edu \n\nCentral Research and Development \nZFE ST SN41, Otto-Hahn Ring 6 \n\n8000 Miinchen 83, Gennany. \ntresp@inf21.zfe.siemens.de \n\nAbstract \n\nIn visual processing the ability to deal with  missing and noisy informa(cid:173)\ntion is crucial. Occlusions and unreliable feature detectors often lead to \nsituations where little or no  direct information about features  is  availa(cid:173)\nble.  However  the  available  information  is  usually  sufficient  to  highly \nconstrain  the  outputs.  We  discuss  Bayesian  techniques  for  extracting \nclass probabilities given partial data. The optimal solution involves inte(cid:173)\ngrating  over the  missing dimensions  weighted  by  the  local  probability \ndensities.  We  show  how  to  obtain  closed-form  approximations  to  the \nBayesian  solution  using  Gaussian  basis function  networks.  The frame(cid:173)\nwork  extends  naturally  to  the case of noisy  features.  Simulations on a \ncomplex task (3D  hand gesture recognition)  validate the  theory.  When \nboth integration and weighting by input densities are used, performance \ndecreases gracefully with the number of missing or noisy features.  Per(cid:173)\nformance is substantially degraded if either step is omitted. \n\n1  INTRODUCTION \n\nThe ability to deal with missing or noisy features is vital in vision. One is often faced with \nsituations in  which  the full  set of image features  is  not computable.  In  fact,  in 3D object \nrecognition, it is highly unlikely that all features will be available. This can be due to self(cid:173)\nocclusion, occlusion  from  other objects,  shadows,  etc.  To  date  the  issue of missing  fea(cid:173)\ntures  has  not been  dealt with  in  neural  networks  in  a  systematic  way.  Instead  the  usual \npractice is to substitute a single value for the missing feature (e.g. 0,  the mean value of the \nfeature,  or a  pre-computed  value)  and  use  the  network's  output on  that  feature  vector. \n\n393 \n\n\f394 \n\nAhmad and  Tresp \n\ny \n\n5 \n\ny \n\n4 \n\n3 \n\n............  Yo \n\nx \n\nYo \n\n(a) \n\nx \n\n(b) \n\nFigure 1. The images show two possible situations for a 6-class classification problem. (Dark \nshading denotes high-probability regions.) If the value of feature x is unknown, the correct \nsolution depends both on the classification boundaries along the missing dimension and on the \ndistribution of exemplars. \n\nWhen the  features  are  known  to  be noisy,  the  usual  practice  is to just use  the  measured \nnoisy  features  directly.  The point of this paper is  to  show  that these approaches are  not \noptimal and that it is possible to do much better. \nA simple example serves to illustrate why one needs to be careful in dealing with missing \nfeatures. Consider the situation depicted in Figure 1 (a). It shows a 2 -d feature space with 6 \npossible classes.  Assume  a  network  has  already  been  trained  to  correctly classify  these \nregions.  During classification of a  novel exemplar. 
2 MISSING FEATURES

We first show how the intuitive arguments outlined above for missing inputs can be formalized using Bayes' rule. Let x represent a complete feature vector. We assume the classifier outputs good estimates of p(C_i|x) (most reasonable classifiers do; see (Richard & Lippmann, 1991)). In a given instance, x can be split up into x_c, the vector of known (certain) features, and x_u, the unknown features. When features are missing the task is to estimate p(C_i|x_c). Computing marginal probabilities we get:

p(C_i \mid x_c) = \frac{\int p(C_i \mid x_c, x_u) \, p(x_c, x_u) \, dx_u}{p(x_c)}    (1)

Note that p(C_i|x_c, x_u) is approximated by the network output and that in order to use (1) effectively we need estimates of the joint probabilities of the inputs.

3 NOISY FEATURES

The missing feature scenario can be extended to deal with noisy inputs. (Missing features are simply noisy features in the limiting case of complete noise.) Let x_c be the vector of features measured with complete certainty, x_u the vector of measured, uncertain features, and x_tu the true values of the features in x_u. p(x_u|x_tu) denotes our knowledge of the noise (i.e. the probability of measuring the (uncertain) value x_u given that the true value is x_tu). We assume that this is independent of x_c and C_i, i.e. that p(x_u|x_tu, x_c, C_i) = p(x_u|x_tu). (Of course the value of x_tu is dependent on x_c and C_i.) We want to compute p(C_i|x_c, x_u). This can be expressed as:

p(C_i \mid x_c, x_u) = \frac{\int p(x_c, x_u, x_{tu}, C_i) \, dx_{tu}}{p(x_c, x_u)}    (2)

Given the independence assumption, this becomes:

p(C_i \mid x_c, x_u) = \frac{\int p(C_i \mid x_c, x_{tu}) \, p(x_c, x_{tu}) \, p(x_u \mid x_{tu}) \, dx_{tu}}{\int p(x_c, x_{tu}) \, p(x_u \mid x_{tu}) \, dx_{tu}}    (3)

As before, p(C_i|x_c, x_tu) is given by the classifier. (3) is almost the same as (1) except that the integral is also weighted by the noise model. Note that in the case of complete uncertainty about the features (i.e. the noise is uniform over the entire range of the features), the equations reduce to the missing feature case.
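Before specializing to particular networks, note that (1) and (3) can be approximated directly by numerical integration whenever estimates of p(C_i|x) and p(x) are available. The sketch below does this on a grid for a single uncertain feature; the function names and the grid-based quadrature are our own illustrative choices, not part of the paper.

```python
# Generic numerical sketch of equations (1) and (3) for one uncertain feature.
# posterior(x) ~ p(C_i | x) and density(x) ~ p(x) are assumed to come from
# some trained models supplied by the caller.
import numpy as np

def uncertain_feature_posterior(posterior, density, x_c, grid, noise=None):
    """Approximate p(C_i | x_c, x_u) by summing over candidate true values
    x_tu of the uncertain feature (equation (3)). noise(x_tu) models
    p(x_u | x_tu); with noise=None the feature is treated as missing, the
    noise weight is constant and drops out, recovering equation (1)."""
    num, den = 0.0, 0.0
    for x_tu in grid:
        x = np.append(x_c, x_tu)                    # full feature vector
        w = density(x) * (noise(x_tu) if noise else 1.0)
        num += posterior(x) * w                     # classifier * joint density
        den += w
    return num / den
```

The same loop extends to several uncertain features by iterating over a product grid, whose cost grows exponentially; this is exactly what makes the closed forms of the next section attractive.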
4 GAUSSIAN BASIS FUNCTION NETWORKS

The above discussion shows how to optimally deal with missing and noisy inputs in a Bayesian sense. We now show how these equations can be approximated using networks of Gaussian basis functions (GBF nets). Let us consider GBF networks where the Gaussians have diagonal covariance matrices (Nowlan, 1990). Such networks have proven to be useful in a number of real-world applications (e.g. Roscheisen et al., 1992). Each hidden unit is characterized by a mean vector \mu_j and by \sigma_j, a vector representing the diagonal of the covariance matrix. The network output is:

y_i(x) = \frac{\sum_j w_{ij} b_j(x)}{\sum_j b_j(x)}, \quad \text{with } b_j(x) = \pi_j \, n(x; \mu_j, \sigma_j^2) = \frac{\pi_j}{(2\pi)^{d/2} \prod_k \sigma_{kj}} \exp\left[-\sum_k \frac{(x_k - \mu_{kj})^2}{2\sigma_{kj}^2}\right]    (4)

Here w_{ij} is the weight from the j'th basis unit to the i'th output unit, \pi_j is the probability of choosing unit j, and d is the dimensionality of x.

4.1 GBF NETWORKS AND MISSING FEATURES

Under certain training regimes such as Gaussian mixture modeling, EM or "soft clustering" (Duda & Hart, 1973; Dempster et al., 1977; Nowlan, 1990), or an approximation as in (Moody & Darken, 1988), the hidden units adapt to represent local probability densities. In particular y_i(x) \approx p(C_i|x) and p(x) \approx \sum_j b_j(x). This is a major advantage of this architecture and can be exploited to obtain closed-form solutions to (1) and (3). Substituting into (3) we get:

p(C_i \mid x_c, x_u) \approx \frac{\int \left(\sum_j w_{ij} b_j(x_c, x_{tu})\right) p(x_u \mid x_{tu}) \, dx_{tu}}{\int \left(\sum_j b_j(x_c, x_{tu})\right) p(x_u \mid x_{tu}) \, dx_{tu}}    (5)

For the case of missing features equation (5) can be computed directly. As noted before, equation (1) is simply (3) with p(x_u|x_tu) uniform. Since the infinite integral along each dimension of a multivariate normal density is equal to one, we get:

p(C_i \mid x_c) \approx \frac{\sum_j w_{ij} b_j(x_c)}{\sum_j b_j(x_c)}    (6)

(Here b_j(x_c) denotes the same function as in (4) except that it is only evaluated over the known dimensions given by x_c.) Equation (6) is appealing since it gives us a simple closed-form solution. Intuitively, the solution is nothing more than projecting the Gaussians onto the dimensions which are available and evaluating the resulting network. As the number of training patterns increases, (6) will approach the optimal Bayes solution.

4.2 GBF NETWORKS AND NOISY FEATURES

With noisy features the situation is a little more complicated and the solution depends on the form of the noise. If the noise is known to be uniform in some region [a, b] then equation (5) becomes:

p(C_i \mid x_c, x_u) \approx \frac{\sum_j w_{ij} b_j(x_c) \prod_{i \in U} \left[N(b_i; \mu_{ij}, \sigma_{ij}^2) - N(a_i; \mu_{ij}, \sigma_{ij}^2)\right]}{\sum_j b_j(x_c) \prod_{i \in U} \left[N(b_i; \mu_{ij}, \sigma_{ij}^2) - N(a_i; \mu_{ij}, \sigma_{ij}^2)\right]}    (7)

Here \mu_{ij} and \sigma_{ij}^2 select the i'th component of the j'th mean and variance vectors, and U ranges over the noisy feature indices. Good closed-form approximations to the normal distribution function N(x; \mu, \sigma^2) are available (Press et al., 1986), so (7) is efficiently computable.

With zero-mean Gaussian noise with variance \sigma_u^2, we can also write down a closed-form solution. In this case we have to integrate a product of two Gaussians and end up with:

p(C_i \mid x_c, x_u) \approx \frac{\sum_j w_{ij} b'_j(x_c, x_u)}{\sum_j b'_j(x_c, x_u)}, \quad \text{with } b'_j(x_c, x_u) = n(x_u; \mu_{ju}, \sigma_u^2 + \sigma_{ju}^2) \, b_j(x_c).
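These projections are straightforward to implement. The sketch below evaluates a trained GBF net under equation (6) and under the zero-mean Gaussian-noise form; the parameter names and the log-space arithmetic are our choices, while the math is exactly (4), (6), and the product-of-Gaussians form above.

```python
# Closed-form GBF evaluation with missing or Gaussian-noisy features.
# Assumed trained parameters: mu, var with shape (n_units, d), priors pi
# of shape (n_units,), output weights w of shape (n_units, n_classes).
import numpy as np

def gbf_posteriors(x, known, mu, var, pi, w, noise_var=None):
    """p(C | x) evaluated only over the features selected by the boolean
    mask `known` (equation (6)). If `noise_var` holds per-feature noise
    variances (zero for reliable features), each Gaussian's variance is
    inflated to var + noise_var, giving the Gaussian-noise closed form."""
    v = var[:, known]
    if noise_var is not None:
        v = v + noise_var[known]
    diff = x[known] - mu[:, known]
    # log of the projected basis activations b_j, equation (4) restricted
    # to the known dimensions, with the prior pi_j folded in
    log_b = np.log(pi) - 0.5 * np.sum(np.log(2 * np.pi * v) + diff**2 / v,
                                      axis=1)
    b = np.exp(log_b - log_b.max())     # stabilized b_j(x_c)
    return (w.T @ b) / b.sum()          # equation (6)
```

Features that are completely missing are simply left out of `known`; noisy-but-measured features stay in `known` with a positive entry in `noise_var`.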
5 BACKPROPAGATION NETWORKS

With a large training set, the outputs of a sufficiently large network trained with backpropagation converge to the optimal Bayes a posteriori estimates (Richard & Lippmann, 1991). If B_i(x) is the output of the i'th output unit when presented with input x, then B_i(x) \approx p(C_i|x). Unfortunately, access to the input distribution is not available with backpropagation. Without prior knowledge it is reasonable to assume a uniform input distribution, in which case the right hand side of (3) simplifies to:

p(C_i \mid x_c, x_u) \approx \frac{\int p(C_i \mid x_c, x_{tu}) \, p(x_u \mid x_{tu}) \, dx_{tu}}{\int p(x_u \mid x_{tu}) \, dx_{tu}}    (8)

The integral can be approximated using standard Monte Carlo techniques. With uniform noise in the interval [a, b], this becomes (ignoring normalizing constants):

p(C_i \mid x_c, x_u) \approx \int_a^b B_i(x_c, x_{tu}) \, dx_{tu}    (9)

With missing features the integral in (9) is computed over the entire range of each feature.
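The Monte Carlo approximation to (9) simply averages the network outputs over uniform samples of the uncertain features. The sketch below is one way to do this; the function `net`, the helper name, and the sample count are placeholders, since the paper reports only that a numerical approximation was used.

```python
# Monte Carlo sketch of equation (9): average the network outputs over
# uniform samples of the uncertain features. net(x) -> B(x) is any trained
# classifier returning a vector of class outputs.
import numpy as np

def bp_posteriors_mc(net, x, uncertain, lo, hi, n_samples=256, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_samples):
        x_s = np.array(x, dtype=float)
        x_s[uncertain] = rng.uniform(lo, hi)   # candidate true values in [a, b]
        total = total + net(x_s)
    return total / n_samples                   # proportional to (9)
```

For missing features, [lo, hi] is set to the entire range of each deleted feature.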
6 AN EXAMPLE TASK: 3D HAND GESTURE RECOGNITION

A simple realistic example serves to illustrate the utility of the above techniques. We consider the task of recognizing a set of hand gestures from single 2D images independent of 3D orientation (Figure 2). As input, each classifier is given the 2D polar coordinates of the five fingertip positions relative to the 2D center of mass of the hand (so the input space is 10-dimensional). Each classifier is trained on a training set of 4368 examples (624 poses for each gesture) and tested on a similar independent test set.

The task forms a good benchmark for testing performance with missing and uncertain inputs. The classification task itself is non-trivial. The classifier must learn to deal with hands (which are complex non-rigid objects) and with perspective projection (which is non-linear and non-invertible). In fact it is impossible to obtain a perfect score since in certain poses some of the gestures are indistinguishable (e.g. when the hand is pointing directly at the screen). Moreover, the task is characteristic of real vision problems.

[Figure 2: one hand image per gesture, labeled "five", "four", "three", "two", "one", "thumbs_up", and "pointing".]
Figure 2. Examples of the 7 gestures used to train the classifier. A 3D computer model of the hand is used to generate images of the hand in various poses. For each training example, we choose a 3D orientation, compute the 3D positions of the fingertips and project them onto 2D. For this task we assume that the correspondence between image and model features is known, and that during training all feature values are always available.

The position of each finger is highly (but not completely) constrained by the others, resulting in a very non-uniform input distribution. Finally, it is often easy to see what the classifier should output if features are uncertain. For example, suppose the real gesture is "five" but for some reason the features from the thumb are not reliably computed. In this case the gestures "four" and "five" should both get a positive probability whereas the rest should get zero. In many such cases only a single class should get the highest score, e.g. if the features for the little finger are uncertain the correct class is still "five".

We tried three classifiers on this task: standard sigmoidal networks trained with backpropagation (BP), and two types of Gaussian networks as described in Section 4. In the first (Gauss-RBF), the Gaussians were radial and the centers were determined using k-means clustering as in (Moody & Darken, 1988). \sigma^2 was set to twice the average distance of each point to its nearest Gaussian (all Gaussians had the same width). After clustering, \pi_j was set to

\pi_j = \sum_k \frac{n(x_k; \mu_j, \sigma_j^2)}{\sum_l n(x_k; \mu_l, \sigma_l^2)}

The output weights were then determined using LMS gradient descent. In the second (Gauss-G), each Gaussian had a unique diagonal covariance matrix. The centers and variances were determined using gradient descent on all the parameters (Roscheisen et al., 1992). Note that with this type of training, even though Gaussian hidden units are used, there is no guarantee that the distribution information will be preserved.

All classifiers were able to achieve a reasonable performance level. BP with 60 hidden units managed to score 95.3% and 93.3% on the training and test sets, respectively. Gauss-G with 28 hidden units scored 94% and 92%. Gauss-RBF scored 97.7% and 91.4% and required 2000 units to achieve it. (Larger numbers of hidden units led to overfitting.) For comparison, nearest neighbor achieves a score of 82.4% on the test set.

6.1 PERFORMANCE WITH MISSING FEATURES

We tested the performance of each network in the presence of missing features. For backpropagation we used a numerical approximation to equation (9). For both Gaussian basis function networks we used equation (6). To test the networks we randomly picked samples from the test set and deleted random features. We calculated a performance score as the percentage of samples where the correct class was ranked as one of the top two classes. Figure 3 displays the results. For comparison we also tested each classifier by substituting the mean value of each missing feature and using the normal update equation.

[Figure 3: performance (%) versus the number of missing features (0-6), one curve per method, including Gauss-RBF, Gauss-G-MEAN, BP-MEAN, and RBF-MEAN.]
Figure 3. The performance of various classifiers when dealing with missing features. Each data point denotes an average over 1000 random samples from an independent test set. For each sample, random features were considered missing. Each graph plots the percentage of samples where the correct class was one of the top two classes.

As predicted by the theory, the performance of Gauss-RBF using (6) was consistently better than the others. The fact that BP and Gauss-G performed poorly indicates that the distribution of the features must be taken into account. The fact that using the mean value is insufficient indicates that the integration step must also be carried out. Perhaps most encouraging is the result that even with 50% of the features missing, Gauss-RBF ranks the correct class among the top two 90% of the time. This clearly shows that a significant amount of information can be extracted even with a large number of missing features.
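For reference, our reconstruction of the scoring procedure behind Figure 3 is sketched below. The paper specifies random feature deletion and top-two ranking; the helper names and structure are ours.

```python
# Reconstruction of the Figure 3 protocol: delete n_missing random features
# per test sample and count a hit when the true class ranks in the top two.
# predict(x, known) is any handler above, e.g. gbf_posteriors with its
# trained parameters bound.
import numpy as np

def top_two_score(predict, X_test, y_test, n_missing, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for x, y in zip(X_test, y_test):
        known = np.ones(len(x), dtype=bool)
        drop = rng.choice(len(x), size=n_missing, replace=False)
        known[drop] = False                      # these features go missing
        top_two = np.argsort(predict(x, known))[-2:]
        hits += int(y in top_two)                # correct class in top two?
    return 100.0 * hits / len(X_test)
```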
6.2 PERFORMANCE WITH NOISY FEATURES

We also tested the performance of each network in the presence of noisy features. We randomly picked samples from the test set and added uniform noise to random features. The noise interval was calculated as [x_i - 2\sigma_i, x_i + 2\sigma_i], where x_i is the feature value and \sigma_i is the standard deviation of that feature over the training set. For BP we used equation (9) and for the GBF networks we used equation (7). Figure 4 displays the results. For comparison we also tested each classifier by substituting the noisy value of each noisy feature and using the normal update equation (RBF-N, BP-N, and Gauss-GN). As with missing features, the performance of Gauss-RBF was significantly better than the others when a large number of features were noisy.

[Figure 4: performance (%) versus the number of noisy features (0-6), one curve per method: Gauss-RBF, Gauss-G, BP, Gauss-GN, BP-N, and RBF-N.]
Figure 4. As in Figure 3 except that the performance with noisy features is plotted.

7 DISCUSSION

The results demonstrate the advantages of estimating the input distribution and integrating over the missing dimensions, at least on this task. They also show that good classification performance alone does not guarantee good missing feature performance. (Both BP and Gauss-G performed better than Gauss-RBF on the test set.) To get the best of both worlds one could use a hybrid technique utilizing separate density estimators and classifiers, although this would probably require equations (1) and (3) to be numerically integrated; a sketch of this idea follows at the end of this section.

One way to improve the performance of BP and Gauss-G might be to use a training set that contained missing features. Given the unusual distributions that arise in vision, in order to guarantee accuracy such a training set should include every possible combination of missing features. In addition, for each such combination, enough patterns must be included to accurately estimate the posterior density. In general this type of training is intractable since the number of combinations is exponential in the number of features (with the 10 features used here, 2^10 = 1024 combinations). Note that if the input distribution is available (as in Gauss-RBF), then such a training scenario is unnecessary.
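A minimal sketch of the hybrid mentioned above follows: any classifier paired with a separately trained density model, with equation (1) integrated numerically over the missing features. Both models and the sampling grids are placeholders; the paper proposes the combination but does not implement it.

```python
# Sketch of the suggested hybrid: a separate density model supplies p(x)
# while any classifier supplies p(C | x); equation (1) is then integrated
# numerically over the missing dimensions.
import numpy as np
from itertools import product

def hybrid_posteriors(classifier, density, x, known, grids):
    """Integrate classifier(x) * density(x) over all combinations of
    candidate values for the missing features (equation (1), up to the
    constant 1/p(x_c)). `grids` lists a 1-d grid per missing feature."""
    missing = np.flatnonzero(~known)
    num, den = 0.0, 0.0
    for values in product(*grids):              # grid over missing dimensions
        x_full = np.array(x, dtype=float)
        x_full[missing] = values
        w = density(x_full)                     # joint density weight
        num += w * classifier(x_full)
        den += w
    return num / den
```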
Acknowledgements

We thank D. Goryn, C. Maggioni, S. Omohundro, A. Stokke, and R. Schuster for helpful discussions, and especially B. Wirtz for providing the computer hand model. V.T. is supported in part by a grant from the Bundesministerium für Forschung und Technologie.

References

A.P. Dempster, N.M. Laird, and D.B. Rubin. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. Ser. B, 39:1-38.

R.O. Duda and P.E. Hart. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons, New York.

J. Moody and C. Darken. (1988) Learning with localized receptive fields. In: D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, CA.

S. Nowlan. (1990) Maximum Likelihood Competitive Learning. In: Advances in Neural Information Processing Systems 2, pages 574-582.

W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. (1986) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK.

M.D. Richard and R.P. Lippmann. (1991) Neural Network Classifiers Estimate Bayesian a posteriori Probabilities. Neural Computation, 3:461-483.

M. Roscheisen, R. Hofmann, and V. Tresp. (1992) Neural Control for Rolling Mills: Incorporating Domain Theories to Overcome Data Deficiency. In: Advances in Neural Information Processing Systems 4, pages 659-666.