{"title": "Features as Sufficient Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 794, "page_last": 800, "abstract": "", "full_text": "Features as  Sufficient  Statistics \n\nDepartment of Computer Science \n\nDepartment of Computer Science \n\nA. Rudra t \n\nCourant Institute \n\nNew York University \narchi~cs.nyu.edu \n\nD. Geiger \u2022 \n\nCourant Institute \n\nand Center for  Neural Science \n\nNew  York  University \ngeiger~cs.nyu.edu \n\nDepartments of Psychology and Neural Science \n\nL.  Maloney t \n\nNew  York University \nItm~cns.nyu.edu \n\nAbstract \n\nAn image is often represented by a set of detected features.  We get \nan enormous compression by representing images in this way.  Fur(cid:173)\nthermore,  we  get a  representation which  is  little affected by  small \namounts  of  noise  in  the  image.  However,  features  are  typically \nchosen  in  an  ad  hoc  manner. \ntures can be obtained using sufficient statistics.  The idea of sparse \ndata representation  naturally  arises.  We  treat  the  I-dimensional \nand 2-dimensional signal reconstruction problem to make our ideas \nconcrete. \n\n\\Ve  show  how  a  good  set  of fea(cid:173)\n\n1 \n\nIntroduction \n\nConsider an image, I, that is the result of a stochastic image-formation process.  The \nprocess depends on the precise state, f, of an environment.  The image, accordingly, \ncontains information about the environmental state f, possibly corrupted by  noise. \nWe  wish to choose feature vectors \u00a2leI)  derived from the image that summarize this \ninformation  concerning the  environment.  We  are  not  otherwise  interested  in  the \ncontents  of the  image  and  wish  to  discard  any  information  concerning the  image \nthat does not depend on the environmental state f . \n\n\u00b7Supported by NSF grant 5274883  and AFOSR grants F  49620-96-1-0159 and F  49620-\n\n96-1-0028 \n\ntpartially supported by AFOSR grants F  49620-96-1-0159  and F  49620-96-1-0028 \ntSupported by NIH grant EY08266 \n\n\fFeatures as Sufficient Statistics \n\n795 \n\nWe  develop  criteria for  choosing sets of features  (based  on information theory and \nstatistical estimation theory)  that extract from the image precisely the information \nconcerning the environmental state. \n\n2 \n\nImage Formation, Sufficient Statistics and Features \n\nAs  above,  the  image  I  is  the  realization  of  a  random  process  with  distribution \nPEn1JironmentU).  We  are interested in estimating the parameters j  of the environ(cid:173)\nmental model given the image  (compare [4]).  We  assume in the sequel that  j, the \nenvironmental parameters, are themselves  a  random  vector  with  known  prior dis(cid:173)\ntribution.  Let  \u00a2J(I)  denote a  feature  vector derived from the the image I.  Initially, \nwe  assume that \u00a2J(I)  is  a deterministic function  of I. \nFor any choice  of random  variables,  X, Y,  define[2]  the  mutual  in/ormation  of X \n:Ex,Y P(X, Y)log pf*~;:;t).  The  information  about \nand  Y  to  be  M(Xj Y)  = \nthe  environmental  parameters  contained  in  the  image  is  then  M(f;I),  while  the \ninformation  about  the  environmental  parameters  contained  in the  feature  vector \n\u00a2J(I)  is  then  M(fj \u00a2J(I)).  As  a  consequence  of  the  data  processing  inequality [2] , \nM(f; \u00a2J(I))  ~  M(f; I). \nA  vector  \u00a2J(I),  'of features  is  defined  to be  sufficient  if the  inequality  above  is  an \nequality.  
We will use the terms feature and statistic interchangeably. The definition of a sufficient feature vector above is then just the usual definition of a set of jointly sufficient statistics [2].

To summarize, a feature vector $\phi(I)$ captures all the information about the environmental state parameters f precisely when it is sufficient.¹

Graded Sufficiency: A feature vector either is or is not sufficient. For every possible feature vector $\phi(I)$, we define a measure of its failure to be sufficient: ${\rm Suff}(\phi(I)) = M(f;I) - M(f;\phi(I))$. This sufficiency measure is always non-negative and it is zero precisely when $\phi$ is sufficient. We wish to find feature vectors $\phi(I)$ for which ${\rm Suff}(\phi(I))$ is close to 0. We define $\phi(I)$ to be $\epsilon$-sufficient if ${\rm Suff}(\phi(I)) \le \epsilon$. In what follows, we will ordinarily say sufficient when we mean $\epsilon$-sufficient.

The above formulation of feature vectors as jointly sufficient statistics, maximizing the mutual information $M(f;\phi(I))$, can be expressed in terms of the Kullback-Leibler distance between the conditional distributions $P(f|I)$ and $P(f|\phi(I))$:

$$E_I\left[D\big(P(f|I)\,\|\,P(f|\phi(I))\big)\right] = M(f;I) - M(f;\phi(I)), \quad (1)$$

where the symbol $E_I$ denotes the expectation with respect to I, and D denotes the Kullback-Leibler (K-L) distance, defined by $D(f\|g) = \sum_x f(x)\log(f(x)/g(x))$.² Thus, we seek feature vectors $\phi(I)$ such that the conditional distributions $P(f|I)$ and $P(f|\phi(I))$ are close in the K-L sense, averaged across the set of images. However, this optimization for each image could lead to over-fitting.

¹ An information-theoretic framework has been adopted in neural networks by others, e.g., [5][9][6][1][8]. However, the connection between features and sufficiency is new.
² We won't prove the result here. The proof is simple and uses the Markov chain property to say that $P(f, I, \phi(I)) = P(I, \phi(I))\,P(f|I, \phi(I)) = P(I)\,P(f|I)$.

3 Sparse Data and Sufficient Statistics

The notion of sufficient statistics may be described by how much data can be removed without increasing the K-L distance between $P(f|\phi(I))$ and $P(f|I)$. Let us formulate the approach more precisely, and apply two methods to solve it.

3.1 Gaussian Noise Model and Sparse Data

We are required to construct $P(f|I)$ and $P(f|\phi(I))$. Note that according to Bayes' rule, $P(f|\phi(I)) = P(\phi(I)|f)\,P(f)/P(\phi(I))$. We will assume that the form of the model P(f) is known. In order to obtain $P(\phi(I)|f)$ we write $P(\phi(I)|f) = \sum_I P(\phi(I)|I)\,P(I|f)$.

Computing $P(f|\phi(I))$: Let us first assume that the generative process of the image I, given the model f, is Gaussian i.i.d., i.e., $P(I|f) = \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-(I_i - f_i)^2/2\sigma_i^2}$, where $i = 0, 1, \ldots, N-1$ indexes the pixels of an image of size N. Further, $P(I_i|f_i)$ is a function of $(I_i - f_i)$ and $f_i$ varies from $-\infty$ to $+\infty$, so that the normalization constant does not depend on $f_i$. Then, $P(f|I)$ can be obtained by normalizing $P(f)P(I|f)$:

$$P(f|I) = \frac{1}{Z}\left(\prod_i e^{-(f_i - I_i)^2/2\sigma_i^2}\right) P(f),$$

where Z is the normalization constant.
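As a minimal numerical sketch of this normalization (again not from the paper: the two-pixel image, the prior covariance, and the noise level are invented), one can discretize f on a grid, multiply P(f) by P(I|f), and normalize; for a Gaussian prior the closed-form posterior mean provides a check on the grid computation.

    import numpy as np

    # Illustrative sketch (invented numbers): P(f|I) obtained by normalizing P(f)P(I|f)
    # on a discretized grid of model values, for a two-pixel image and a correlated
    # Gaussian prior on f = (f0, f1).
    sigma = 0.5                                  # noise std of P(I|f)
    C_prior = np.array([[1.0, 0.8],
                        [0.8, 1.0]])             # assumed prior covariance of f
    I = np.array([0.9, 0.2])                     # observed image

    grid = np.linspace(-3.0, 3.0, 201)
    F0, F1 = np.meshgrid(grid, grid, indexing="ij")
    F = np.stack([F0, F1], axis=-1)              # all candidate models f on the grid

    Q_prior = np.linalg.inv(C_prior)
    prior = np.exp(-0.5 * np.einsum("...i,ij,...j->...", F, Q_prior, F))
    likelihood = np.exp(-((F - I) ** 2).sum(-1) / (2 * sigma**2))

    posterior = prior * likelihood
    posterior /= posterior.sum()                 # this division plays the role of 1/Z

    mean_grid = (F * posterior[..., None]).sum(axis=(0, 1))

    # Closed-form Gaussian posterior mean, as a check on the grid computation.
    Q_post = Q_prior + np.eye(2) / sigma**2
    mean_exact = np.linalg.solve(Q_post, I / sigma**2)

    print("posterior mean (grid) :", np.round(mean_grid, 3))
    print("posterior mean (exact):", np.round(mean_exact, 3))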
Let us introduce a binary decision variable $s_i = 0, 1$, which at every image pixel i decides whether or not that image pixel contains "important" information regarding the model f. Our statistic $\phi$ is actually a (multivariate) random variable generated from I according to

$$P_s(\phi|I) = \prod_i \left[(1 - s_i)\,\delta(\phi_i - I_i) + s_i\, U(\phi_i)\right].$$

This distribution gives $\phi_i = I_i$ with probability 1 (Dirac delta function) when $s_i = 0$ (data is kept) and gives $\phi_i$ uniformly distributed otherwise ($s_i = 1$, data is removed). We then have

$$P_s(\phi|f) = \int P(\phi, I|f)\, dI = \int P(I|f)\, P_s(\phi|I)\, dI = \prod_i \left[(1 - s_i)\,\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{1}{2\sigma_i^2}(\phi_i - f_i)^2} + s_i\, U(\phi_i)\right].$$

The conditional distribution of f given $\phi$ satisfies the properties that we mentioned in connection with the posterior distribution of f given I. Thus,

$$P_s(f|\phi) = \frac{1}{Z_s}\, P(f) \left(\prod_i e^{-\frac{1}{2\sigma_i^2}(\phi_i - f_i)^2 (1 - s_i)}\right), \quad (2)$$

where $Z_s$ is a normalization constant. It is also plausible to extend this model to non-Gaussian ones, by simply modifying the quadratic term $(\phi_i - f_i)^2$ and keeping the sparse data coefficient $(1 - s_i)$.

3.2 Two Methods

We can now formulate the problem of finding a feature set, or finding a sufficient statistic, in terms of the variables $s_i$ that can remove data. More precisely, we can find s by minimizing

$$E(s, I) = D\big(P(f|I)\,\|\,P_s(f|\phi(I))\big) + \lambda \sum_i (1 - s_i). \quad (3)$$

It is clear that the K-L distance is minimized when $s_i = 0$ everywhere and all the data is kept. The second term is added to drive the solution towards a minimal sufficient statistic, where the parameter $\lambda$ has to be estimated. Note that for $\lambda$ very large all the data is removed ($s_i = 1$), while for $\lambda = 0$ all the data is kept.

We can further write (3) as

$$E(s, I) = \sum_f P(f|I) \log\frac{P(f|I)}{P_s(f|\phi(I))} + \lambda \sum_i (1 - s_i) = \sum_f P(f|I) \log\left(\frac{Z_s}{Z} \prod_i e^{-\frac{1}{2\sigma_i^2}(f_i - I_i)^2 (1 - (1 - s_i))}\right) + \lambda \sum_i (1 - s_i) = \log\frac{Z_s}{Z} - E_P\left[\sum_i \frac{s_i}{2\sigma_i^2} (f_i - I_i)^2\right] + \lambda \sum_i (1 - s_i),$$

where $E_P[\cdot]$ denotes the expectation taken with respect to the distribution P. If we let $s_i$ be a continuous variable, the minimum of $E(s, I)$ will occur when

$$0 = \frac{\partial E}{\partial s_i} = \left(E_{P_s}\left[(f_i - I_i)^2\right] - E_P\left[(f_i - I_i)^2\right]\right) - \lambda. \quad (4)$$

We note that the Hessian matrix

$$H_s[i,j] = \frac{\partial^2 E}{\partial s_i \partial s_j} = E_{P_s}\left[(f_i - I_i)^2 (f_j - I_j)^2\right] - E_{P_s}\left[(f_i - I_i)^2\right] E_{P_s}\left[(f_j - I_j)^2\right] \quad (5)$$

is a covariance matrix, i.e., it is positive semi-definite. Consequently, E(s) is convex.

Continuation Method on $\lambda$: In order to solve for the optimal vector s we consider the continuation method on the parameter $\lambda$. We know that $s = 0$ for $\lambda = 0$. Then, taking derivatives of (4) with respect to $\lambda$, we obtain

$$\frac{\partial s_j}{\partial \lambda} = \sum_i H_s^{-1}[i,j].$$

It is necessary that the Hessian be invertible, i.e., the continuation method works because E is convex. The computations are expected to be mostly spent on estimating the Hessian matrix, i.e., on computing the averages $E_{P_s}[(f_i - I_i)^2 (f_j - I_j)^2]$, $E_{P_s}[(f_i - I_i)^2]$, and $E_{P_s}[(f_j - I_j)^2]$. Sometimes these averages can be computed exactly, for example for one-dimensional graph lattices. Otherwise these averages could be estimated via Gibbs sampling. The above method can be very slow, since these computations for $H_s$ have to be repeated at each increment in $\lambda$.
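Before turning to the direct method, it may help to see the objective (3) evaluated concretely. The sketch below is illustrative only: the chain length, the values of σ, μ and λ, and the candidate masks are invented, and a fixed first-order Gaussian chain prior of the kind used later in Section 4.1 is assumed. Because both P(f|I) and P_s(f|φ(I)) are then Gaussian, the K-L term of E(s, I) is available in closed form, and a few masks s can be compared directly; sweeping λ in such a setup would be the continuation method in miniature.

    import numpy as np

    # Illustrative sketch (invented parameters): E(s, I) = KL(P(f|I) || P_s(f|phi(I)))
    # + lambda * sum_i (1 - s_i), for a 1-D Gaussian chain prior.
    N, sigma, mu, lam = 16, 0.3, 2.0, 0.05
    rng = np.random.default_rng(0)
    I = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])    # a step-edge "image"
    I += 0.05 * rng.standard_normal(N)

    # Prior precision from the coupling terms mu * (f_i - f_{i-1})^2.
    Q_prior = 1e-6 * np.eye(N)                                 # tiny ridge keeps it proper
    for i in range(1, N):
        Q_prior[i, i] += 2 * mu
        Q_prior[i - 1, i - 1] += 2 * mu
        Q_prior[i, i - 1] -= 2 * mu
        Q_prior[i - 1, i] -= 2 * mu

    def posterior(keep):
        """Gaussian posterior when only pixels with keep[i] = 1 contribute data terms."""
        Q = Q_prior + np.diag(keep / sigma**2)
        m = np.linalg.solve(Q, keep * I / sigma**2)
        return m, Q

    def kl_gauss(m1, Q1, m0, Q0):
        """KL( N(m1, Q1^-1) || N(m0, Q0^-1) )."""
        C1 = np.linalg.inv(Q1)
        d = m0 - m1
        return 0.5 * (np.trace(Q0 @ C1) + d @ Q0 @ d - len(m1)
                      + np.linalg.slogdet(Q1)[1] - np.linalg.slogdet(Q0)[1])

    m_full, Q_full = posterior(np.ones(N))                      # s_i = 0 everywhere
    masks = [np.zeros(N, bool),                                 # remove nothing
             np.arange(N) % 2 == 1,                             # remove every other pixel
             ~np.isin(np.arange(N), [0, N//2 - 1, N//2, N-1])]  # keep boundaries and edge
    for removed in masks:
        keep = (~removed).astype(float)
        m_s, Q_s = posterior(keep)
        E = kl_gauss(m_full, Q_full, m_s, Q_s) + lam * removed.sum()
        print(f"removed {int(removed.sum()):2d} pixels  ->  E(s, I) = {E:.3f}")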
We therefore investigate an alternative, direct method.

A Direct Method: Our approach seeks to find a "large set" of $s_i = 1$ while maintaining a distribution $P_s(f|\phi(I))$ close to $P(f|I)$, i.e., to remove as many data points as possible. For this goal, we can investigate the marginal distribution

$$P(f_i|I) = \int df_0 \cdots df_{i-1}\, df_{i+1} \cdots df_{N-1}\, P(f|I) \propto e^{-\frac{1}{2\sigma_i^2}(f_i - I_i)^2} \int \prod_{j \ne i} df_j\, P(f) \left(\prod_{j \ne i} e^{-\frac{1}{2\sigma_j^2}(f_j - I_j)^2}\right) = P_{I_i}(f_i)\, P_{\rm eff}(f_i)$$

(after rearranging the normalization constants), where $P_{\rm eff}(f_i)$ is an effective marginal distribution that depends on all the other values of I besides the one at pixel i.

How do we decide whether $s_i = 0$ or $s_i = 1$ directly from this marginal distribution $P(f_i|I)$? The entropy of the first term, $H_{I_i}(f_i) = -\int df_i\, P_{I_i}(f_i) \log P_{I_i}(f_i)$, indicates how much $f_i$ is conditioned by the data. The larger this entropy, the less the data constrain $f_i$, and thus the less need there is to keep this data. The entropy of the second term, $H_{\rm eff}(f_i) = -\int df_i\, P_{\rm eff}(f_i) \log P_{\rm eff}(f_i)$, works in the opposite direction: the more $f_i$ is constrained by its neighbors, the lower this entropy and the less need there is to keep that data point. Thus, the decision to keep the data, $s_i = 0$, is driven by minimizing the "data" entropy $H_{I_i}(f_i)$ and maximizing the neighbor entropy $H_{\rm eff}(f_i)$. The relevant quantity is $H_{\rm eff}(f_i) - H_{I_i}(f_i)$; when this is large, the pixel is kept. Later, we will see a case where $H_{I_i}(f_i)$ is constant, so that the criterion reduces to maximizing the effective entropy. For Gaussian models, the entropy is, up to an additive constant, the logarithm of the variance, and the appropriate ratio of variances may be considered.

4 Example: Surface Reconstruction

To make this approach concrete we apply it to the problem of surface reconstruction. First we consider the 1-dimensional case and conclude that edges are the important features. Then we apply it to the two-dimensional case and conclude that junctions, followed by edges, are the important features.

4.1 1D Case: Edge Features

Various simplifications and manipulations can be applied in the case where the model f is described by a first-order Markov model, i.e., $P(f) = \prod_i P_i(f_i, f_{i-1})$. Then the posterior distribution is

$$P(f|I) = \frac{1}{Z} \prod_i e^{-\left[\frac{1}{2\sigma^2}(f_i - I_i)^2 + \mu_i (f_i - f_{i-1})^2\right]},$$

where the $\mu_i$ are smoothing coefficients that may vary from pixel to pixel according to how much intensity change occurs at pixel i, e.g., $\mu_i = \mu/(1 + \rho\, (I_i - I_{i-1})^2)$, with $\mu$ and $\rho$ to be estimated. We have assumed that the standard deviation of the noise is homogeneous, to simplify the calculations and the analysis of the direct method. Let us now consider both methods, the continuation one and the direct one, to estimate the features.

Continuation Method: Here we apply $\frac{\partial s_j}{\partial \lambda} = \sum_i H_s^{-1}[i,j]$ by computing $H_s[i,j]$, given by (5), straightforwardly. We use the Baum-Welch method [2] for Markov chains to exactly compute $E_{P_s}[(f_i - I_i)^2 (f_j - I_j)^2]$, $E_{P_s}[(f_i - I_i)^2]$, and $E_{P_s}[(f_j - I_j)^2]$.
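For the Gaussian chain these averages are also available in closed form, so no sampling is needed: under $P_s$ the residuals $u_i = f_i - I_i$ are jointly Gaussian with mean m and covariance C, and ${\rm Cov}(u_i^2, u_j^2) = 2\,C_{ij}^2 + 4\, m_i m_j C_{ij}$. The sketch below is an illustration with invented parameters, not the paper's implementation; a forward-backward pass over the chain would produce the same moments more efficiently than the dense inversion used here.

    import numpy as np

    # Illustrative sketch (invented parameters): closed-form entries of the Hessian (5)
    # for a 1-D Gaussian chain, via the Gaussian fourth-moment identity
    #   Cov(u_i^2, u_j^2) = 2 C_ij^2 + 4 m_i m_j C_ij,  with u = f - I ~ N(m, C) under P_s.
    N, sigma, mu = 12, 0.3, 2.0
    rng = np.random.default_rng(1)
    I = np.concatenate([np.zeros(N // 2), np.ones(N // 2)]) + 0.05 * rng.standard_normal(N)
    s = np.zeros(N)                      # current mask: all data kept (start of continuation)

    # Precision and mean of P_s(f | phi(I)) for the chain prior mu * (f_i - f_{i-1})^2.
    Q = 1e-6 * np.eye(N)
    for i in range(1, N):
        Q[i, i] += 2 * mu
        Q[i - 1, i - 1] += 2 * mu
        Q[i, i - 1] -= 2 * mu
        Q[i - 1, i] -= 2 * mu
    Q += np.diag((1 - s) / sigma**2)
    mean = np.linalg.solve(Q, (1 - s) * I / sigma**2)

    C = np.linalg.inv(Q)                 # posterior covariance of f (and of u = f - I)
    m = mean - I                         # posterior mean of the residuals u_i

    # Entries of (5): H_s[i, j] = E[u_i^2 u_j^2] - E[u_i^2] E[u_j^2]
    H = 2 * C**2 + 4 * np.outer(m, m) * C

    print("E_Ps[(f_i - I_i)^2] per pixel:", np.round(np.diag(C) + m**2, 4))
    print("H_s positive semi-definite?  ", bool(np.all(np.linalg.eigvalsh(H) > -1e-10)))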
Figure 1: (a) Complete results for a step edge, showing the image, the effective variance and the computed s-values (using the continuation method). (b) Complete results for a step edge with added noise.

The final result of this algorithm, applied to step-edge data (and with noise added), is shown in Figure 1. Not surprisingly, the edge data (both pixels), as well as the data boundaries, were the most important data, i.e., the features.

Direct Method: We derive the same result, namely that edges and boundaries are the most important data, via an analysis of this model. We use the result that

$$P(f_i|I) = \int df_0 \cdots df_{i-1}\, df_{i+1} \cdots df_{N-1}\, P(f|I) = \frac{1}{Z_i}\, e^{-\frac{1}{2\sigma^2}(f_i - I_i)^2}\, e^{-\lambda_i^N (f_i - \Gamma_i^N)^2},$$

where $\lambda_i^N$ is obtained recursively, in $\log_2 N$ steps (for simplicity, we assume N to be an exact power of 2), as follows:

$$\lambda_i^{2K} = \lambda_i^K + \frac{\lambda_{i+K}^K\, \mu_{i+K}^K}{\lambda_{i+K}^K + \mu_{i+K}^K + \mu_{i+2K}^K} + \frac{\lambda_{i-K}^K\, \mu_i^K}{\lambda_{i-K}^K + \mu_{i-K}^K + \mu_i^K}. \quad (6)$$

The effective variance is given by ${\rm var}_{\rm eff}(f_i) = 1/(2\lambda_i^N)$, while the data variance is given by ${\rm var}_I(f_i) = \sigma^2$. Since ${\rm var}_I(f_i)$ does not depend on the pixel i, maximizing the ratio ${\rm var}_{\rm eff}/{\rm var}_I$ (as the direct method suggested) is equivalent to maximizing either the effective variance or the total variance (see Figure 1).

Thus, the lower $\lambda_i^N$, the lower $s_i$. We note that $\lambda_i^K$ increases with K, and $\mu_i^K$ decreases with K. Consequently $\lambda_i^K$ increases less and less as K increases. In a perturbative sense, $\lambda_i^2$ contributes most to $\lambda_i^N$ and is determined by the two neighboring values $\mu_i$ and $\mu_{i+1}$, i.e., by the edge information. The larger the intensity edges, the smaller are the $\mu_i$ and therefore the smaller $\lambda_i^2$ will be. Moreover, $\lambda_i^N$ is mostly determined by $\lambda_i^2$ (in a perturbative sense, this is where most of the contribution comes from). Thus, we can argue that pixels i with intensity edges will have smaller values of $\lambda_i^N$ and are therefore likely to have their data kept as a feature ($s_i = 0$).

4.2 2D Case: Junctions, Corners, and Edge Features

Let us investigate the two-dimensional version of the 1D surface reconstruction problem. Let us assume the posterior

$$P(f|I) = \frac{1}{Z}\, e^{-\sum_{ij}\left[\frac{1}{2\sigma^2}(f_{ij} - I_{ij})^2 + \mu_{ij}^v (f_{ij} - f_{i-1,j})^2 + \mu_{ij}^h (f_{ij} - f_{i,j-1})^2\right]},$$

where $\mu_{ij}^{v,h}$ are the smoothing coefficients along the vertical and horizontal directions, which vary inversely with $\nabla I$ along these directions. We can then approximately compute (e.g., see [3])

$$P(f_{ij}|I) \approx \frac{1}{Z_{ij}}\, e^{-\frac{1}{2\sigma^2}(f_{ij} - I_{ij})^2}\, e^{-\lambda_{ij}^N (f_{ij} - \Gamma_{ij}^N)^2},$$

where, analogously to the 1D case, we have

$$\lambda_{ij}^{2K} = \lambda_{ij}^K + \frac{\lambda_{i,j-K}^K\, \mu_{ij}^{h,K}}{X_{i,j-K}^K} + \frac{\lambda_{i,j+K}^K\, \mu_{i,j+K}^{h,K}}{X_{i,j+K}^K} + \frac{\lambda_{i-K,j}^K\, \mu_{ij}^{v,K}}{X_{i-K,j}^K} + \frac{\lambda_{i+K,j}^K\, \mu_{i+K,j}^{v,K}}{X_{i+K,j}^K}, \quad (7)$$

where $X_{ij}^K = \lambda_{ij}^K + \mu_{ij}^{h,K} + \mu_{ij}^{v,K} + \mu_{i,j+K}^{h,K} + \mu_{i+K,j}^{v,K}$, and $\mu_{ij}^{h,2K} = \mu_{ij}^{h,K}\, \mu_{i,j\pm K}^{h,K} / X_{i,j\pm K}^K$ (and similarly for the vertical coefficients).
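As a numerical companion to this analysis (not code from the paper: the toy image, the grid size, and the parameters σ, μ0 and ρ are invented, and the gradient-dependent form μ = μ0/(1 + ρ(ΔI)²) is an assumption modeled on the 1D case), the effective variance at each site can also be computed directly on a small grid, by removing that pixel's own data term from the posterior precision matrix and reading off the resulting marginal variance, instead of running the multiscale recursion. Inspecting the resulting map lets one check the qualitative ranking discussed next.

    import numpy as np

    # Illustrative sketch (invented parameters): var_eff(f_ij) on a small grid, computed
    # by deleting pixel (i,j)'s own data term from the posterior precision matrix and
    # taking the marginal variance of f_ij.  Smoothing weights fall with the local
    # intensity difference, mu = mu0 / (1 + rho * (Delta I)^2)  (assumed form).
    n, sigma, mu0, rho = 10, 0.3, 2.0, 20.0
    I = np.zeros((n, n))
    I[:, n // 2:] = 1.0          # vertical edge in the lower half ...
    I[: n // 2, :] = 2.0         # ... meeting a horizontal boundary: a T-junction

    def idx(i, j):
        return i * n + j

    Q = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            Q[idx(i, j), idx(i, j)] += 1.0 / sigma**2                 # data term
            for di, dj in [(1, 0), (0, 1)]:                           # vertical / horizontal links
                if i + di < n and j + dj < n:
                    mu = mu0 / (1.0 + rho * (I[i + di, j + dj] - I[i, j]) ** 2)
                    a, b = idx(i, j), idx(i + di, j + dj)
                    Q[a, a] += 2 * mu
                    Q[b, b] += 2 * mu
                    Q[a, b] -= 2 * mu
                    Q[b, a] -= 2 * mu

    var_eff = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Qi = Q.copy()
            Qi[idx(i, j), idx(i, j)] -= 1.0 / sigma**2                # drop this pixel's data term
            var_eff[i, j] = np.linalg.inv(Qi)[idx(i, j), idx(i, j)]

    np.set_printoptions(precision=2, suppress=True)
    print(var_eff)    # the paper's analysis predicts the largest values at junctions,
                      # then corners, then edges, then smooth interior pixels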
The larger the effective variance at one site (i, j), the smaller $\lambda_{ij}^N$ and the more likely that image portion is to be a feature. The larger the intensity gradient along the horizontal or vertical direction at (i, j), the smaller $\mu_{ij}^{h,v}$; and the smaller $\mu_{ij}^{h,v}$, the smaller its contribution to $\lambda_{ij}^2$. In a perturbative sense ([3]), $\lambda_{ij}^2$ makes the largest contribution to $\lambda_{ij}^N$. Thus, the more intensity edges a site has, the larger its effective variance will be. T-junctions will therefore produce very large effective variances, followed by corners, followed by edges. These will be, in order of importance, the features selected to reconstruct 2D surfaces.

5 Conclusion

We have proposed an approach that specifies when a feature set contains sufficient information to represent the image. Thus, one can, in principle, tell what kind of feature is likely to be important in a given model. Two methods of computation have been proposed, and a concrete analysis for a simple surface reconstruction problem was carried out.

References

[1] A. Berger, S. Della Pietra and V. Della Pietra. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics, Vol. 22(1), pp. 39-71, 1996.
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley Interscience, New York, 1991.
[3] D. Geiger and J. E. Kogler. "Scaling Images and Image Features via the Renormalization Group." In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, New York, NY, 1993.
[4] G. Hinton and Z. Ghahramani. "Generative Models for Discovering Sparse Distributed Representations." To appear, Phil. Trans. of the Royal Society B, 1997.
[5] R. Linsker. "Self-Organization in a Perceptual Network." Computer, March 1988, pp. 105-117.
[6] J. Principe, University of Florida at Gainesville. Personal communication.
[7] T. Sejnowski. "Computational Models and the Development of Topographic Projections." Trends Neurosci., 10, pp. 304-305.
[8] S. C. Zhu, Y. N. Wu and D. Mumford. "Minimax Entropy Principle and Its Application to Texture Modeling." Neural Computation, 1996.
[9] P. Viola and W. M. Wells III. "Alignment by Maximization of Mutual Information." In Proceedings of the International Conference on Computer Vision, Boston, 1995.
", "award": [], "sourceid": 1334, "authors": [{"given_name": "Davi", "family_name": "Geiger", "institution": null}, {"given_name": "Archisman", "family_name": "Rudra", "institution": null}, {"given_name": "Laurance", "family_name": "Maloney", "institution": null}]}