{"title": "Learning Informative Statistics: A Nonparametnic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 900, "page_last": 906, "abstract": null, "full_text": "Learning Informative Statistics: A \n\nNonparametric Approach \n\nJohn W. Fisher III, Alexander T. IhIer, and Paul A. Viola \n\nMassachusetts Institute of Technology \n\n77 Massachusetts Ave., 35-421 \n\nCambridge, MA 02139 \n\n{jisher,ihler,viola}@ai.mit.edu \n\nAbstract \n\nWe discuss an information theoretic approach for categorizing and mod(cid:173)\neling dynamic processes. The approach can learn a compact and informa(cid:173)\ntive statistic which summarizes past states to predict future observations. \nFurthermore, the uncertainty of the prediction is characterized nonpara(cid:173)\nmetrically by a joint density over the learned statistic and present obser(cid:173)\nvation. We discuss the application of the technique to both noise driven \ndynamical systems and random processes sampled from a density which \nis conditioned on the past. In the first case we show results in which both \nthe dynamics of random walk and the statistics of the driving noise are \ncaptured. In the second case we present results in which a summarizing \nstatistic is learned on noisy random telegraph waves with differing de(cid:173)\npendencies on past states. In both cases the algorithm yields a principled \napproach for discriminating processes with differing dynamics and/or de(cid:173)\npendencies. The method is grounded in ideas from information theory \nand nonparametric statistics. \n\n1 Introduction \n\nNoisy dynamical processes abound in the world - human speech, the frequency of sun \nspots, and the stock market are common examples. These processes can be difficult to \nmodel and categorize because current observations are dependent on the past in complex \nways. Classical models come in two sorts: those that assume that the dynamics are linear \nand the noise is Gaussian (e.g. 
Wiener etc.); and those that assume that the dynamics are discrete (e.g. HMMs). These approaches are wildly popular because they are tractable and well understood. Unfortunately there are many processes where the underlying theoretical assumptions of these models are false. For example we may wish to analyze a system with linear dynamics and non-Gaussian noise, or we may wish to model a system with an unknown number of discrete states. \n\nWe present an information-theoretic approach for analyzing stochastic dynamic processes which can model simple processes like those mentioned above, while retaining the flexibility to model a wider range of more complex processes. The key insight is that we can often learn a simplifying informative statistic of the past from samples using nonparametric estimates of both entropy and mutual information. Within this framework we can predict future states and, of equal importance, characterize the uncertainty accompanying those predictions. This nonparametric model is flexible enough to describe uncertainty which is more complex than second-order statistics. In contrast, techniques which use squared prediction error to drive learning are focused on the mode of the distribution. \n\nTaking an example from financial forecasting: while the most likely sequence of pricing events is of interest, one would also like to know the accompanying distribution of price values (i.e. even if the most likely outcome is appreciation in the price of an asset, knowledge of a lower, but not insignificant, probability of depreciation is also valuable). Towards that end we describe an approach that allows us to simultaneously learn the dependencies of the process on the past as well as the uncertainty of future states. Our approach is novel in that we fold in concepts from information theory, nonparametric statistics, and learning. 
\n\nIn the two types of stochastic processes we will consider, the challenge is to summarize the past in an efficient way. In the absence of a known dynamical or probabilistic model, can we learn an informative statistic (ideally a sufficient statistic) of the past which minimizes our uncertainty about future states? In the classical linear state-space approach, uncertainty is characterized by mean squared error (MSE), which implicitly assumes Gaussian statistics. There are, however, linear systems with interesting behavior due to non-Gaussian statistics which violate the assumption underlying MSE. There are also nonlinear systems and purely probabilistic processes which exhibit complex behavior and are poorly characterized by mean squared error and/or the assumption of Gaussian noise. \n\nOur approach is applicable to both types of processes. Because it is based on nonparametric statistics we characterize the uncertainty of predictions in a very general way: by a density of possible future states. Consequently the resulting system captures both the dynamics of the system (through a parameterization) and the statistics of the driving noise (through a nonparametric model). The model can then be used to classify new signals and make predictions about the future. \n\n2 Learning from Stationary Processes \n\nIn this paper we will consider two related types of stochastic processes, depicted in figure 1. These processes differ in how current observations are related to the past. The first type of process, described by the following equation, is a discrete time dynamical (possibly nonlinear) system: \n\nx_k = G({x_{k-1}}_N; w_g) + η_k    (1) \n\nwhere x_k, the state of the process at time k, is a function of the N previous states and the present value of η_k. In general the sequence {x_k} is not stationary (in the strict sense); however, under fairly mild conditions on {η_k}, namely that {η_k} is a sequence of i.i.d. 
\nrandom variables (which we will always assume to be true), the sequence \n\nε_k = x_k − G({x_{k-1}}_N; w_g)    (2) \n\nis stationary, where {x_k}_N denotes {x_k, ..., x_{k−(N−1)}}. Often termed an innovation sequence, for our purpose the stationarity of (2) will suffice. This leads to a prediction framework for estimating the dynamical parameters, w_g, of the system, to which we will adjoin a nonparametric characterization of uncertainty. \n\nThe second type of process we consider is described by a conditional probability density: \n\nx_k ∼ p(x_k | {x_{k-1}}_N).    (3) \n\nIn this case it is only the conditional statistics of {x_k} that we are concerned with, and they are, by definition, constant. \n\nFigure 1: Two related systems: (a) dynamical system driven by stationary noise and (b) probabilistic system dependent on the finite past. Dotted box indicates the source of the stochastic process, while the solid box indicates the learning algorithm. \n\n3 Learning Informative Statistics with Nonparametric Estimators \n\nWe propose to determine the system parameters by minimizing the entropy of the error residuals for systems of type (a). Parametric entropy optimization approaches have been proposed (e.g. [4]); the novelty of our approach, however, is that we estimate entropy nonparametrically. That is, \n\nŵ_g = arg min_{w_g} Ĥ(x_k − G({x_{k-1}}_N; w_g)) = arg min_{w_g} Ĥ(ε_k),    (4) \n\nwhere the differential entropy integral is approximated using a function of the Parzen kernel density estimator [5] (in all experiments we use the Gaussian kernel). It can be shown that minimizing the entropy of the error residuals is equivalent to maximizing their likelihood [1]. In this light, the proposed criterion is seeking the maximum likelihood estimate of the system parameters using a nonparametric description of the noise density. 
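As an illustration of the kind of estimator involved — a minimal sketch in Python, not the authors' implementation; the kernel width h, sample size, and test distribution are assumptions made here — a leave-one-out Parzen-window estimate of differential entropy can be written as:

```python
import numpy as np

def parzen_entropy(samples, h=0.25):
    """Leave-one-out Parzen (Gaussian-kernel) estimate of differential entropy.

    H(p) = -E[log p(x)] is approximated by -mean(log p_hat(x_i)), where
    p_hat(x_i) is the kernel density estimate built from the other samples.
    """
    x = np.asarray(samples, dtype=float)
    n = len(x)
    diff = x[:, None] - x[None, :]                       # pairwise differences
    k = np.exp(-0.5 * (diff / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(k, 0.0)                             # exclude the self term
    p_hat = k.sum(axis=1) / (n - 1)                      # LOO density at each x_i
    return -np.mean(np.log(p_hat))

# For unit-variance Gaussian samples the true differential entropy is
# 0.5 * log(2 * pi * e) ~ 1.42 nats; the estimate should land nearby.
rng = np.random.default_rng(0)
h_est = parzen_entropy(rng.standard_normal(2000))
```

Minimizing such an estimate of the residual entropy with respect to the system parameters (e.g. by gradient descent) is the spirit of the criterion above; how to choose the bandwidth h is a separate, nontrivial question.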
Consequently, we solve for the system parameters and the noise density jointly. \n\nWhile there is no explicit dynamical system in the second system type, we do assume that the conditional statistics of the observed sequence are constant (or at worst slowly changing, for an on-line learning algorithm). In this case we desire to minimize the uncertainty of predictions of future samples by summarizing information from the past. The challenge is to do so efficiently via a function of recent samples. Ideally we would like to find a sufficient statistic of the past; however, without an explicit description of the density we opt instead for an informative statistic. By informative statistic we simply mean one which reduces the conditional entropy of future samples. If the statistic were sufficient then the mutual information would have reached a maximum [1]. As in the previous case, we propose to find such a statistic by maximizing the nonparametric mutual information as defined by \n\nŵ_f = arg max_{w_f} Î(x_k, F({x_{k-1}}_N; w_f))    (5) \n    = arg max_{w_f} Ĥ(x_k) + Ĥ(F({x_{k-1}}_N; w_f)) − Ĥ(x_k, F({x_{k-1}}_N; w_f))    (6) \n    = arg min_{w_f} Ĥ(x_k | F({x_{k-1}}_N; w_f)).    (7) \n\nBy equation 6 this is equivalent to optimizing the joint and marginal entropies (which we do in practice) or, by equation 7, minimizing the conditional entropy. \n\nWe have previously presented two related methods for incorporating kernel based density estimators into an information theoretic learning framework [2, 3]. We chose the method of [3] because it provides an exact gradient of an approximation to entropy, but more importantly can be converted into an implicit error function, thereby reducing computational cost. \n\n4 Distinguishing Random Walks: An Example \n\nIn random walk the feedback function is G({x_{k-1}}_1) = x_{k-1}. The noise is assumed to be independent and identically distributed (i.i.d.). 
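The two random-walk processes used in this section are easy to simulate. The following sketch (illustrative Python; the seeds and sample counts are choices made here, not taken from the paper) draws both the Gaussian-driven and bimodal-driven walks:

```python
import numpy as np

def random_walk(n, noise, rng):
    """x_k = x_{k-1} + eta_k, starting from zero, with i.i.d. increments."""
    return np.cumsum(noise(rng, n))

def gaussian_noise(rng, n):
    return rng.standard_normal(n)                  # eta_k ~ N(0, 1)

def bimodal_noise(rng, n):
    # (1/2) N(+0.95, 0.3^2) + (1/2) N(-0.95, 0.3^2): zero mean, ~unit variance
    signs = rng.choice([-1.0, 1.0], size=n)
    return 0.95 * signs + 0.3 * rng.standard_normal(n)

rng = np.random.default_rng(1)
xg = random_walk(100, gaussian_noise, rng)         # Gaussian-driven walk
xb = random_walk(100, bimodal_noise, rng)          # bimodal-driven walk

# Both increment distributions agree to second order, which is why an
# MMSE/Gaussian model cannot tell the two walks apart.
var_b = bimodal_noise(rng, 50_000).var()
```

Because the increments of both walks are zero-mean with (near) unit variance, any discrimination must come from higher-order structure of the noise density — exactly what the nonparametric estimate captures.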
Although the sequence x_k is non-stationary, the increments (x_k − x_{k-1}) are stationary. In this context, estimating the statistics of the residuals allows for discrimination between two random walk processes with differing noise densities. Furthermore, as we will demonstrate empirically, even when one of the processes is driven by Gaussian noise (an implicit assumption of the MMSE criterion), such knowledge may not be sufficient to distinguish one process from another. \n\nFigure 2 shows two random walk realizations and their associated noise densities (solid lines). One is driven by Gaussian noise (η_k ∼ N(0, 1)), while the other is driven by a bi-modal mixture of Gaussians (η_k ∼ ½N(0.95, 0.3) + ½N(−0.95, 0.3); note that both densities are zero-mean and unit variance). During learning, the process was modeled as fifth-order auto-regressive (AR5). One hundred samples were drawn from a realization of each type and the AR parameters were estimated using the standard MMSE approach and the approach described above. With regard to parameter estimation, both methods (as expected) yield essentially the same parameters, with the first coefficient being near unity and the remaining coefficients being near zero. \n\nWe are interested in the ability to distinguish one process from another. As mentioned, the current approach jointly estimates the parameters of the system as well as the density of the noise. The nonparametric estimates are shown in figure 2 (dotted lines). These estimates are then used to compute the accumulated average log-likelihood L(ε_k) = (1/k) Σ_{i=1..k} log p(ε_i) of the residual sequence (ε_k ≈ η_k) under the known and learned densities (figure 3). It is striking (but not surprising) that L(ε_k) of the bi-modal mixture under the Gaussian model (dashed lines, top) does not differ significantly from the Gaussian driven increments process (solid lines, top). 
The explanation follows from the fact that \n\nE[L(ε)] = −H(p_f(ε)) − D(p_f(ε) || p(ε)),    (8) \n\nwhere p_f(ε) is the true density of ε (bi-modal), p(ε) is the assumed density of the likelihood test (unit-variance Gaussian), and D(·||·) is the Kullback-Leibler divergence [1]. In this case, D(p_f(ε) || p(ε)) is relatively small (not true for D(p(ε) || p_f(ε))) and H(p_f(ε)) is less than the entropy of the unit-variance Gaussian (for fixed variance, the Gaussian density has maximum entropy). The consequence is that the likelihood test under the Gaussian assumption does not reliably distinguish the two processes. The likelihood test under the bi-modal density or its nonparametric estimate (figure 3, bottom) does distinguish the two. \n\nThe method described is not limited to linear dynamic models. It can certainly be used for nonlinear models, so long as the dynamics can be well approximated by differentiable functions. Examples for multi-layer perceptrons are described in [3]. \n\n5 Learning the Structure of a Noisy Random Telegraph Wave \n\nA noisy random telegraph wave (RTW) can be described by figure 1(b). Our goal is not to demonstrate that we can analyze random telegraph waves, but rather that we can robustly learn an informative statistic of the past for such a process. We define a noisy random telegraph wave as a sequence x_k ∼ N(μ_k, σ), where μ_k ∈ {±μ} is binomially distributed with \n\nP{μ_k = −μ_{k-1}} = a · (1/N) Σ_{i=1..N} |x_{k−i}|,    (9) \n\nwhere N(μ_k, σ) is Gaussian and a < 1. This process is interesting because the parameters are random functions of a nonlinear combination of the set {x_k}_N. Depending on the value of N, we observe different switching dynamics. Figure 4 shows examples of such signals for 
\nFigure 2: Random walk examples (left), comparison of known to learned densities (right). \n\nFigure 3: L(ε_k) under known models (left) as compared to learned models (right). \n\nN = 20 (left) and N = 4 (right). Rapid switching dynamics are possible for both signals, while N = 20 has periods with longer duration than N = 4. \n\nFigure 4: Noisy random telegraph wave: N = 20 (left), N = 4 (right). \n\nIn our experiments we learn an informative statistic which has the form \n\nF({x_k}_past) = g(Σ_{i=1..M} w_{f,i} x_{k−i}),    (10) \n\nwhere g(·) is the hyperbolic tangent function (i.e. F(·) is a one-layer perceptron). Note that a multi-layer perceptron could also be used [3]. \n\nIn our experiments we train on 100 samples of noisy RTW(N=20) and RTW(N=4). We then learn statistics for each type of process using M = {4, 5, 15, 20, 25}. This tests for situations in which the depth is both under-specified and over-specified (as well as perfectly specified). We will denote F_N({x_k}_M) as the statistic which was trained on an RTW(N) process with a memory depth of M. \n\nFigure 5: Comparison of Wiener filter (top) and nonparametric approach (bottom) for synthesis. \n\nFigure 6: Informative statistics for noisy random telegraph waves. M = 25, trained on N equal 4 (left) and 20 (right). \n\nSince we implicitly learn a joint density over (x_k, F_N({x_k}_M)), synthesis is possible by sampling from that density. Figure 5 compares synthesis using the described method (bottom) to a Wiener filter (top) estimated over the same data. The results using the information theoretic approach (bottom) preserve the structure of the RTW while the Wiener filter results do not. 
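The one-layer-perceptron statistic used here is just a weighted sum of the last M samples squashed by a hyperbolic tangent. A minimal sketch (illustrative Python; the weights below are random stand-ins, whereas the paper learns them by maximizing the nonparametric mutual information):

```python
import numpy as np

def informative_statistic(x_past, w):
    """One-layer perceptron statistic: F({x}past) = tanh(sum_i w_i * x_{k-i}).

    Collapses the last M observations into a single bounded scalar summary.
    """
    return np.tanh(np.dot(w, x_past))

rng = np.random.default_rng(4)
M = 25                                  # memory depth of the statistic
w = 0.1 * rng.standard_normal(M)        # illustrative weights (learned in the paper)
x_past = rng.standard_normal(M)         # the last M observations, most recent first
f = informative_statistic(x_past, w)    # scalar in (-1, 1) summarizing the past
```

Because the summary is one-dimensional, the joint density over (x_k, F) remains a two-dimensional estimation problem regardless of M.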
This was achieved by collapsing the information of past samples into a single statistic (avoiding high-dimensional density estimation). Figure 6 shows the joint density over (x_k, F_N({x_k}_M)) for N = {4, 20} and M = 25. We see that the estimated densities are not separable, and by virtue of this fact the learned statistic conveys information about the future. Figure 7 shows results from 100 Monte Carlo trials. In this case the depth of the statistic is matched to the process. Each plot shows the accumulated conditional log-likelihood L(x_k) = (1/k) Σ_{i=1..k} log p(x_i | F_N({x_{i-1}}_M)) under the learned statistic, with error bars. Figure 8 shows similar results after varying the memory depth M = {4, 5, 15, 20, 25} of the statistic. The figures illustrate robustness to the choice of memory depth M. This is not to say that memory depth doesn't matter; that is, there must be some information to exploit, but the empirical results indicate that useful information was extracted. \n\n6 Conclusions \n\nWe have described a nonparametric approach for finding informative statistics. The approach is novel in that learning is derived from nonparametric estimators of entropy and mutual information. This allows for a means by which to 1) efficiently summarize the past, 2) predict the future, and 3) characterize the uncertainty of those predictions beyond second-order statistics. Furthermore, this was accomplished without the strong assumptions accompanying parametric approaches. \n\nFigure 7: Conditional L(x_k). Solid line indicates RTW(N=20) while dashed line indicates RTW(N=4). Thick lines indicate the average over all Monte Carlo runs while the thin lines indicate ±1 standard deviation. The left plot uses a statistic trained on RTW(N=20) while the right plot uses a statistic trained on RTW(N=4). 
\nFigure 8: Repeat of figure 7 for cases with M = {4, 5, 15, 20, 25}. Obvious breaks indicate a new set of trials. \n\nWe also presented empirical results which illustrated the utility of our approach. The example of random walk served as a simple illustration of learning a dynamic system in spite of the over-specification of the AR model. More importantly, we demonstrated the ability to learn both the dynamics and the statistics of the underlying noise process. This information was later used to distinguish realizations by their nonparametric densities, something not possible using MMSE error prediction. \n\nEven more compelling were the experiments with noisy random telegraph waves. We demonstrated the algorithm's ability to learn a compact statistic which efficiently summarized the past for process identification. The method exhibited robustness to the number of parameters of the learned statistic. For example, despite over-specifying the dependence of the memory-4 statistic in three of the cases, a useful statistic was still found. Conversely, despite the memory-20 statistic being under-specified in three of the experiments, useful information from the available past was extracted. \n\nIt is our opinion that this method provides an alternative to some of the traditional and connectionist approaches to time-series analysis. The use of nonparametric estimators adds flexibility to the class of densities which can be modeled and places less of a constraint on the exact form of the summarizing statistic. \n\nReferences \n\n[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991. \n\n[2] P. Viola et al. Empirical entropy manipulation for real world problems. In Mozer, Touretzky, and Hasselmo, editors, Advances in Neural Information Processing Systems, pages ?-?, 1996. \n\n[3] J. W. Fisher and J. C. Principe. 
A methodology for information theoretic feature extraction. In A. Stuberud, editor, Proc. of the IEEE Int. Joint Conf. on Neural Networks, pages ?-?, 1998. \n\n[4] J. Kapur and H. Kesavan. Entropy Optimization Principles with Applications. Academic Press, New York, 1992. \n\n[5] E. Parzen. On estimation of a probability density function and mode. Ann. of Math. Stats., 33:1065-1076, 1962. \n", "award": [], "sourceid": 1765, "authors": [{"given_name": "John", "family_name": "Fisher III", "institution": null}, {"given_name": "Alexander", "family_name": "Ihler", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}