{"title": "Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 400, "page_last": 406, "abstract": null, "full_text": "Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks \n\nYoshua Bengio \n\nDept. IRO \n\nUniversite de Montreal \n\nMontreal, Qc, Canada, H3C 3J7 \n\nbengioy@iro.umontreal.ca \n\nSamy Bengio * \n\nIDIAP \n\nCP 592, rue du Simplon 4, \n1920 Martigny, Switzerland \n\nbengio@idiap.ch \n\nAbstract \n\nThe curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network. \n\n1  Introduction \nThe curse of dimensionality hits particularly hard on models of high-dimensional discrete data because there are many more possible combinations of the values of the variables than can possibly be observed in any data set, even the large data sets now common in data-mining applications. 
In this paper we deal in particular with multivariate discrete data, where one tries to build a model of the distribution of the data. Such a model can be used, for example, to detect anomalous cases in data-mining applications, or to model the class-conditional distribution of some observed variables in order to build a classifier. A simple multinomial maximum likelihood model would give zero probability to all of the combinations not encountered in the training set, i.e., it would most likely give zero probability to most out-of-sample test cases. Smoothing the model by assigning the same non-zero probability to all the unobserved cases would not be satisfactory either, because it would not provide much generalization from the training set. Such smoothing could be obtained with a multivariate multinomial model whose parameters $\\theta$ are estimated by the maximum a-posteriori (MAP) principle, i.e., those that have the greatest probability given the training data $D$, using a diffuse prior $P(\\theta)$ (e.g., Dirichlet) on the parameters. \n\nA graphical model or Bayesian network [6, 5] represents the joint distribution of random variables $Z_1 \\ldots Z_n$ with \n\n$$P(Z_1 \\ldots Z_n) = \\prod_{i=1}^{n} P(Z_i \\mid \\mathrm{Parents}_i)$$ \n\nwhere $\\mathrm{Parents}_i$ is the set of random variables which are called the parents of variable $i$ in the graphical model because they directly condition $Z_i$; in the graphical model an arrow is drawn to $Z_i$ from each of its parents. A fully connected \"left-to-right\" graphical model is illustrated in Figure 1 (left), which corresponds to the model \n\n$$P(Z_1 \\ldots Z_n) = \\prod_{i=1}^{n} P(Z_i \\mid Z_1 \\ldots Z_{i-1}). \\qquad (1)$$ \n\n*Part of this work was done while S.B. was at CIRANO, Montreal, Qc, Canada. \n\nFigure 1: Left: a fully connected \"left-to-right\" graphical model. 
Right: the architecture of a neural network that simulates a fully connected \"left-to-right\" graphical model. The observed values $Z_i = z_i$ are encoded in the corresponding input unit group. $h_i$ is a group of hidden units. $g_i$ is a group of output units, which depend on $z_1 \\ldots z_{i-1}$, representing the parameters of a distribution over $Z_i$. These conditional probabilities $P(Z_i \\mid Z_1 \\ldots Z_{i-1})$ are multiplied to obtain the joint distribution. \n\nNote that this representation depends on the ordering of the variables (in that all previous variables in this order are taken as parents). We call each combination of the values of $\\mathrm{Parents}_i$ a context. In the \"exact\" model (with the full table of all possible contexts) all the orders are equivalent, but if approximations are used, different predictions could be made by different models assuming different orders. \n\nIn graphical models, the curse of dimensionality shows up in the representation of conditional distributions $P(Z_i \\mid \\mathrm{Parents}_i)$ where $Z_i$ has many parents. If $Z_j \\in \\mathrm{Parents}_i$ can take $n_j$ values, there are $\\prod_j n_j$ different contexts in which one would like to estimate the distribution of $Z_i$. This serious problem has been addressed in the past by two types of approaches, which are sometimes combined: \n\n1. Not modeling all the dependencies between all the variables: this is the approach mainly taken with most graphical models or Bayes networks [6, 5]. The set of independencies can be assumed using a-priori or human expert knowledge or can be learned from data. See also [2], in which the set $\\mathrm{Parents}_i$ is restricted to at most one element, which is chosen to maximize the correlation with $Z_i$. \n\n2. 
Approximating the mathematical form of the joint distribution with a form that takes into account only dependencies of lower order, or only some of the possible dependencies, e.g., with the Rademacher-Walsh expansion or multi-binomial [1, 3], which is a low-order polynomial approximation of a full joint binomial distribution (and is used in the experiments reported in this paper). \n\nThe approach we are putting forward in this paper is mostly of the second category, although we are using simple non-parametric statistics of the dependency between pairs of variables to further reduce the number of required parameters. \n\nIn the multi-binomial model [3], the joint distribution of a set of binary variables is approximated by a polynomial. Whereas the \"exact\" representation of $P(Z_1 = z_1, \\ldots, Z_n = z_n)$ as a function of $z_1 \\ldots z_n$ is a polynomial of degree $n$, it can be approximated with a lower degree polynomial, and this approximation can be easily computed using the Rademacher-Walsh expansion [1] (or other similar expansions, such as the Bahadur-Lazarsfeld expansion [1]). Therefore, instead of having $2^n$ parameters, the approximated model for $P(Z_1, \\ldots, Z_n)$ only requires $O(n^k)$ parameters. Typically, order $k = 2$ is used. The model proposed here also requires $O(n^2)$ parameters, but it can model dependencies among tuples of more than 2 variables at a time. \n\nIn previous related work by Frey [4], a fully-connected graphical model is used (see Figure 1, left) but each of the conditional distributions is represented by a logistic, which takes into account only first-order dependencies between the variables: \n\n$$P(Z_i = 1 \\mid Z_1 \\ldots Z_{i-1}) = \\frac{1}{1 + \\exp(-w_0 - \\sum_{j<i} w_j Z_j)}.$$ \n\nIn this paper, we basically extend Frey's idea to using a neural network with a hidden layer, with a particular architecture, allowing multinomial or continuous variables, and we propose to prune down the network weights. Frey has named his model a Logistic Autoregressive Bayesian Network or LARC. He argues that the prior variances on the logistic weights (which correspond to inverse weight decays) should be chosen inversely proportional to the number of conditioning variables (i.e., the number of inputs to the particular output neuron). The model was tested on a task of learning to classify digits from 8x8 binary pixel images. Models with different orderings of the variables were compared and did not yield significant differences in performance. When averaging the predictive probabilities from 10 different models obtained by considering 10 different random orderings, Frey obtained small improvements in likelihood but not in classification. The model performed as well as or better than the other models tested: CART, naive Bayes, K-nearest neighbors, and various Bayesian models with hidden variables (Helmholtz machines). These results are impressive, taking into account the simplicity of the LARC model. \n\n2  Proposed Architecture \nThe proposed architecture is a \"neural network\" implementation of a graphical model where all the variables are observed in the training set, with the hidden units playing a significant role in sharing parameters across different conditional distributions. Figure 1 (right) illustrates the model in the simpler case of a fully connected (left-to-right) graphical model (Figure 1, left). The neural network represents the parametrized function \n\n$$f_\\theta(z_1, \\ldots, z_n) = \\log(\\hat{P}_\\theta(Z_1 = z_1, \\ldots, Z_n = z_n)) \\qquad (2)$$ \n\napproximating the joint distribution of the variables, with parameters $\\theta$ being the weights of the neural network. The architecture has three layers, with each layer organized in groups associated to each of the variables. The above log-probability is computed as the sum of conditional log-probabilities \n\n$$f_\\theta(z_1, \\ldots, z_n) = \\sum_{i=1}^{n} \\log(P(Z_i = z_i \\mid g_i(z_1, \\ldots, z_{i-1})))$$ \n\nwhere $g_i(z_1, \\ldots, z_{i-1})$ is the vector-valued output of the $i$-th group of output units, and it gives the value of the parameters of the distribution of $Z_i$ when $Z_1 = z_1, Z_2 = z_2, \\ldots, Z_{i-1} = z_{i-1}$. For example, in the ordinary discrete case, $g_i$ may be the vector of probabilities associated with each of the possible values of the multinomial random variable $Z_i$. In this case, we have \n\n$$P(Z_i = i' \\mid g_i) = g_{i,i'}.$$ \n\nIn this example, a softmax output for the $i$-th group may be used to force these parameters to be positive and sum to 1, i.e., \n\n$$g_{i,i'} = \\frac{e^{g'_{i,i'}}}{\\sum_{i''} e^{g'_{i,i''}}}$$ \n\nwhere $g'_{i,i'}$ are linear combinations of the hidden unit outputs, with $i'$ ranging over the number of elements of the parameter vector associated with the distribution of $Z_i$ (for a fixed value of $Z_1 \\ldots Z_{i-1}$). To guarantee that the functions $g_i(z_1, \\ldots, z_{i-1})$ only depend on $z_1 \\ldots z_{i-1}$ and not on any of $z_i \\ldots z_n$, the connectivity structure of the hidden units must be constrained as follows: \n\n$$g'_{i,i'} = b_{i,i'} + \\sum_{j \\le i} \\sum_{j'=1}^{m_j} w_{i,i',j,j'} h_{j,j'}$$ \n\nwhere the $b$'s are biases and the $w$'s are weights of the output layer, and $h_{j,j'}$ is the output of the $j'$-th unit (out of $m_j$ such units) in the $j$-th group of hidden layer nodes. 
It may be computed as follows: \n\n$$h_{j,j'} = \\tanh(c_{j,j'} + \\sum_{k<j} \\sum_{k'=1}^{n_k} v_{j,j',k,k'} z_{k,k'})$$ \n\nwhere the $c$'s are biases and the $v$'s are the weights of the hidden layer, and $z_{k,k'}$ is the $k'$-th element of the vectorial input representation of the value $Z_k = z_k$. For example, in the binary case ($Z_i = 0$ or 1) we have used only one input node, i.e., \n\n$$Z_i \\text{ binomial} \\to z_{i,0} = z_i$$ \n\nand in the multinomial case we use the one-hot encoding, \n\n$$Z_i \\in \\{0, 1, \\ldots, n_i - 1\\} \\to z_{i,i'} = \\delta_{z_i,i'}$$ \n\nwhere $\\delta_{i,i'} = 1$ if $i = i'$ and 0 otherwise. The input layer has $n - 1$ groups because the value $Z_n = z_n$ is not used as an input. The hidden layer also has $n - 1$ groups corresponding to the variables $j = 2$ to $n$ (since $P(Z_1)$ is represented unconditionally in the first output group, its corresponding group does not need any hidden units or inputs, but just has biases). \n\n2.1  Discussion \nThe number of free parameters of the model is $O(n^2 H)$ where $H = \\max_j m_j$ is the maximum number of hidden units per hidden group (i.e., associated with one of the variables). This is basically quadratic in the number of variables, like the multi-binomial approximation that uses a polynomial expansion of the joint distribution. However, as $H$ is increased, representation theorems for neural networks suggest that we should be able to approximate the true joint distribution with arbitrary precision. Of course the true limiting factor is the amount of data, and $H$ should be tuned according to the amount of data. In our experiments we have used cross-validation to choose a value of $m_j = H$ for all the hidden groups. In this sense, this neural network representation of $P(Z_1 \\ldots Z_n)$ is to the polynomial expansions (such as the multi-binomial) what ordinary multilayer neural networks for function approximation are to polynomial function approximators. 
It can capture high-order dependencies, but not all of them. It is the number of hidden units that controls \"how many\" such dependencies will be captured, and it is the data that \"chooses\" which of the actual dependencies are most useful in maximizing the likelihood. \n\nUnlike Bayesian networks with hidden random variables, learning with the proposed architecture is very simple, even when there are no conditional independencies. To optimize the parameters we have simply used gradient-based optimization methods, either conjugate gradient or stochastic (on-line) gradient, to maximize the total log-likelihood, which is the sum of the values of $f_\\theta$ (eq. 2) for the training examples. A prior on the parameters can be incorporated in the cost function and the MAP estimator can be obtained just as easily, by maximizing the total log-likelihood plus the log-prior on the parameters. In our experiments we have used a \"weight decay\" penalty inspired by the analysis of Frey [4], with a penalty proportional to the number of weights incoming into a neuron. \n\nHowever, it is not so clear how the distribution could be generally marginalized, except by summing over possibly many combinations of the values of the variables to be integrated out. Another related question is whether one could deal with missing values: if the total number of values that the missing variables can take is reasonably small, then one can sum over these values in order to obtain a marginal probability and maximize this probability. If some variables have more systematically missing values, they can be put at the end of the variable ordering, and in this case it is very easy to compute the marginal distribution (by taking only the product of the output probabilities up to the missing variables). 
Similarly, one can easily compute the predictive distribution of the last variable given the first $n - 1$ variables. \n\nThe framework can be easily extended to hybrid models involving both continuous and discrete variables. In the case of continuous variables, one has to choose a parametric form for the distribution of the continuous variable when all its parents (i.e., the conditioning context) are fixed. For example one could use a normal, log-normal, or mixture of normals. Instead of having softmax outputs, the $i$-th output group would compute the parameters of this continuous distribution (e.g., mean and log-variance). Another type of extension allows building a conditional distribution, e.g., to model $P(Z_1 \\ldots Z_n \\mid X_1 \\ldots X_m)$. One just adds extra input units to represent the values of the conditioning variables $X_1 \\ldots X_m$. Finally, an architectural extension that we have implemented is to allow direct input-to-output connections (still following the rules of ordering which allow $g_i$ to depend only on $z_1 \\ldots z_{i-1}$). Therefore in the case where the number of hidden units is 0 ($H = 0$) we obtain the LARC model proposed by Frey [4]. \n\n2.2  Choice of topology \nAnother type of extension of this model which we have found very useful in our experiments is to allow the user to choose a topology that is not fully connected (left-to-right). In our experiments we have used non-parametric tests to heuristically eliminate some of the connections in the network, but one could also use expert or prior knowledge, just as with regular graphical models, in order to cut down on the number of free parameters. \n\nIn our experiments we have used the Kolmogorov-Smirnov statistic as a pairwise test of statistical dependency (it works for both continuous and discrete variables). 
The statistic for variables $X$ and $Y$ is \n\n$$s = \\sqrt{l} \\sup_i | \\hat{P}(X \\le x_i, Y \\le y_i) - \\hat{P}(X \\le x_i) \\hat{P}(Y \\le y_i) |$$ \n\nwhere $l$ is the number of examples and $\\hat{P}$ is the empirical distribution (obtained by counting over the training data). We have ranked the pairs according to their value of the statistic $s$, and we have chosen those pairs for which the value of the statistic is above a threshold value $s^*$, which was chosen by cross-validation. When the pairs $\\{(Z_i, Z_j)\\}$ are chosen to be part of the model, and assuming without loss of generality that $i < j$ for those pairs, then the only connections that are kept in the network (in addition to those from the $k$-th hidden group to the $k$-th output group) are those from hidden group $i$ to output group $j$, and from input group $i$ to hidden group $j$, for every such $(Z_i, Z_j)$ pair. \n\n3  Experiments \nIn the experiments we have compared the following models: \n\n\u2022 Naive Bayes: the likelihood is obtained as a product of multinomials (one per variable). Each multinomial is smoothed with a Dirichlet prior. \n\n\u2022 Multi-Binomial (using Rademacher-Walsh expansion of order 2) [3]. Since this only handles the case of binary data, it was only applied to the DNA data set. \n\n\u2022 A simple graphical model with the same pairs of variables and variable ordering as selected for the neural network, but in which each of the conditional distributions is modeled by a separate multinomial for each conditioning context. This works only if the number of conditioning variables is small, so in the Mushroom, Audiology, and Soybean experiments we had to reduce the number of conditioning variables (following the order given by the above tests). The multinomials are also smoothed with a Dirichlet prior. 
\n\n\u2022 Neural network: the architecture described above, with or without hidden units (i.e., LARC), with or without pruning. \n\n5-fold cross-validation was used to select the number of hidden units per hidden group and the weight decay for the neural network and LARC. Cross-validation was also used to choose the amount of pruning in the neural network and LARC, and the amount of smoothing in the Dirichlet priors for the multinomials of the naive Bayes model and the simple graphical model. \n\n3.1  Results \nAll four data sets were obtained on the web from the UCI Machine Learning and STATLOG databases. Most of these are meant for classification tasks, but we have instead ignored the classification and used the data to learn a probabilistic model of all the input features. \n\n\u2022 DNA (from STATLOG): there are 180 binary features. 2000 cases were used for training and cross-validation, and 1186 for testing. \n\n\u2022 Mushroom (from UCI): there are 22 discrete features (each taking between 2 and 12 values). 4062 cases were used for training and cross-validation, and 4062 for testing. \n\n\u2022 Audiology (from UCI): there are 69 discrete features (each taking between 2 and 7 values). 113 cases are used for training and 113 for testing (the original train-test partition was 200 + 26 and we concatenated and re-split the data to obtain more significant test figures). \n\n\u2022 Soybean (from UCI): there are 35 discrete features (each taking between 2 and 8 values). 307 cases are used for training and 376 for testing. 
\n\nTable 1 shows that the proposed model yields promising results: the pruned neural network was superior to all the other models in all 4 cases, and the pairwise differences with the other models are statistically significant in all 4 cases (except on Audiology, where the difference with the network without hidden units, LARC, is not significant). \n\n4  Conclusion \nIn this paper we have proposed a new application of multi-layer neural networks to the modeling of high-dimensional distributions, in particular for discrete data (but the model could also be applied to continuous or mixed discrete / continuous data). Like the polynomial expansions [3] that have been previously proposed for handling such high-dimensional distributions, the model approximates the joint distribution with a reasonable ($O(n^2)$) number of free parameters, but unlike these it can capture high-order dependencies even when the number of parameters is small. The model can also be seen as an extension of the previously proposed auto-regressive logistic Bayesian network [4], using hidden units to capture some high-order dependencies. \n\nExperimental results on four data sets with many discrete variables are very encouraging. The comparisons were made with a naive Bayes model, with a multi-binomial expansion, with the LARC model and with a simple graphical model, showing that a neural network did significantly better in terms of out-of-sample log-likelihood in all cases. \n\nThe approach to pruning the neural network used in the experiments, based on pairwise statistical dependency tests, is highly heuristic and better results might be obtained using approaches that take into account the higher order dependencies when selecting the conditioning variables. 
Methods based on pruning the fully connected network (e.g., with a \"weight elimination\" penalty) should also be tried. Also, we have not tried to optimize the order of the variables, or to combine different networks obtained with different orders, as in [4]. \n\nModel | DNA mean (stdev) | DNA p-value | Mushroom mean (stdev) | Mushroom p-value \nnaive Bayes | 100.4 (.18) | <1e-9 | 47.00 (.29) | <1e-9 \nmulti-Binomial order 2 | 117.8 (.01) | <1e-9 | - | - \nordinary graph. model | 108.1 (.06) | <1e-9 | 44.68 (.26) | <1e-9 \nLARC | 83.2 (.24) | 7e-5 | 42.51 (.16) | <1e-9 \npruned LARC | 91.2 (.15) | <1e-9 | 43.87 (.13) | <1e-9 \nfull-conn. neural net. | 120.0 (.02) | <1e-9 | 33.58 (.01) | <1e-9 \npruned neural network | 82.9 (.21) | - | 31.25 (.04) | - \n\nModel | Audiology mean (stdev) | Audiology p-value | Soybean mean (stdev) | Soybean p-value \nnaive Bayes | 36.40 (2.9) | <1e-9 | 34.74 (1.0) | <1e-9 \nmulti-Binomial order 2 | - | - | - | - \nordinary graph. model | 16.56 (.48) | 6.8e-4 | 43.65 (.07) | <1e-9 \nLARC | 17.69 (.65) | <1e-9 | 16.95 (.35) | 5.5e-4 \npruned LARC | 16.69 (.41) | 0.20 | 19.06 (.43) | <1e-9 \nfull-conn. neural net. | 17.39 (.58) | <1e-9 | 21.65 (.43) | <1e-9 \npruned neural network | 16.37 (.45) | - | 16.55 (.27) | - \n\nTable 1: Average out-of-sample negative log-likelihood obtained with the various models on four data sets (standard deviations of the average in parentheses, and p-value for the test of the null hypothesis that a model has the same true generalization error as the pruned neural network). The pruned neural network was better than all the other models in all cases, and the pairwise difference is always statistically significant (except with respect to the pruned LARC on Audiology). \n\nReferences \n\n[1] R.R. Bahadur. A representation of the joint distribution of responses to n dichotomous items. In H. Solomon, editor, Studies in Item Analysis and Prediction, pages 158-168. Stanford University Press, California, 1961. \n\n[2] C.K. Chow. A recognition method using neighbor dependence. IRE Trans. Elec. Comp., EC-11:683-690, October 1962. \n\n[3] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973. \n\n[4] B. Frey. Graphical models for machine learning and digital communication. MIT Press, 1998. \n\n[5] Steffen L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995. \n\n[6] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. \n\n", "award": [], "sourceid": 1679, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}]}