{"title": "Nonlinear Markov Networks for Continuous Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 527, "abstract": null, "full_text": "Nonlinear Markov Networks for Continuous Variables \n\nReimar Hofmann and Volker Tresp* \nSiemens AG, Corporate Technology \nInformation and Communications \n81730 Munchen, Germany \n\nAbstract \n\nWe address the problem of learning structure in nonlinear Markov networks with continuous variables. This can be viewed as non-Gaussian multidimensional density estimation exploiting certain conditional independencies in the variables. Markov networks are a graphical way of describing conditional independencies and are well suited to modeling relationships which do not exhibit a natural causal ordering. We use neural network structures to model the quantitative relationships between variables. The main focus of this paper is on learning the structure for the purpose of gaining insight into the underlying process. Using two data sets we show that interesting structures can be found using our approach. Inference is briefly addressed. \n\n1 Introduction \n\nKnowledge about independence or conditional independence between variables is most helpful in \"understanding\" a domain. An intuitive representation of independencies is achieved by graphical models in which independence statements can be extracted from the structure of the graph. The two most popular types of graphical stochastic models are Bayesian networks, which use a directed graph, and Markov networks, which use an undirected graph. Whereas Bayesian networks are well suited to represent causal relationships, Markov networks are mostly used in cases where the user wants to express statistical correlation between variables. 
This is the case in image processing, where the variables typically represent the grey levels of pixels and the graph encourages smoothness in the values of neighboring pixels (Markov random fields, Geman and Geman, 1984). We believe that Markov networks might be a useful representation in many domains where the concept of cause and effect is somewhat artificial. The learned structure of a Markov network also seems to be more easily communicated to non-experts; in a Bayesian network not all arc directions can be uniquely identified based on training data alone, which makes a meaningful interpretation for the non-expert rather difficult. \n\n* Reimar.Hofmann@mchp.siemens.de, Volker.Tresp@mchp.siemens.de \n\nAs in Bayesian networks, direct dependencies between variables in Markov networks are represented by an edge between those variables, and missing edges represent independencies (in Section 2 we will be more precise about the independencies represented in Markov networks). Whereas the graphical structure in Markov networks might be known a priori in some cases, the focus of this work is the case that the structure is unknown and must be inferred from data. For both discrete variables and linear relationships between continuous variables, algorithms for structure learning exist (Whittaker, 1990). Here we address the problem of learning structure for Markov networks of continuous variables where the relationships between variables are nonlinear. In particular we use neural networks for approximating the dependency between a variable and its Markov boundary. We demonstrate that structural learning can be achieved without a direct reference to a likelihood function and show how inference in such networks can be performed using Gibbs sampling. 
From a technical point of view, these Markov boundary networks perform multidimensional density estimation for a very general class of non-Gaussian densities. \n\nIn the next section we give a mathematical description of Markov networks and a formulation of the joint probability density as a product of compatibility functions. In Section 3.1 we discuss structural learning in Markov networks based on a maximum likelihood approach and show that this approach is in general infeasible. We then introduce our approach, which is based on learning the Markov boundary of each variable. We also show how belief update can be performed using Gibbs sampling. In Section 4 we demonstrate that useful structures can be extracted from two data sets (Boston housing data, financial market) using our approach. \n\n2 Markov Networks \n\nThe following brief introduction to Markov networks is adapted from Pearl (1988). Consider a strictly positive [1] joint probability density $p(x)$ over a set of variables $X := \\{x_1, \\ldots, x_N\\}$. For each variable $x_i$, let the Markov boundary of $x_i$, $B_i \\subseteq X - \\{x_i\\}$, be the smallest set of variables that renders $x_i$ and $X - (\\{x_i\\} \\cup B_i)$ independent under $p(x)$ (the Markov boundary is unique for strictly positive distributions). Let the Markov network $G$ be the undirected graph with nodes $x_1, \\ldots, x_N$ and edges between $x_i$ and $x_j$ if and only if $x_i \\in B_j$ (which also implies $x_j \\in B_i$). In other words, a Markov network is generated by connecting each node to the nodes in its Markov boundary. Then for any set $Z \\subseteq (X - \\{x_i, x_j\\})$, $x_i$ is independent of $x_j$ given $Z$ if and only if every path from $x_i$ to $x_j$ goes through at least one node in $Z$. In other words, two variables are independent if every path between those variables is \"blocked\" by a known variable. In particular, a variable is independent of the remaining variables if the variables in its Markov boundary are known. 
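The path-blocking criterion described here can be sketched in a few lines of Python. This is only an illustration and not part of the original paper: the function names is_separated and markov_boundary and the four-node example graph are hypothetical, and the graph is assumed to be given as a dictionary of neighbor sets, with xi and xj distinct and not contained in z.

```python
from collections import deque

def is_separated(adj, xi, xj, z):
    # Breadth-first search from xi that never enters a node in z.
    # xi and xj are separated by z iff xj cannot be reached this way,
    # i.e. every path from xi to xj passes through a node in z.
    blocked = set(z)
    seen = {xi}
    queue = deque([xi])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb == xj:
                return False
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                queue.append(nb)
    return True

def markov_boundary(adj, xi):
    # In a Markov network the boundary of a node is its neighbor set.
    return set(adj[xi])

# Hypothetical example network: a - b - c - d plus the edge b - d
adj = {'a': {'b'}, 'b': {'a', 'c', 'd'}, 'c': {'b', 'd'}, 'd': {'b', 'c'}}
```

With this example graph, knowing b blocks every path between a and c, whereas knowing only d does not; conditioning on the full Markov boundary of c separates c from all remaining nodes.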
\n\nA clique in $G$ is a maximal fully connected subgraph. Given a Markov network $G$ for $p(x)$ it can be shown that $p$ can be factorized as a product of positive functions on the cliques of $G$, i.e. \n\n$$p(x) = \\frac{1}{K} \\prod_i g_i(x_{\\mathrm{clique}_i}) \\qquad (1)$$ \n\nwhere the product is over all cliques in the graph. $x_{\\mathrm{clique}_i}$ is the projection of $x$ to the variables of the $i$-th clique and the $g_i$ are the compatibility functions w.r.t. $\\mathrm{clique}_i$. $K = \\int \\prod_i g_i(x_{\\mathrm{clique}_i}) \\, dx$ is the normalization constant. Note that a state whose clique functions have large values has high probability. The theorem of Hammersley and Clifford states that the normalized product in equation 1 embodies all the conditional independencies portrayed by the graph (Pearl, 1988) [2] for any choice of the $g_i$. \n\nIf the graph is sparse, i.e. if many conditional independencies exist, then the cliques might be small and the product will be over low-dimensional functions. Similar to Bayesian networks, where the complexity of describing a joint probability density is greatly reduced by decomposing the joint density into a product of ideally low-dimensional conditional densities, equation 1 describes the decomposition of a joint probability density function into a product of ideally low-dimensional compatibility functions. It should be noted that Bayesian networks and Markov networks differ in which specific independencies they can represent (Pearl, 1988). \n\n[1] To simplify the discussion we will assume strict positivity for the rest of this paper. For some of the statements weaker conditions may also be sufficient. Note that strict positivity implies that functional constraints (for example, a = b) are excluded. \n\n[2] In terms of graphical models: The graph G is an I-map of p. 
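To make the normalized product of clique functions concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than the paper's setting: binary variables stand in for continuous ones so that the normalization constant K is an exact sum instead of an integral, and the chain a - b - c with the compatibility functions g_ab and g_bc is hypothetical.

```python
from itertools import product

# Hypothetical compatibility functions on the cliques {a, b} and {b, c}
# of the chain a - b - c; any strictly positive values would do.
def g_ab(a, b):
    return 2.0 if a == b else 0.5

def g_bc(b, c):
    return 3.0 if b == c else 1.0

states = list(product([0, 1], repeat=3))

# Normalization constant K: the sum (an integral in the continuous case)
# of the product of all clique functions over every joint state.
K = sum(g_ab(a, b) * g_bc(b, c) for a, b, c in states)

def p(a, b, c):
    # Joint probability as the normalized product of clique functions.
    return g_ab(a, b) * g_bc(b, c) / K

def marginal(**fixed):
    # Sum the joint over all states that agree with the fixed values.
    return sum(p(a, b, c) for a, b, c in states
               if all({'a': a, 'b': b, 'c': c}[k] == v for k, v in fixed.items()))

def max_ci_violation():
    # a and c share no clique, so p(a, c | b) should equal p(a | b) p(c | b).
    worst = 0.0
    for a, b, c in states:
        pb = marginal(b=b)
        lhs = p(a, b, c) / pb
        rhs = (marginal(a=a, b=b) / pb) * (marginal(b=b, c=c) / pb)
        worst = max(worst, abs(lhs - rhs))
    return worst
```

The check in max_ci_violation illustrates the statement attributed to Hammersley and Clifford: whatever positive values the two clique functions take, a and c come out conditionally independent given b.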
\n\n3 Learning the Markov Network \n\n3.1 Likelihood Function Based Learning \n\nLearning graphical stochastic models is usually decomposed into the problems of learning structure (that is, the edges in the graph) and of learning the parameters of the joint density function under the constraint that it obeys the independence statements made by the graph. The idea is to generate candidate structures according to some search strategy, learn the parameters for each structure and then judge the structure on the basis of the (penalized) likelihood of the model or, in a fully Bayesian approach, using a Bayesian scoring metric. \n\nAssume that the compatibility functions in equation 1 are approximated using a function approximator such as a neural network, $g_i^w(\\cdot) \\approx g_i(\\cdot)$. Let $\\{x^p\\}_{p=1}^N$ be a training set. With likelihood $L = \\prod_{p=1}^N p^M(x^p)$ (where the $M$ in $p^M$ indicates a probability density model in contrast to the true distribution), the gradient of the log-likelihood with respect to weight $w_i$ in $g_i^w(\\cdot)$ becomes \n\n$$\\sum_{p=1}^N \\frac{\\partial}{\\partial w_i} \\log p^M(x^p) = \\sum_{p=1}^N \\frac{\\partial}{\\partial w_i} \\log g_i^w(x^p_{\\mathrm{clique}_i}) - N \\, \\frac{\\int \\left( \\frac{\\partial}{\\partial w_i} \\log g_i^w(x_{\\mathrm{clique}_i}) \\right) \\prod_j g_j^w(x_{\\mathrm{clique}_j}) \\, dx}{\\int \\prod_j g_j^w(x_{\\mathrm{clique}_j}) \\, dx} \\qquad (2)$$ \n\nwhere the sums are over $N$ training patterns. The gradient decomposes into two terms. Note that the training patterns appear explicitly only in the first term and that, conveniently, the first term depends only on the clique $i$ which contains parameter $w_i$. The second term emerges from the normalization constant $K$ in equation 1. The difficulty is that the integrals in the second term cannot be solved in closed form for universal types of compatibility functions $g_i$ and have to be approximated numerically, typically using a form of Monte Carlo integration. 
This is exactly what is done in the Boltzmann machine, which is a special case of a Markov network with discrete variables [3]. \n\nCurrently, we consider maximum likelihood learning based on the compatibility functions unsuitable, given the complexity and slowness of Monte Carlo integration (i.e. stochastic sampling). Note that for structural learning the maximum likelihood learning is in the inner loop and would have to be executed repeatedly for a large number of structures. \n\n[3] A fully connected Boltzmann machine does not display any independencies and we only have one clique consisting of all variables. The compatibility function is $g(\\cdot) = \\exp(-\\sum w_{ij} s_i s_j)$. The Boltzmann machine typically contains hidden variables, such that not only the second term (corresponding to the unclamped phase) in equation 2 has to be approximated using stochastic sampling but also the first term. (In this paper we only consider the case that data are complete.) \n\n3.2 Markov Boundary Learning \n\nThe difficulties in using maximum likelihood learning for finding optimal structures motivated the approach pursued in this paper. If the underlying true probability density is known, the structure in a Markov network can be found using either the edge deletion method or the Markov boundary method (Pearl, 1988). The edge deletion method uses the fact that variables a and b are not connected by an edge if and only if a and b are independent given all other variables. Evaluating this test for each pair of variables reveals the structure of the network. The Markov boundary method consists of determining, for each variable a, its Markov boundary and connecting a to each variable in its Markov boundary. Both approaches are simple if we have a reliable test for true conditional independence. 
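The two methods can be illustrated for one special case in which the true density is known and the independence test is exact. The sketch below assumes a multivariate Gaussian, which is an illustrative choice and not the nonlinear setting of this paper: for a Gaussian, two variables are independent given all others exactly when the corresponding entry of the precision (inverse covariance) matrix is zero. The matrix Q and all function names are hypothetical.

```python
# Hypothetical precision (inverse covariance) matrix of a 4-variable
# Gaussian; it is symmetric and diagonally dominant, hence positive definite.
Q = [
    [2.0, -0.8, 0.0, 0.0],
    [-0.8, 2.0, -0.5, 0.0],
    [0.0, -0.5, 2.0, -0.7],
    [0.0, 0.0, -0.7, 2.0],
]

def independent_given_rest(q, a, b, tol=1e-12):
    # Exact test for the Gaussian case: a zero precision entry means
    # a and b are independent given all other variables.
    return abs(q[a][b]) < tol

def edge_deletion(q):
    # Keep an edge for every pair that fails the independence test.
    n = len(q)
    return {(a, b) for a in range(n) for b in range(a + 1, n)
            if not independent_given_rest(q, a, b)}

def markov_boundary_method(q):
    # Determine the boundary of each variable, then connect the variable
    # to every member of its boundary.
    n = len(q)
    edges = set()
    for a in range(n):
        boundary = {b for b in range(n)
                    if b != a and not independent_given_rest(q, a, b)}
        edges |= {(min(a, b), max(a, b)) for b in boundary}
    return edges
```

With an exact test both procedures recover the same chain structure. The remainder of this section concerns the harder case in which the test must be replaced by an estimate from finite data.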
\n\nNeither method can be applied directly for learning structure from data, since here tests for conditional independence cannot be based on the true underlying probability distribution (which is unknown) but have to be inferred from a finite data set. The hope is that dependencies which are strong enough to be supported by the data can still be reliably identified. It is, however, not difficult to construct cases where simply using an (unreliable) statistical test for conditional independence with the edge deletion method does not work well [4]. \n\nWe now describe our approach, which is motivated by the Markov boundary method. First, we start with a fully connected graph. We train a model $p_i^M$ to approximate the conditional density of each variable $i$, given the current candidate variables for its Markov boundary $B_i$, which initially are all other variables. For this we can use a wide variety of neural networks. We use conditional Parzen windows \n\n$$p_i^M(x_i | B_i) = \\frac{\\sum_{p=1}^N G(x_{\\{i\\} \\cup B_i}; x^p_{\\{i\\} \\cup B_i}, \\Sigma_i)}{\\sum_{p=1}^N G(x_{B_i}; x^p_{B_i}, \\Sigma_{i,B_i})} \\qquad (3)$$ \n\nwhere $\\{x^p\\}_{p=1}^N$ is the training set and $G(x; \\mu, \\Sigma)$ is our notation for a multidimensional Gaussian centered at $\\mu$ with covariance matrix $\\Sigma$ evaluated at $x$. The Gaussians in the numerator are centered at $x^p_{\\{i\\} \\cup B_i}$, which is the location of the $p$-th sample in the joint input/output ($\\{x_i\\} \\cup B_i$) space, and the Gaussians in the denominator are centered at $x^p_{B_i}$, which is the location of the $p$-th sample in the input space ($B_i$). There is one covariance matrix $\\Sigma_i$ for each conditional density model, which is shared between all the Gaussians in that model. $\\Sigma_i$ is restricted to a diagonal matrix where the diagonal elements in all dimensions except the output dimension $i$ are the same. So there are only two free parameters in the matrix: the variance in the output dimension and the variance in all input dimensions. $\\Sigma_{i,B_i}$ is equal to $\\Sigma_i$ except that the row and column corresponding to the output dimension have been deleted. 
For each conditional model $p_i^M$, $\\Sigma_i$ was optimized on the basis of the leave-one-out cross-validation log-likelihood. \n\nOur approach is based on tentatively removing edges from the model. Removing an edge decreases the size of the Markov boundary candidates of both affected variables and thus decreases the number of inputs in the corresponding two conditional density models. With the inputs removed, we retrain the two models (in our case, we simply find the optimal $\\Sigma_i$ for the two conditional Parzen windows). If the removal of the edge was correct, the leave-one-out cross-validation log-likelihood (model-score) of the two models should improve since an unnecessary input is removed. (Removing an unnecessary input typically decreases model variance.) We therefore remove an edge if the model-scores of both models improve. We define the edge-removal-score as the smaller of the two improvements in model-score. \n\n[4] The problem is that in the edge deletion method the decision whether or not an edge should be present is made independently for each edge. There are, however, cases where it is obvious that at least one of two edges must be present although the edge deletion method, which tests each edge individually, removes both. \n\nHere is the algorithm in pseudo code: \n\n\u2022 Start with a fully connected network \n\n\u2022 Until no edge-removal-score is positive: \n  - for all edges $edge_{ij}$ in the network \n    * calculate the model-scores of the reduced models $p_i^M(x_i | B_i - \\{j\\})$ and $p_j^M(x_j | B_j - \\{i\\})$ \n    * compare with the model-scores of the current models $p_i^M(x_i | B_i)$ and $p_j^M(x_j | B_j)$ \n    * set the edge-removal-score to the smaller of both model-score improvements \n  - remove the edge for which the edge-removal-score is maximal. 
\n\n\u2022 end \n\n3.3 Inference \n\nNote that we have learned the structure of the Markov network without an explicit representation of the probability density. Although the conditional densities $p(x_i | B_i)$ provide sufficient information to calculate the joint probability density, the latter cannot be easily computed. More precisely, the conditional densities overdetermine the joint density, which might lead to problems if the conditional densities are estimated from data. For inference, we are typically interested in the expected value of an unknown variable, given an arbitrary set of known variables, which can be calculated using Gibbs sampling. Note that the conditional densities $p^M(x_i | B_i)$ which are required for Gibbs sampling are explicitly modeled in our approach by the conditional Parzen windows. Also note that sampling from the conditional Parzen model (as well as from many other neural networks, such as mixture of experts models) is easy [5]. In Hofmann (1997) we show that Gibbs sampling from the conditional Parzen models gives significantly better results than running inference using either a kernel estimator or a Gaussian mixture model of the joint density. \n\n4 Experiments \n\nIn our first experiment we used the Boston housing data set, which contains 506 samples. Each sample consists of the housing price and 13 other variables which supposedly influence the housing price in a Boston neighborhood. Maximizing the cross-validation log-likelihood as score, as described in the previous sections, results in a Markov network with 68 edges. \n\nWhile cross-validation gives an unbiased estimate of whether a direct dependency exists between two variables, the estimate can have a large variance depending on the size of the given data set. 
If the goal of the experiment is to interpret the resulting structure, one would prefer to see only those edges corresponding to direct dependencies which can be clearly identified from the given data set. In other words, if the relationship between two variables observed on the given data set is so weak that we cannot be sure that it is not just an effect of the finite data set size, then we do not want to display the corresponding edge. This can be achieved by adding a penalty per edge to the score of the conditional density models (figure 1). \n\nFigure 2 shows the resulting Markov network for a penalty per edge of 0.2. The goal of the original experiment for which the Boston housing data were collected was to examine whether the air quality (5) has a direct influence on the housing price (14). Our algorithm did not find such an influence, in accordance with the original study. It found that the percentage of low-status population (13) and the average number of rooms (6) are in direct relationship with the housing price. The pairwise relationships between these three variables are displayed in figure 3. \n\n[5] Readers not familiar with Gibbs sampling, please consult Geman and Geman (1984). \n\nFigure 1: Number of edges in the Markov network for the Boston housing data as a function of the penalty per edge. \n\n1 crime rate \n2 percent land zoned for lots \n3 percent nonretail business \n4 located on Charles River? \n5 nitrogen oxide concentration \n6 average number of rooms \n7 percent built before 1940 \n8 weighted distance to employment center \n9 access to radial highways \n10 tax rate \n11 pupil/teacher ratio \n12 percent black \n13 percent lower-status population \n14 median value of homes \n\nFigure 2: Final structure of a run on the full Boston housing data set (penalty = 0.2). \n\nThe scatter plots visualize the relationship between variables 13 and 14, 6 and 14, and between 6 and 13 (from left to right). The left and the middle plots correspond to edges in the Markov network, whereas for the right diagram the corresponding edge (6-13) is missing even though both variables are clearly dependent. The reason is that the dependency between 6 and 13 can be explained as an indirect relationship via variable 14. The Markov network tells us that 13 and 6 are independent given 14, but dependent if 14 is unknown. \n\n[Scatter plots; axes: percent low-status population, average number of rooms] \n\nFigure 3: Pairwise relationship between the variables 6, 13 and 14. Displayed are all data points in the Boston housing data set. \n\nIn a second experiment we used a financial data set. Each pattern corresponds to one business day. The variables in our model are relative changes in certain economic variables from the last business day to the present day which were expected to possibly influence the development of the German stock index DAX and the composite DAX, which contains a larger selection of stocks than the DAX. We used 500 training patterns consisting of 12 variables (figure 4). In comparison to the Boston housing data set most relationships are very weak. Using a penalty per edge of 0.2 leads to a very sparse model with only three edges (2-12, 12-1, 5-11) (not shown). A penalty of 0.025 results in the model shown in figure 4. Note that the composite 
DAX is connected to the DAX mainly through the price earning ratio. While the DAX has direct connections to the Nikkei index and to the DM-USD exchange rate, the composite DAX has a direct connection to the Morgan Stanley index for Germany. Recall that the composite DAX contains the stocks of many smaller companies in addition to the DAX stocks. The graph structure might be interpreted (with all caution) in the way that the composite DAX (including small companies) has a stronger dependency on national business whereas the DAX (only including the stock of major companies) reacts more to international indicators. \n\n1 DAX \n2 composite DAX \n3 3 month interest rates Germany \n4 return Germany \n5 Morgan Stanley index Germany \n6 Dow Jones industrial index \n7 DM-USD exchange rate \n8 US treasury bonds \n9 gold price in DM \n10 Nikkei index Japan \n11 Morgan Stanley index Europe \n12 price earning ratio (DAX stocks) \n\nFigure 4: Final structure of a run on the financial data set with a penalty of 0.025. The small numbers next to the edges indicate the strength of the connection, i.e. the decrease in score (excluding the penalty) when the edge is removed. All variables are relative changes, not absolute values. \n\n5 Conclusions \n\nWe have demonstrated, to our knowledge for the first time, how nonlinear Markov networks can be learned for continuous variables and we have shown that the resulting structures can give interesting insights into the underlying process. We used a representation based on models of the conditional probability density of each variable given its Markov boundary. These models can be trained locally. We showed how searching in the space of all possible structures can be done using this representation. 
\n\nWe suggest using the conditional densities of each variable given its Markov boundary also for inference by Gibbs sampling. Since the required conditional densities are modeled explicitly by our approach and sampling from these is easy, Gibbs sampling is easier and faster to realize than with a direct representation of the joint density. \n\nA topic of further research is the variance in resulting structures, i.e. the fact that different structures can lead to almost equally good models. It would, for example, be desirable to indicate to the user in a principled way the certainty of the existence or nonexistence of edges. \n\nReferences \n\nGeman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-6 (no. 6):721-42. \nHofmann, R. (1997). Inference in Markov Blanket Models. Technical report, in preparation. \nMonti, S., and Cooper, G. (1997). Learning Bayesian belief networks with neural network estimators. In Neural Information Processing Systems 9, MIT Press. \nPearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo: Morgan Kaufmann. \nWhittaker, J. (1990). Graphical models in applied multivariate statistics. Chichester, UK: John Wiley and Sons. \n", "award": [], "sourceid": 1341, "authors": [{"given_name": "Reimar", "family_name": "Hofmann", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}