{"title": "Discovering Weakly-Interacting Factors in a Complex Stochastic Process", "book": "Advances in Neural Information Processing Systems", "page_first": 481, "page_last": 488, "abstract": null, "full_text": "Discovering Weakly-Interacting Factors in a Complex\n\nStochastic Process\n\nSchool of Engineering and Applied Sciences\n\nSchool of Engineering and Applied Sciences\n\nCharlie Frogner\n\nHarvard University\n\nCambridge, MA 02138\n\nAvi Pfeffer\n\nHarvard University\n\nCambridge, MA 02138\n\nfrogner@seas.harvard.edu\n\navi@eecs.harvard.edu\n\nAbstract\n\nDynamic Bayesian networks are structured representations of stochastic pro-\ncesses. Despite their structure, exact inference in DBNs is generally intractable.\nOne approach to approximate inference involves grouping the variables in the\nprocess into smaller factors and keeping independent beliefs over these factors.\nIn this paper we present several techniques for decomposing a dynamic Bayesian\nnetwork automatically to enable factored inference. We examine a number of fea-\ntures of a DBN that capture different types of dependencies that will cause error in\nfactored inference. An empirical comparison shows that the most useful of these\nis a heuristic that estimates the mutual information introduced between factors\nby one step of belief propagation. In addition to features computed over entire\nfactors, for ef\ufb01ciency we explored scores computed over pairs of variables. We\npresent search methods that use these features, pairwise and not, to \ufb01nd a factor-\nization, and we compare their results on several datasets. Automatic factorization\nextends the applicability of factored inference to large, complex models that are\nundesirable to factor by hand. 
Moreover, tests on real DBNs show that automatic factorization can achieve signi\ufb01cantly lower error in some cases.\n\n1 Introduction\n\nDynamic Bayesian networks (DBNs) are graphical model representations of discrete-time stochastic processes. DBNs generalize hidden Markov models and are used for modeling a wide range of dynamic processes, including gene expression [1] and speech recognition [2]. Although a DBN represents the process\u2019s transition model in a structured way, all variables in the model might become jointly dependent over the course of the process, so exact inference in a DBN usually requires tracking the full joint probability distribution over all variables and is generally intractable. Factored inference approximates this joint distribution as the product of smaller distributions over groups of variables (factors) and in this way enables tractable inference for large, complex models. Inference algorithms based on this idea include Boyen-Koller [3], the Factored Frontier [4] and Factored Particle Filtering [5].\n\nFactored inference has generally been demonstrated for models that are factored by hand. In this paper we show that it is possible to select a good factorization algorithmically, not only extending the applicability of factored inference to larger models, for which it might be undesirable to choose a factorization manually, but also allowing for better (and sometimes \u2018non-obvious\u2019) factorizations. The quality of a factorization is de\ufb01ned by the amount of error incurred by repeatedly discarding the dependencies between factors and treating them as independent during inference. As such we formulate the goal of our algorithm as the minimization, over factorizations, of an objective that describes the error we expect due to this type of approximation. 
For this purpose we have examined a range of features that can be computed from the speci\ufb01cation of the DBN, based both on the underlying graph structure and on two essential conceptions of weak interaction between factors: the degree of separability [6] and mutual information. For each principle we investigated a number of heuristics. We \ufb01nd that the mutual information between factors that is introduced by one step of belief state propagation is especially well-suited to the problem of \ufb01nding a good factorization.\n\nComplexity is an issue in searching for good factors, as the search space is large and the scoring heuristics themselves are computationally intensive. We compare several search methods for \ufb01nding factors that allow for different tradeoffs between the ef\ufb01ciency and the quality of the factorization. The fastest is a graph partitioning algorithm in which we \ufb01nd a k-way partition of a weighted graph with edge-weights being pairwise scores between variables. Agglomerative clustering and local search methods use the higher-order scores computed between whole factors, and are hence slower while \ufb01nding better factorizations. The more expensive of these methods are most useful when run of\ufb02ine, for example when the DBN is to be used for online inference and one cares about \ufb01nding a good factorization ahead of time. We additionally give empirical results on two real DBN models as well as randomly-generated models. Our results show that dynamic Bayesian networks can be decomposed ef\ufb01ciently and automatically, enabling wider applicability of factored inference. Furthermore, tests on real DBNs show that using automatically found factors can in some cases yield signi\ufb01cantly lower error than using factors found by hand.\n\n2 Background\n\nA dynamic Bayesian network (DBN), [7] [8], represents a dynamic system consisting of some set of variables that co-evolve in discrete timesteps. 
In this paper we are dealing with discrete variables. We denote the set of variables in the system by X, with the canonical variables being those that directly in\ufb02uence at least one variable in the next timestep. We call the probability distribution over the possible states of the system at a given timestep the belief state. The DBN gives us the probabilities of transitioning from any given system state at t to any other system state at time t + 1, and it does so in a factored way: the probability that a variable takes on a given state at t + 1 depends only on the states of a subset of the variables in the system at t. We can hence represent this transition model as a Bayesian network containing the variables in X at timestep t, denoted Xt, and the variables in X at timestep t + 1, say Xt+1 \u2013 this is called a 2-TBN (for two-timeslice Bayesian network). By inferring the belief state over Xt+1 from that over Xt, and conditioning on observations, we propagate the belief state through the system dynamics to the next timestep. The speci\ufb01cation of a DBN also includes a prior belief state at time t = 0.\n\nNote that, although each variable at t + 1 may only depend on a small subset of the variables at t, its state might be correlated implicitly with the state of any variable in the system, as the in\ufb02uence of any variable might propagate through intervening variables over multiple timesteps. As a result, the whole belief state over X (at a given timestep) in general is not factored. 
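To make this concrete, the following small Python sketch (our own illustration with arbitrary toy transition models, not code from the paper) propagates an initially factored belief state through one step of a two-variable 2-TBN and checks that the resulting joint distribution is no longer a product of its marginals:

```python
# Toy 2-TBN with variables X and Y; each variable at t+1 depends on BOTH
# variables at t. The CPDs here are random placeholders, purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
# p_x[i, j, a] = P(X_{t+1} = a | X_t = i, Y_t = j), similarly p_y.
p_x = rng.random((2, 2, 2))
p_x /= p_x.sum(axis=2, keepdims=True)
p_y = rng.random((2, 2, 2))
p_y /= p_y.sum(axis=2, keepdims=True)

# Start from a factored (independent) belief state over (X_t, Y_t).
bx = np.array([0.7, 0.3])
by = np.array([0.4, 0.6])
joint_t = np.outer(bx, by)

# Exact one-step propagation: X_{t+1} and Y_{t+1} are conditionally
# independent given (X_t, Y_t), so we sum out the previous timeslice.
joint_t1 = np.einsum('ij,ija,ijb->ab', joint_t, p_x, p_y)

# Product of the new marginals -- the factored (BK-style) approximation.
mx = joint_t1.sum(axis=1)
my = joint_t1.sum(axis=0)
approx = np.outer(mx, my)

# The exact joint is generally NOT factored after even one step.
print(np.abs(joint_t1 - approx).max())
```

The gap printed at the end is exactly the dependency that factored inference discards at every step.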
Boyen and Koller, [3], \ufb01nd that, despite this fact, we can factor the system into components whose belief states are kept independently, and the error incurred by doing so remains bounded over the course of the process. The BK algorithm hence approximates the belief state at a given timestep as the product of the local belief states for the factors (their marginal distributions), and does exact inference to propagate this approximate belief state to the next timestep. Both the Factored Frontier, [4], and Factored Particle, [5], algorithms also rely on this idea of a factored belief state representation.\n\nIn [9] and [6], Pfeffer introduced conditions under which a single variable\u2019s (or factor\u2019s) marginal distribution will be propagated accurately through belief state propagation, in the BK algorithm. The degree of separability is a property of a conditional probability distribution that describes the degree to which that distribution can be decomposed as the sum of simpler conditional distributions, each of which depends on only a subset of the conditioning variables. For example, let p(Z|XY ) give the probability distribution for Z given X and Y . If p(Z|XY ) is separable in terms of X and Y to a degree \u03b1, this means that we can write\n\np(Z|XY ) = \u03b1[\u03b3 pX(Z|X) + (1 \u2212 \u03b3) pY (Z|Y )] + (1 \u2212 \u03b1) pXY (Z|XY )    (1)\n\nfor some conditional probability distributions pX(Z|X), pY (Z|Y ), and pXY (Z|XY ) and some parameter \u03b3. We will say that the degree of separability is the maximum \u03b1 such that there exist pX(Z|X), pY (Z|Y ), pXY (Z|XY ), and \u03b3 that satisfy (1). [9] and [6] have shown that if a system is highly separable then the BK algorithm produces low error in the components\u2019 marginal distributions.\n\nPrevious work has explored bounds on the error encountered by the BK algorithm. 
[3] showed that the error over the course of a process is bounded with respect to the error incurred by repeatedly projecting the exact distribution onto the factors, as well as the mixing rate of the system, which can be thought of as the rate at which the stochasticity of the system causes old errors to be forgotten. [10] analyzed the error introduced between the exact distribution and the factored distribution by just one step of belief propagation. The authors noted that this error can be decomposed as the sum of conditional mutual information terms between variables in different factors and showed that each such term is bounded with respect to the mixing rate of the subsystem comprising the variables in that term. Computing the value of this error decomposition, unfortunately, requires one to examine a distribution over all of the variables in the model, which can be intractable. Along with other heuristics, we examined two approaches to automatic factorization that seek directly to exploit the above results, labeled in-degree and out-degree in Table 1.\n\n3 Automatic factorization with pairwise scores\n\nWe \ufb01rst investigated a collection of features, computable from the speci\ufb01cation of the DBN, that capture different types of pairwise dependencies between variables. These features are based both on the 2-TBN graph structure and on two conceptions of interaction: the degree of separability and mutual information. These methods allow us to factorize a DBN without computing expensive whole-factor scores.\n\n3.1 Algorithm: Recursive min-cut\n\nWe use the following algorithm to \ufb01nd a factorization using only scores between pairs of variables. We build an undirected graph over the canonical variables in the DBN, weighting each edge between two variables with their pairwise score. 
An obvious algorithm for \ufb01nding a partition that minimizes pairwise interactions between variables in different factors would be to compute a k-way min-cut, taking, say, the best-scoring such partition in which all factors are below a size limit. Unfortunately, on larger models this approach underperforms, yielding many partitions of size one. Instead we \ufb01nd that a good factorization can be achieved by computing a recursive min-cut, recursing until all factors are smaller than the pre-de\ufb01ned maximum size. We begin with all variables in a single factor. As long as there exists a factor whose size is larger than the maximum, we do the following. For each factor that is too large, we search over the number of smaller factors, k, into which to divide the large factor, for each k computing the k-way min-cut factorization of the variables in the large factor (in our experiments we use a spectral graph partitioning algorithm [11]). We choose the k that minimizes the overall sum of between-factor scores. This is repeated until all factors are of sizes less than the maximum. This min-cut approach is designed only to use scores computed between pairs of variables, and so it sacri\ufb01ces optimality for signi\ufb01cant speed gains.\n\n3.2 Pairwise scores\n\nGraph structure\n\nAs a baseline in terms of speed and simplicity, we \ufb01rst investigated three types of pairwise graph relationships between variables that are indicative of different types of dependency.\n\u2022 Children of common parents. Suppose that two variables at time t + 1, Xt+1 and Yt+1, depend on some common parents Zt. As X and Y share a common, direct in\ufb02uence, we might expect them to become correlated over the course of the process. The score between X and Y is the number of parents they share in the 2-TBN.\n\u2022 Parents of common children. 
Suppose that Xt and Yt jointly in\ufb02uence common children Zt+1. Then we might care more about any correlations between X and Y , because they jointly in\ufb02uence Z. If X and Y are placed in separate factors, then the accuracy of Z\u2019s marginal distribution will depend on how correlated X and Y were. Here the score between X and Y is the number of children they share in the 2-TBN.\n\u2022 Parent to child. If Yt+1 directly depends on Xt, or Xt+1 on Yt, then we expect them to be correlated. The score between X and Y is the number of edges between them in the 2-TBN.\n\nDegree of separability\n\nThe degree of separability for a given factor\u2019s conditional distribution in terms of the other factors gives a measure of how accurately the belief state for that factor will be propagated via that conditional distribution to the next timestep, in BK inference. When a factor\u2019s conditional distribution is highly separable in terms of the other factors, ignored dependencies between the other factors lead to relatively small errors in that factor\u2019s marginal belief state after propagation. We can hence use the degree of separability as an objective to be maximized: we want to \ufb01nd the factorization that yields the highest degree of separability for each factor\u2019s conditional distribution. Computing the degree of separability is a constrained optimization problem, and [12] gives an approximate method of solution. For distributions over many variables the degree of separability is quite expensive to compute, as the number of variables in the optimization grows exponentially with the number of discrete variables in the input conditional distribution. Computing the degree of separability for a small distribution is, however, reasonably ef\ufb01cient. 
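For intuition, the following sketch (our own construction; the values of alpha and gamma and the component CPDs are arbitrary choices, not values from the paper) builds a conditional distribution that is separable to degree at least \u03b1 = 0.8 in the sense of equation (1):

```python
# Build a CPD p(Z | X, Y) as the alpha-weighted mixture from equation (1).
# Axes: (x, y, z); single-parent components broadcast over the missing parent.
import numpy as np

rng = np.random.default_rng(1)

def random_cpd(shape):
    # Random conditional distribution, normalized over the last axis (Z).
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_zx = random_cpd((2, 1, 2))    # p_X(Z | X), independent of Y
p_zy = random_cpd((1, 2, 2))    # p_Y(Z | Y), independent of X
p_zxy = random_cpd((2, 2, 2))   # residual p_XY(Z | X, Y)

alpha, gamma = 0.8, 0.5
p = alpha * (gamma * p_zx + (1 - gamma) * p_zy) + (1 - alpha) * p_zxy

# p is a valid CPD: every conditional slice sums to one. By construction an
# alpha fraction of its mass depends on only one parent at a time, so its
# degree of separability is at least 0.8.
assert np.allclose(p.sum(axis=-1), 1.0)
```

Finding the *maximum* such \u03b1 for a given CPD is the constrained optimization problem mentioned above; this sketch only shows the forward direction of the decomposition.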
In adapting the degree of separability to a pairwise score for the min-cut algorithm, we took two approaches.\n\u2022 Separability of the pair\u2019s joint conditional distribution: We assign a score to the pair of canonical variables X and Y equal to the degree of separability for the joint conditional distribution p(Xt+1Yt+1|Parents(Xt+1) \u222a Parents(Yt+1)). We want to maximize this value for variables that are joined in a factor, as a high degree of separability implies that the error of the factor marginal distribution after propagation in BK will be low. Note that the degree of separability is de\ufb01ned in terms of groups of parent variables. If we have, for example, p(Z|W XY ), then this distribution might be highly separable in terms of the groups XY and W , but not in terms of W X and Y . If, however, p(Z|W XY ) is highly separable in terms of W , X and Y grouped separately, then it is at least as separable in terms of any other groupings. We compute the degree of separability for the above joint conditional distribution in terms of the parents taken separately.\n\u2022 Non-separability between parents of a common child: If two parents are highly non-separable in a common child\u2019s conditional distribution, then the child\u2019s marginal distribution can be rendered inaccurate by placing these two parents in different components. For two variables X and Y , we refer to the shared children of Xt and Yt in timeslice t + 1 as Zt+1. The strength of interaction between X and Y is de\ufb01ned to be the average degree of non-separability for each variable in Zt+1 in terms of its parents taken separately. The degree of non-separability is one minus the degree of separability.\n\nMutual information\n\nWhereas the degree of separability is a property of a single factor\u2019s conditional distribution, the mutual information between two factors measures their joint dependencies. 
To compute it exactly requires, however, that we obtain a joint distribution over the two factors. All we are given is a DBN de\ufb01ning the conditional distribution over the next timeslice given the previous, and some initial distribution over the variables at time 1. In order to obtain a suitable joint distribution over the variables at t + 1 we must assume a prior distribution over the variables at time t. We therefore examine several features based on the mutual information that we can compute from the DBN in this way, to capture different types of dependencies.\n\u2022 Mutual information after one timestep: We assume a prior distribution over the variables at time t and do one step of propagation to get a marginal distribution over Xt+1 and Yt+1. We then use this marginal to compute the mutual information between X and Y , thus estimating the degree of dependency between X and Y that results from one step of the process.\n\u2022 Mutual information between timeslices t and t + 1: We measure the dependencies resulting from X and Y directly in\ufb02uencing each other between timeslices: the more information Xt carries about Yt+1, the more we expect them to become correlated as the process evolves. Again, we assume a prior distribution at time t and use this to obtain the joint distribution p(Yt+1, Xt), from which we can calculate their mutual information. We sum the mutual information between Xt and Yt+1 and that between Yt and Xt+1 to get the score.\n\u2022 Mutual information from the joint over both timeslices: We take into account all possible direct in\ufb02uences between X and Y , by computing the mutual information between the sets of variables (Xt \u222a Xt+1) and (Yt \u222a Yt+1). As before, we assume a prior distribution at time t to compute a joint distribution p(Xt \u222a Xt+1, Yt \u222a Yt+1), from which we can get the mutual information.\n\nThere are many possibilities for a prior distribution at time t. 
We can assume a uniform distribution, in which case the resulting mutual information values are exactly those introduced by one step of inference, as all variables are independent at time t. More costly would be to generate samples from the DBN and to do inference, computing the average mutual information values observed over the steps of inference. We found that, on small examples, there was little practical bene\ufb01t to doing the latter. For simplicity we use the uniform prior, although the effects of different prior assumptions deserve further inquiry.\n\n3.3 Empirical comparison\n\nWe compared the preceding pairwise scores by factoring randomly-generated DBNs, using the BK algorithm for belief state monitoring. We computed two error measures. The \ufb01rst is the joint belief state error, which is the relative entropy between the product of the factor marginal belief states and the exact joint belief state. The second is the average factor belief state error, which is the average over all factors of the relative entropy between each factor\u2019s marginal distribution and the equivalent marginal distribution from the exact joint belief state. We were constrained to choose datasets on which exact inference is tractable, which limited both the number of state variables and the number of parameters per variable. Note that in our tables the joint KL distance is always given in terms of 10\u22122, while the factor marginal KL distance is in terms of 10\u22124.\n\nFor this comparison we used two datasets. The \ufb01rst is a large, relatively uncomplicated dataset that is intended to elucidate basic distinctions between the different heuristics. It consists of 400 DBNs, each of which contains 12 binary-valued state variables and 4 noisy observation variables. 
We tried to capture the tendency in real DBNs for variables to depend on a varying number of parents by drawing the number of parents for each variable from a Gaussian distribution with mean 2 and standard deviation 1 (rounding the result and truncating at zero), and choosing parents uniformly from among the other variables. In real models variables usually, but not always, depend on themselves in the previous timeslice, and each variable in our networks also depended on itself with probability 0.75. Finally, the parameters for each variable were drawn randomly with a uniform prior.\n\nThe second dataset is intended to capture more complicated structures commonly seen in real DBNs: determinism and context-speci\ufb01c independence. It consists of 50 larger models, each with 20 binary state variables and 8 noisy observation variables. Parents and parameters were chosen as before, except that in this case we chose several variables to be deterministic, each computing a Boolean function of its parents, and several other variables to have tree-structured context-speci\ufb01c independence. To generate context-speci\ufb01c independence, the variable\u2019s parents were randomly permuted and between one half and all of the parents were each chosen to induce independence between the child variable and the parents lower in the tree, conditional upon one of its states.\n\nThe results are shown in Table 1. For reference we have shown two additional methods that minimize the maximum out-degree and in-degree of factors. These are suggested by Boyen and Koller as a means of controlling the mixing rate of factored inference, which is used to bound the error. In all cases the mutual-information-based factorizations, and in particular the mutual information after one timestep, yielded lower error, both in the joint belief state and in the factor marginal belief states. 
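The best-performing heuristic, the mutual information after one timestep, can be sketched as follows (our own reconstruction, assuming a uniform independent prior at time t and tabular CPDs; this is not the authors' implementation):

```python
# Pairwise score: mutual information between X_{t+1} and Y_{t+1} after one
# step of propagation from a uniform, independent prior over (X_t, Y_t).
import numpy as np

def mi_after_one_step(p_x, p_y):
    # p_x[i, j, a] = P(X_{t+1} = a | X_t = i, Y_t = j); likewise p_y.
    n_i, n_j = p_x.shape[0], p_x.shape[1]
    prior = np.full((n_i, n_j), 1.0 / (n_i * n_j))
    # Joint over (X_{t+1}, Y_{t+1}); the two are conditionally independent
    # given the previous timeslice, so we just sum it out.
    joint = np.einsum('ij,ija,ijb->ab', prior, p_x, p_y)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())
```

The score is zero when each CPD ignores the other variable, and grows with the coupling induced by shared dependence on the previous timeslice; the recursive min-cut then uses these values as edge weights.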
The degree of separability is apparently not well-adapted to a pairwise score, given that it is naturally de\ufb01ned in terms of an entire factor.\n\nTable 1: Random DBNs with pairwise scores. Joint KL distances are \u00d710\u22122; factor marginal KL distances are \u00d710\u22124.\n\nScore                                  | 12 nodes: Joint KL, Factor KL | 20 nodes/determinism/CSI: Joint KL, Factor KL\nOut-degree                             | 2.50, 16.0 | 1.25, 10.0\nIn-degree                              | 2.44, 15.1 | 1.20, 8.54\nChildren of common parents             | 2.61, 15.5 | 1.87, 10.0\nParents of common children             | 1.98, 11.9 | 1.01, 5.92\nParent to child                        | 2.28, 14.9 | 1.19, 6.62\nSeparability between parents           | 2.69, 15.3 | 1.09, 14.0\nSeparability of pairs of variables     | 2.80, 18.5 | 1.27, 12.0\nMut. information after timestep        | 1.11, 7.11 | 0.408, 3.44\nMut. information between timeslices    | 1.62, 9.73 | 0.664, 4.96\nMut. information from both timeslices  | 1.65, 10.5 | 0.575, 5.15\n\n4 Exploiting higher-order interactions\n\nThe pairwise heuristics described above do not take into account higher-order properties of whole groups of variables: the mutual information between two factors is usually not exactly the sum of its constituent pairwise information relationships, and the degree of separability is naturally formulated in terms of a whole factor\u2019s conditional distribution and not between arbitrary pairs of variables. Two search algorithms allow us to use scores computed for whole factors, and to \ufb01nd better factors while sacri\ufb01cing speed.\n\n4.1 Algorithms: Agglomerative clustering and local search\n\nAgglomerative clustering begins with all canonical variables in separate factors, and at each step chooses a pair of factors to merge such that the score of the factorization is minimized. If a merger leads to a factor of size greater than some given maximum, it is ignored. The algorithm stops when no advantageous merger is found. As the factors being scored are always of relatively small size, agglomerative clustering allows us to use full-factor scores.\n\nLocal search begins with some initial factorization and attempts to \ufb01nd a factorization of minimum score by iteratively modifying this factorization. More speci\ufb01cally, from any given factorization moves of the following three types are considered: create a new factor with a single node, move a single node from one factor into another, or swap a pair of nodes in different factors. At each iteration only those moves that do not yield a factor of size greater than some given maximum are considered. The move that yields the lowest score at that iteration is chosen. If there is no move that decreases the score (and so we have hit a local minimum), however, the factors are randomly re-initialized and the algorithm continues searching, terminating after a \ufb01xed number of iterations. The factorization with the lowest score of all that were examined is returned. As with agglomerative clustering, local search enables the use of full-factor scores. We have found that good results are achieved when the factors are initialized (and re-initialized) to be as large as possible. 
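The move set just described can be sketched as a single greedy step (a simplified version of our own devising; score_fn stands in for a whole-factor objective such as the between-factor mutual information, and max_size for the factor size limit; this is not the authors' code):

```python
# One greedy step of the local search: enumerate all size-respecting moves
# of the three kinds and return the candidate factorization of lowest score.
import itertools

def best_move(factors, score_fn, max_size):
    candidates = []
    for i, f in enumerate(factors):
        for v in f:
            # Move 1: split v off into a new singleton factor.
            new = [list(g) for g in factors]
            new[i] = [u for u in f if u != v]
            new.append([v])
            candidates.append([g for g in new if g])
            # Move 2: move v into each other factor with room for it.
            for j, g in enumerate(factors):
                if j != i and len(g) < max_size:
                    new = [list(h) for h in factors]
                    new[i] = [u for u in f if u != v]
                    new[j] = list(g) + [v]
                    candidates.append([h for h in new if h])
    # Move 3: swap a pair of nodes between two different factors.
    for (i, f), (j, g) in itertools.combinations(enumerate(factors), 2):
        for v, w in itertools.product(f, g):
            new = [list(h) for h in factors]
            new[i] = [u for u in f if u != v] + [w]
            new[j] = [u for u in g if u != w] + [v]
            candidates.append(new)
    return min(candidates, key=score_fn, default=None)
```

The outer loop would call best_move until score_fn stops decreasing, then randomly re-initialize and keep the best factorization seen overall, as described above.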
In addition, although the third type of move (swapping) is a composition of the other two, we have found that the sequence of moves leading to an advantageous swap is not always a path of strictly decreasing scores, and performance degrades without it.\n\nWe note that all of the algorithms bene\ufb01t greatly from caching the components of the scores that are computed.\n\n4.2 Empirical comparison\n\nWe veri\ufb01ed that the results for the pairwise scores extend to whole-factor scores on a dataset of 120 randomly-generated DBNs, each of which contained 8 binary-valued state variables. We were signi\ufb01cantly constrained in our choice of models by the complexity of computing the degree of separability for large distributions: even on these smaller models, doing agglomerative clustering with the degree of separability sometimes took over 2 hours and local search much longer. We have therefore con\ufb01ned our comparison to agglomerative clustering on 8-variable models. We divided the dataset into three groups to explore the effects of extensive determinism and of context-speci\ufb01c independence separately.\n\nThe mutual information after one timestep again produced the lowest error, both in the factor marginal belief states and in the joint belief state. For the networks with large amounts of context-speci\ufb01c independence, the degree of separability was always close to one, and this might have hampered its effectiveness for clustering. Interestingly, we see that agglomerative clustering can sometimes produce results that are worse than those for graph partitioning, although local search consistently outperforms the two. This may be due to the fact that agglomerative clustering tends to produce smaller clusters than the divisive approach. Finally, we note that, although determinism greatly increased error, the relative performance of the different heuristics and algorithms was unchanged. 
Local search consistently found lower-error factorizations.\n\nWe further compared the different algorithms on the dataset with 12 state variables per DBN, from Section 3.3, using the mutual information after one timestep score. It is perhaps surprising that the graph min-cut algorithm can perform comparably with the others, given that it is restricted to pairwise scores.\n\nTable 2: Random DBNs using pairwise and whole-factor scores. Joint KL distances are \u00d710\u22122; factor marginal KL distances are \u00d710\u22124.\n\nScore type/Search algorithm                   | 8 nodes: Joint, Factor | 8 nodes/determ.: Joint, Factor | 8 nodes/CSI: Joint, Factor\nSeparability between parents: Min-cut         | 2.36, 2.54 | 38.9, 70 | 0.82, 0.45\nSeparability b/t pairs of variables: Min-cut  | 2.42, 2.12 | 27.2, 139 | 0.56, 0.31\nWhole-factor separability: Agglomerative      | 2.19, 1.23 | 31.1, 61 | 0.99, 0.46\nMut. info. after one timestep: Min-cut        | 1.20, 1.00 | 18.1, 44 | 0.25, 0.11\nMut. info. after one timestep: Agglomerative  | 1.15, 1.13 | 19.0, 43 | 0.20, 0.11\nMut. info. after one timestep: Local search   | 1.05, 0.90 | 13.8, 32 | 0.18, 0.098\nMut. info. between timeslices: Min-cut        | 1.62, 1.17 | 27.7, 47 | 0.55, 0.24\nMut. info. between timeslices: Agglomerative  | 1.60, 1.45 | 27.6, 61 | 0.53, 0.32\nMut. info. between timeslices: Local search   | 1.40, 1.20 | 23.8, 44 | 0.52, 0.32\nMut. info. both timeslices: Min-cut           | 1.88, 1.51 | 22.9, 45 | 0.64, 0.36\nMut. info. both timeslices: Agglomerative     | 1.86, 1.08 | 25.1, 62 | 0.66, 0.34\nMut. info. both timeslices: Local search      | 1.70, 0.95 | 23.1, 26 | 0.58, 0.29\n\n5 Factoring real models\n\nBoyen and Koller, [3], demonstrated factored inference on two models that were factored by hand: the Bayesian Automated Taxi network and the water network. Table 3 shows the performance of automatic factorization on these two DBNs. In both cases automatic factorization recovered reasonable factorizations that performed better than those found manually.\n\nThe Bayesian Automated Taxi (BAT) network, [13], is intended to monitor highway traf\ufb01c and car state for an automated driving system. 
The DBN contains 10 persistent state variables and 10 observation variables. Local search with factors of 5 or fewer variables yielded exactly the 5+5 clustering given in the paper. When allowing 4 or fewer variables per factor, local search and agglomerative search both recovered the factorization ([LeftClr], [RightClr], [LatAct+Xdot+InLane], [FwdAct+Ydot+Stopped+EngStatus], [FrontBackStatus]), while graph min-cut found ([EngStatus], [FrontBackStatus], [InLane], [Ydot], [FwdAct+Ydot+Stopped+EngStatus], [LatAct+LeftClr]). The manual factorization from [3] is ([LeftClr+RightClr+LatAct], [Xdot+InLane], [FwdAct+Ydot+Stopped+EngStatus], [FrontBackStatus]). The error results are shown in Table 3. Local search took about 300 seconds to complete, while agglomerative clustering took 138 seconds and graph min-cut 12 seconds.\n\nThe water network is used for monitoring the biological processes of a water puri\ufb01cation plant. It has 8 state variables and 4 observation variables (labeled A through H), and all variables are discrete with 3 or 4 states. The agglomerative and local search algorithms yielded the same result ([A+B+C+E], [D+F+G+H]) and graph min-cut was only slightly different ([A+C+E], [D+F+G+H], [B]). The manual factorization from [3] is ([A+B], [C+D+E+F], [G+H]). The results in terms of KL distance are shown in Table 3. The automatically recovered factorizations were on average at least an order of magnitude better. Local search took about one minute to complete, while agglomerative clustering took 30 seconds and graph min-cut 3 seconds.\n\n6 Conclusion\n\nWe compared several heuristics and search algorithms for automatically factorizing a dynamic Bayesian network. These techniques attempt to minimize an objective score that captures the extent to which dependencies that are ignored by the factored approximation will lead to error. 
The heuristics we examined are based both on the structure of the 2-TBN and on the concepts of degree of separability and mutual information. The mutual information after one step of belief propagation has generally been greatly more effective than the others as an objective for factorization. We presented three search methods that allow for tradeoffs between computational complexity and the quality of the factorizations they produce. Recursive min-cut ef\ufb01ciently uses scores between pairs of variables, while agglomerative clustering and local search both use scores computed between whole factors \u2013 the latter two are slower, while achieving better results. Automatic factorization can extend the applicability of factored inference to larger models for which it is undesirable to \ufb01nd factors manually. In addition, tests run on real DBNs show that automatically factorized DBNs can achieve signi\ufb01cantly lower error than hand-factored models. Future work might explore extensions to overlapping factors, which have been found to yield lower error in some cases.\n\nTable 3: Algorithm performance\n\nAlgorithm      | 12-var. random: Jnt., Fact. | BAT: Jnt., Fact. | Water: Jnt., Fact.\nMin-cut        | 1.08, 0.433 | 14.7, 0.723 | 0.430, 1.32\nAgglomerative  | 1.10, 0.55 | 0.390, 0.0485 | 0.0702, 0.566\nLocal search   | 1.06, 0.52 | 0.390, 0.0485 | 0.0702, 0.566\nManual         | -, - | 5.62, 0.0754 | 3.12, 2.12\n\nAcknowledgments\n\nThis work was funded by an ONR project, with special thanks to Dr. Wendy Martinez.\n\nReferences\n\n[1] Sun Yong Kim, Seiya Imoto, and Satoru Miyano. Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brie\ufb01ngs in Bioinformatics, 2003.\n[2] Geoffrey Zweig and Stuart Russell. Dynamic Bayesian networks for speech recognition. In National Conference on Arti\ufb01cial Intelligence (AAAI), 1998.\n[3] Xavier Boyen and Daphne Koller. 
Tractable inference for complex stochastic processes. In Neural Information Processing Systems, 1998.\n[4] Kevin Murphy and Yair Weiss. The factored frontier algorithm for approximate inference in DBNs. In Uncertainty in Arti\ufb01cial Intelligence, 2001.\n[5] Brenda Ng, Leonid Peshkin, and Avi Pfeffer. Factored particles for scalable monitoring. In Uncertainty in Arti\ufb01cial Intelligence, 2002.\n[6] Avi Pfeffer. Approximate separability for weak interaction in dynamic systems. In Uncertainty in Arti\ufb01cial Intelligence, 2006.\n[7] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 1989.\n[8] Kevin Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, U.C. Berkeley, Computer Science Division, 2002.\n[9] Avi Pfeffer. Suf\ufb01ciency, separability and temporal probabilistic models. In Uncertainty in Arti\ufb01cial Intelligence, 2001.\n[10] Xavier Boyen and Daphne Koller. Exploiting the architecture of dynamic systems. In Proceedings AAAI-99, 1999.\n[11] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: analysis and an algorithm. In Neural Information Processing Systems, 2001.\n[12] Charlie Frogner and Avi Pfeffer. Heuristics for automatically decomposing a dynamic Bayesian network for factored inference. Technical Report TR-04-07, Harvard University, 2007.\n[13] Jeff Forbes, Tim Huang, Keiji Kanazawa, and Stuart Russell. The BATmobile: towards a Bayesian automatic taxi. In International Joint Conference on Arti\ufb01cial Intelligence, 1995.\n", "award": [], "sourceid": 1119, "authors": [{"given_name": "Charlie", "family_name": "Frogner", "institution": null}, {"given_name": "Avi", "family_name": "Pfeffer", "institution": null}]}