{"title": "Thin Junction Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 569, "page_last": 576, "abstract": null, "full_text": "Thin Junction Trees\n\nFrancis R. Bach\n\nComputer Science Division\n\nUniversity of California\n\nBerkeley, CA 94720\nfbach@cs.berkeley.edu\n\nMichael I. Jordan\n\nComputer Science and Statistics\n\nUniversity of California\n\nBerkeley, CA 94720\n\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present an algorithm that induces a class of models with thin junction trees\u2014models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. Because the junction tree is kept thin, inference in our models remains tractable throughout the learning process. This both allows an efficient implementation of an iterative scaling parameter estimation algorithm and ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.\n\nIntroduction\n\nMany learning problems in complex domains such as bioinformatics, vision, and information retrieval involve large collections of interdependent variables, none of which has a privileged status as a response variable or class label. In such problems, the goal is generally that of characterizing the principal dependencies in the data, a problem which is often cast within the framework of multivariate density estimation. Simple models are often preferred in this setting, both for their computational tractability and their relative immunity to overfitting. Thus models involving low-order marginal or conditional probabilities\u2014e.g., naive independence models, trees, or Markov models\u2014are in wide use. 
In problems involving higher-order dependencies, however, such strong assumptions can be a serious liability.\n\nA number of methods have been developed for selecting models of higher-order dependencies in data, either within the maximum entropy setting, in which features are selected [9, 16], or within the graphical model setting, in which edges are selected [8]. Simplicity also plays an important role in the design of these algorithms; in particular, greedy methods that add or subtract a single feature or edge at a time are generally employed. The model that results at each step of this process, however, is often not simple, and this is problematic both computationally and statistically in large-scale problems.\n\nIn the current paper we describe a methodology that can be viewed as a generalization of the Chow-Liu algorithm for constructing tree models [2]. Note that tree models have the property that their junction trees have no more than two nodes in any clique\u2014the treewidth of tree models is one. In our generalization, we allow the treewidth to be a larger, but still controlled, value. We fit data within the space of models having \u201cthin\u201d junction trees.\n\nModels with thin junction trees are tractable for exact inference; indeed, the complexity of any type of inference (joint, marginal, conditional) is controlled by the upper bound that is imposed on the treewidth. This makes it possible to achieve some of the flexibility that is often viewed as a generic virtue of generative models, but is not always achievable in practice. For example, in the classification setting we are able to classify partially observed data (e.g., occluded digits) in a simple and direct way\u2014we simply marginalize away the unobserved variables, an operation which is tractable in our models. 
We illustrate this capability in a study of handwritten digit recognition in Section 4.2, where we compare thin junction trees and support vector machines (SVMs), a discriminative technique which does not come equipped with a simple and principled method for handling partially observed data. As we will see, thin junction trees are quite robust to missing data in this domain.\n\nThere are a number of issues that need to be addressed in our framework. In particular, tree models come equipped with particularly efficient algorithms for parameter estimation and model selection\u2014algorithms which do not generalize readily to non-tree models, including thin junction tree models. It is important to show that efficient algorithms can nonetheless be found to fit such models. We show how this can be achieved in Sections 1, 2 and 3. Empirical results using these algorithms are presented in Section 4.\n\n1 Feature induction\n\nWe assume an input space $\\mathcal{X}$ with $n$ variables and a target probability distribution $q$. Our goal is to find a probability distribution $p$ that minimizes the Kullback-Leibler divergence $D(q \\,\\|\\, p)$. Consider a vector-valued \u201cfeature\u201d or \u201csufficient statistic\u201d $f : \\mathcal{X} \\to \\mathbb{R}^d$, where $d$ is the dimensionality of the feature space. The feature $f$ can also be thought of in terms of its components as a set of $d$ real-valued features $f_i$. We focus on exponential family distributions (also known as \u201cGibbs\u201d or \u201cmaximum entropy\u201d distributions) based on these features: $p(x \\mid \\theta) = p_0(x) \\exp(\\theta^\\top f(x)) / Z(\\theta)$, where $\\theta$ is a parameter vector, $p_0$ is a base-measure (typically uniform), and $Z(\\theta)$ is the normalizing constant. (Section 3 considers the closely-related problem of inducing edges rather than features).\n\nEach feature is a function of a certain subset of variables, and we let $C_i \\subset \\{1, 2, \\ldots, n\\}$ index the subset of variables referred to by feature $f_i$. Let us consider the undirected graphical model $G = (V, E)$, where the set of edges $E$ is the set of all pairs included in at least one $C_i$. With this definition the $C_i$ are the maximal cliques of the graph and, if $p_0$ is decomposable in this graph, the exponential family distribution with features $f$ and reference distribution $p_0$ is also decomposable in this graph. We assume without loss of generality that the graph is connected. For each possible triangulation of the graph, we can define a junction tree [4], where for all $i$ there exists a maximal clique containing $C_i$. The complexity of exact inference depends on the size of the maximal clique of the triangulated graph. We define the treewidth $\\tau$ of our original graph to be one less than the minimum possible value of this maximal clique size for all possible triangulations. We say that a graphical model has a thin junction tree if its treewidth $\\tau$ is small.\n\nOur basic feature induction algorithm is a constrained variant of that proposed by [9]. Given a set of available features, we perform a greedy search to find the set of features that enables the best possible fit to $q$, under the constraint of having a thin junction tree. At each step, candidates are ranked according to the gain in KL divergence, with respect to the empirical distribution, that would be achieved by their addition to the current set of features. Features that would generate a graphical cover with treewidth greater than a given upper bound $\\tau_{\\max}$ are removed from the ranking. The parameter values $\\theta$ are held fixed during each step of the feature ranking process. Once a set of candidate features are chosen, however, we reestimate all of the parameters (using the algorithm to be described in Section 2) and iterate.\n\nFEATUREINDUCTION\n\n1. Initialization: $p = p_0$, $f = \\emptyset$, and a set of available features\n2. Repeat steps (a) to (d) until no further progress is made with respect to a model selection criterion (e.g., MDL or cross-validation)\n(a) Ranking: generate samples from $p$ and rank feature candidates according to the KL gain\n(b) Elimination: remove all candidates that would generate a model with treewidth greater than $\\tau_{\\max}$\n(c) Selection: select the $k$ best features and add them to $f$\n(d) Parameter Estimation: Estimate $\\theta$ using the junction tree implementation of Iterative Scaling (see Section 2)\n\nFreezing the parameters during the feature ranking step is suboptimal, but it yields an essential computational efficiency. In particular, as shown by [9], under these conditions we can rank a new feature $f$ by solving a polynomial equation whose degree is the number of values $f$ can take minus one, and whose coefficients are expectations under $p$ of functions of $f$. This equation has only one root and can be solved efficiently by Newton\u2019s method. When the feature $f$ is binary the process is even more efficient\u2014the equation is linear and can be solved directly. Consequently, with a single set of samples from $p$, we can rank many features very cheaply.\n\nFor the feature elimination operation, algorithms exist that determine in time linear in the number of nodes whether a graph has a treewidth smaller than $\\tau_{\\max}$, and if so output a triangulation in which all cliques are of size less than $\\tau_{\\max}$ [1]. These algorithms are super-exponential in $\\tau_{\\max}$, however, and thus are applicable only to problems with small treewidths. In practice we have had success using fast heuristic triangulation methods [11] that allow us to guarantee the existence of a junction tree with a maximal clique no larger than $\\tau_{\\max}$ for a given model. (This is a conservative technique that may occasionally throw out models that in fact have small treewidth).\n\nA critical bottleneck in the algorithm is the parameter estimation step, and it is important to develop a parameter estimation algorithm that exploits the bounded treewidth property. We now turn to this problem.\n\n2 Iterative Scaling using the junction tree\n\nFitting an exponential family distribution under expectation constraints is a well studied problem; the basic technique is known as Iterative Scaling. A generalization of Iterative Proportional Fitting (IPF), it updates the parameters $\\theta_i$ sequentially [5]. Algorithms that update the parameters in parallel have also been proposed; in particular the Generalized Iterative Scaling algorithm [6], which imposes the constraint that the features sum to one, and the Improved Iterative Scaling algorithm [9], which removes this constraint. These algorithms have an important advantage in our setting in that, for each set of parameter updates, they only require computations of expectation that can all be estimated with a single set of samples from the current distribution.\n\nWhen the input dimensionality is large, however, we would like to avoid sampling algorithms altogether. To do so we exploit the bounded treewidth of our models. 
We present a novel algorithm that uses the junction tree and the structure of the problem to speed up parameter estimation. The algorithm generalizes to Gibbs distributions the \u201ceffective IPF\u201d algorithm of [10].\n\nWhen working with a junction tree, an efficient way of performing Iterative Scaling is to update parameters block by block so that each update is performed for a relatively small number of features on a small number of variables. Each block can be fit with any parameter estimation algorithm, in particular Improved Iterative Scaling (IIS). The following algorithm exploits this idea by grouping the features whose supports are in the same clique of the triangulated graph. Thus, parameter estimation is done in spaces of dimensions at most $\\tau + 1$, and all the needed expectations can be evaluated cheaply.\n\n2.1 Notation\n\nLet $f$ be our $d$-dimensional feature. Let $K_1, \\ldots, K_m$ denote the maximal cliques of the triangulated graph, with potentials $\\phi_{K_j}$. We assign each feature $f_i$ to one of the cliques $K_j$ that contains $C_i$. For each clique $K_j$ we denote $F_j$ as the set of features assigned to $K_j$.\n\n2.2 Algorithm\n\nEFFICIENTITERATIVESCALING\n\n1. Initialization:\n\u2013 Construct a junction tree associated with the subsets $C_1, \\ldots, C_d$\n\u2013 Assign each $f_i$ to one $K_j$, such that $C_i \\subset K_j$ (equivalent to determining $F_j$ for all $j$)\n\u2013 Set $\\theta = 0$ and decompose $p_0$ onto the junction tree\n\u2013 Set $p = p_0$\n2. Loop until convergence: Repeat step (3) until convergence of the $\\theta_i$\n3. Loop through all cliques: Repeat steps (a) to (c) for all cliques $K_j$\n(a) Define the root of the junction tree to be $K_j$\n(b) Collect evidence from the leaves to the root of the junction tree and normalize potential $\\phi_{K_j}$\n(c) Calculate the maximum likelihood $|F_j|$-dimensional exponential family distribution with features $F_j$ and reference distribution $\\phi_{K_j}$, using IIS. Replace $\\phi_{K_j}$ by this distribution and add the resulting parameters (one for each feature in $F_j$) to the corresponding $\\theta_i$\n\nAfter step (b), the potential $\\phi_{K_j}$ is exactly $p$ marginalized to $K_j$, so that performing IIS for the features $F_j$ can be done using $\\phi_{K_j}$ instead of the full distribution $p$. Moreover, each pass through all the cliques is equivalent to one pass of Iterative Scaling and therefore this algorithm converges to the maximum likelihood distribution.\n\n3 Edge induction\n\nThus far we have emphasized the exponential family representation. Our algorithm can, however, be adapted readily to the problem of learning the structure of a graphical model. This is achieved by using features that are indicators of subsets of variables, ensuring that there is one such indicator for every combination of values of the variables in a clique. In this case, Iterative Scaling reduces to Iterative Proportional Fitting.\n\nWe generally employ a further approximation when ranking and selecting edges. In particular, we evaluate an edge only in terms of the two variables associated directly with the edge. 
The clique formed by the addition of the edge, however, may involve additional higher-order dependencies, which can be parameterized and incorporated in the model. Evaluating edges in this way thus underestimates the potential gain in KL divergence.\n\nFigure 1: (Left) Circular Boltzmann machine of treewidth 4. (Right) Proportion (in %) of edges not in common between the fitted model and the generating model vs the number of available training examples (in thousands).\n\nWe should not expect to be able to find an exact edge-selection method\u2014recent work by Srebro [15] has shown that the related problem of finding the maximum likelihood graphical model with bounded treewidth is NP-hard.\n\n4 Empirical results\n\n4.1 Small graphs with known generative model\n\nIn this experiment we generate samples from a known graphical model and fit our model to the data. We consider circular Boltzmann machines of known treewidth equal to 4 as shown in Figure 1. Our networks all have 32 nodes and the weights were selected from a uniform distribution over a range chosen so that each edge is significant. For an increasing number of training samples, ten replications were performed for each case using our feature induction algorithm with maximum treewidth equal to 4. Figure 1 shows that with enough samples we are able to recover the structure almost exactly (up to a small percentage of the original edges).\n\n4.2 MNIST digit dataset\n\nIn this section we study the performance of the thin junction tree method on the MNIST dataset of handwritten digits. 
While discriminative methods outperform generative methods in this high-dimensional setting [12], generative methods offer capabilities that are not provided by discriminative classifiers; in particular, the ability to deal with large fractions of missing pixels and the ability to reconstruct images from partial data. It is of interest to see how much performance loss we incur and how much robustness we gain by using a sophisticated generative model for this problem.\n\nThe MNIST training set is composed of $28 \\times 28$ 4-bit grayscale images that have been resized and cropped to $16 \\times 16$ binary images (an example is provided in the leftmost plots in Figure 2). We used thin junction trees as density estimators in the 256-dimensional pixel space by training ten different models, one for each of the ten classes. We used binary features. No vision-based techniques such as de-skewing or virtual examples were used. We utilized ten percent fractions of the training data for cross-validation and test.\n\nDensity estimation: The leftmost plot in Figure 3 shows how increasing the maximal allowed treewidth, ranging from 1 (trees) to 15, enables a better fit to data.\n\nClassification: We built classifiers from the bank of ten thin junction tree (\u201cTJT\u201d) models using one of the following strategies: (1) take the maximum likelihood among the ten models (TJT-ML), or (2) train a discriminative model using the outputs of the ten models. We used softmax regression (TJT-Softmax) and the support vector machine (TJT-SVM) in the latter case.\n\nFigure 2: Digit from the MNIST database. From left to right, original digit, cropped and resized digits used in our experiments, 50% of missing values, 75% of missing values, occluded digit.\n\nFigure 3: (Left) Negative log likelihood for the digit 2 vs maximal allowed treewidth. (Right) Error rate as a function of the percentage of erased pixels for the TJT classifier (plain) and a support vector machine (dotted). See text for details.\n\nThe classification error rates were as follows: LeNet 0.7, SVM 0.8, Product of experts 2.0, TJT-SVM 3.8, TJT-Softmax 4.2, TJT-ML 5.3, Chow-Liu 8.5, and Linear classifier 12.0. (See [12] and [13] for further details on the non-TJT models).\n\nIt is important to emphasize that our models are tractable for full joint inference; indeed, the junction trees have a maximal clique size of 10 in the largest models we used on the ten classes. Thus we can use efficient exact calculations to perform inference. The following two sections demonstrate the utility of this fact.\n\nMissing pixels: We ran an experiment in which pixels were chosen uniformly at random and erased, as shown in Figure 2. In our generative model, we treat them as hidden variables that are marginalized out. The rightmost plot in Figure 3 shows the error rate on the testing set as a function of the percentage of unknown pixels, for our models and for an SVM. In the case of the SVM, we used a polynomial kernel of degree four [7] and we tried various heuristics to fill in the value of the non-observed pixels, such as the average of that pixel over the training set or the value of a blank pixel. 
Best classification performance was achieved by replacing the missing value with the value of a blank pixel. Note that very little performance decrement is seen for our classifier even with up to 50 percent of the pixels missing, while for the SVM, although performance is better for small percentages, performance degrades more rapidly as the percentage of erased pixels increases.\n\nReconstruction: We conducted an additional experiment in which the upper halves of images were erased. We ran the junction tree inference algorithm to fill in these missing values, choosing the maximizing value of the conditional probability (max-propagation). Figure 4 shows the results. For each line, from left to right, we show the original digit, the digit after erasure, reconstructions based on the model having the maximum likelihood, and reconstructions based on the models having the second and third largest values of likelihood.\n\nFigure 4: Reconstructions of images whose upper halves have been deleted. See text for details.\n\n4.3 SPLICE Dataset\n\nThe task in this dataset is to classify splice junctions in DNA sequences. Splice junctions can either be an exon/intron (EI) boundary, an intron/exon (IE) boundary, or no boundary. (Introns are the portions of genes that are spliced out during transcription; exons are retained in the mRNA).\n\nEach sample is a sequence of 60 DNA bases (where each base can take one of four values, A, G, C, or T). The three different classes are: EI exactly at the middle (between the 30th and the 31st bases), IE exactly at the middle (between the 30th and the 31st bases), no splice junction. The dataset is composed of 3175 training samples. 
In order to be able to compare to previous experiments using this dataset, performance is assessed by picking 2000 training data points at random and testing on the 1175 others, with 20 replications.\n\nWe treat classification as a density estimation problem in this case by treating the class variable $y$ as another variable. We classify by choosing the value of $y$ that maximizes the conditional probability $p(y \\mid x)$. We tested both feature induction and edge induction; in the former case only binary features that are products of elementary binary features were tested and induced. MDL was used to pick the number of features or edges.\n\nWe evaluated both our feature induction algorithm, with a maximum treewidth equal to 5, and the edge induction algorithm. The error rates obtained are better than the best reported results in the literature, in particular those of neural networks and of the Chow and Liu algorithm [14].\n\n5 Conclusions\n\nWe have described a methodology for feature selection, edge selection and parameter estimation that can be viewed as a generalization of the Chow-Liu algorithm. Drawing on the feature selection methods of [9, 16], our method is quite general, building an exponential family model from the general vocabulary of features on overlapping subsets of variables. By maintaining tractability throughout the learning process, however, we build this flexible representation of a multivariate density while retaining many of the desirable aspects of the Chow-Liu algorithm.\n\nOur methodology applies equally well to feature or edge selection. 
In large-scale, sparse domains in which overfitting is of particular concern, however, feature selection may be the preferred approach, in that it provides a finer-grained search in the space of simple models than is allowed by the edge selection approach.\n\nAcknowledgements\n\nWe wish to acknowledge NSF grant IIS-9988642 and ONR MURI N00014-00-1-0637. The results presented here were obtained using Kevin Murphy\u2019s Bayes Net Matlab toolbox and SVMTorch [3].\n\nReferences\n\n[1] H. Bodlaender, A linear-time algorithm for finding tree-decompositions of small treewidth, SIAM J. Computing, 25, 1305-1317, 1996.\n\n[2] C.K. Chow and C.N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Information Theory, 14, 462-467, 1968.\n\n[3] R. Collobert and S. Bengio, SVMTorch: support vector machines for large-scale regression problems, Journal of Machine Learning Research, 1, 143-160, 2001.\n\n[4] R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag, New York, 1999.\n\n[5] I. Csisz\u00e1r, I-divergence geometry of probability distributions and minimization problems, Annals of Probability, 3, 146-158, 1975.\n\n[6] J.N. Darroch and D. Ratcliff, Generalized iterative scaling for log-linear models, Ann. Math. Statist., 43, 1470-1480, 1972.\n\n[7] D. DeCoste and B. Sch\u00f6lkopf, Training invariant support vector machines, Machine Learning, 46, 1-3, 2002.\n\n[8] D. Heckerman, D. Geiger, and D.M. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning, 20, 197-243, 1995.\n\n[9] S. Della Pietra, V. Della Pietra, and J. Lafferty, Inducing features of random fields, IEEE Trans. PAMI, 19, 380-393, 1997.\n\n[10] R. Jirousek and S. 
Preucil, On the effective implementation of the iterative proportional fitting procedure, Computational Statistics and Data Analysis, 19, 177-189, 1995.\n\n[11] U. Kjaerulff, Triangulation of graphs\u2014algorithms giving small total state space, Technical Report R90-09, Dept. of Math. and Comp. Sci., Aalborg Univ., Denmark, 1990.\n\n[12] Y. Le Cun, http://www.research.att.com/~yann/exdb/mnist/index.html\n\n[13] G. Mayraz and G. Hinton, Recognizing hand-written digits using hierarchical products of experts, Adv. NIPS 13, MIT Press, Cambridge, MA, 2001.\n\n[14] M. Meila and M.I. Jordan, Learning with mixtures of trees, Journal of Machine Learning Research, 1, 1-48, 2000.\n\n[15] N. Srebro, Maximum likelihood bounded tree-width Markov networks, in UAI 2001.\n\n[16] S.C. Zhu, Y.W. Wu, and D. Mumford, Minimax entropy principle and its application to texture modeling, Neural Computation, 9, 1997.", "award": [], "sourceid": 2069, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}