{"title": "Discriminative Learning of Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3239, "page_last": 3247, "abstract": "", "full_text": "Discriminative Learning of Sum-Product Networks\n\nRobert Gens\n\nPedro Domingos\n\nDepartment of Computer Science and Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195-2350, U.S.A.\n\n{rcg,pedrod}@cs.washington.edu\n\nAbstract\n\nSum-product networks are a new deep architecture that can perform fast, exact in-\nference on high-treewidth models. Only generative methods for training SPNs\nhave been proposed to date.\nIn this paper, we present the \ufb01rst discriminative\ntraining algorithms for SPNs, combining the high accuracy of the former with\nthe representational power and tractability of the latter. We show that the class\nof tractable discriminative SPNs is broader than the class of tractable generative\nones, and propose an ef\ufb01cient backpropagation-style algorithm for computing the\ngradient of the conditional log likelihood. Standard gradient descent suffers from\nthe diffusion problem, but networks with many layers can be learned reliably us-\ning \u201chard\u201d gradient descent, where marginal inference is replaced by MPE infer-\nence (i.e., inferring the most probable state of the non-evidence variables). The\nresulting updates have a simple and intuitive form. We test discriminative SPNs\non standard image classi\ufb01cation tasks. We obtain the best results to date on the\nCIFAR-10 dataset, using fewer features than prior methods with an SPN architec-\nture that learns local image structure discriminatively. 
We also report the highest\npublished test accuracy on STL-10 even though we only use the labeled portion\nof the dataset.\n\n1\n\nIntroduction\n\nProbabilistic models play a crucial role in many scienti\ufb01c disciplines and real world applications.\nGraphical models compactly represent the joint distribution of a set of variables as a product of fac-\ntors normalized by the partition function. Unfortunately, inference in graphical models is generally\nintractable. Low treewidth ensures tractability, but is a very restrictive condition, particularly since\nthe highest practical treewidth is usually 2 or 3 [2, 9]. Sum-product networks (SPNs) [23] overcome\nthis by exploiting context-speci\ufb01c independence [7] and determinism [8]. They can be viewed as a\nnew type of deep architecture, where sum layers alternate with product layers. Deep networks have\nmany layers of hidden variables, which greatly increases their representational power, but inference\nwith even a single layer is generally intractable, and adding layers compounds the problem [3].\nSPNs are a deep architecture with full probabilistic semantics where inference is guaranteed to be\ntractable, under general conditions derived by Poon and Domingos [23]. Despite their tractability,\nSPNs are quite expressive [16], and have been used to solve dif\ufb01cult problems in vision [23, 1].\nPoon and Domingos introduced an algorithm for generatively training SPNs, yet it is generally\nobserved that discriminative training fares better. By optimizing P (Y|X) instead of P (X, Y) con-\nditional random \ufb01elds retain joint inference over dependent label variables Y while allowing for\n\ufb02exible features over given inputs X [22]. Unfortunately, the conditional partition function Z(X)\nis just as prone to intractability as with generative training. For this reason, low treewidth models\n(e.g. chains and trees) of Y are commonly used. 
Research suggests that approximate inference can\nmake it harder to learn rich structured models [21]. In this paper, discriminatively training SPNs\nwill allow us to combine flexible features with fast, exact inference over high treewidth models.\n\nWith inference and learning that easily scale to many layers, SPNs can be viewed as a type of\ndeep network. Existing deep networks employ discriminative training with backpropagation through\nsoftmax layers or support vector machines over network variables. Most networks that are not purely\nfeed-forward require approximate inference. Poon and Domingos showed that deep SPNs could be\nlearned faster and more accurately than deep belief networks and deep Boltzmann machines on a\ngenerative image completion task [23]. This paper contributes a discriminative training algorithm\nthat could be used on its own or with generative pre-training.\nFor the first time we combine the advantages of SPNs with those of discriminative models. In this\npaper we will review SPNs and describe the conditions under which an SPN can represent the\nconditional partition function. We then provide a training algorithm, demonstrate how to compute the\ngradient of the conditional log-likelihood of an SPN using backpropagation, and explore variations\nof inference. Finally, we show state-of-the-art results where a discriminatively-trained SPN achieves\nhigher accuracy than SVMs and deep models on image classification tasks.\n\n2 Sum-Product Networks\n\nSPNs were introduced with the aim of identifying the most expressive tractable representation\npossible. The foundation for their work lies in Darwiche\u2019s network polynomial [14]. We define an\nunnormalized probability distribution \u03a6(x) \u2265 0 over a vector of Boolean variables X. The indicator\nfunction [.] is one when its argument is true and zero otherwise; we abbreviate [Xi] and [ \u00afXi] as xi and\n\u00afxi. To distinguish random variables from indicator variables, we use roman font for the former and\nitalic for the latter. Vectors of variables are denoted by bold roman and bold italic font, respectively.\nThe network polynomial of \u03a6(x) is defined as $\\sum_x \\Phi(x) \\Pi(x)$, where $\\Pi(x)$ is the product of\nindicators that are one in state x. For example, the network polynomial of the Bayesian network X1 \u2192 X2\nis P (x1)P (x2|x1)x1x2 + P (x1)P (\u00afx2|x1)x1 \u00afx2 + P (\u00afx1)P (x2|\u00afx1)\u00afx1x2 + P (\u00afx1)P (\u00afx2|\u00afx1)\u00afx1 \u00afx2. To\ncompute P (X1 = true, X2 = false), we access the corresponding term of the network polynomial\nby setting indicators x1 and \u00afx2 to one and the rest to zero. To find P (X2 = true), we fix evidence on\nX2 by setting x2 to one and \u00afx2 to zero and marginalize X1 by setting both x1 and \u00afx1 to one. Notice\nthat there are two reasons we might set an indicator xi = 1: (1) evidence {Xi = true}, in which\ncase we set \u00afxi = 0 and (2) marginalization of Xi, where \u00afxi = 1 as well. In general the role of an\nindicator xi is to determine whether terms compatible with variable state Xi = true are included in\nthe summation, and similarly for \u00afxi. With this notation, the partition function Z can be computed\nby setting all indicators of all variables to one.\nThe network polynomial has size exponential in the number of variables, but in many cases it can\nbe represented more compactly using a sum-product network [23, 14].\nDefinition 1. (Poon & Domingos, 2011) A sum-product network (SPN) over variables X1, . . . , Xd\nis a rooted directed acyclic graph whose leaves are the indicators x1, . . . , xd and \u00afx1, . . . , \u00afxd and\nwhose internal nodes are sums and products. Each edge (i, j) emanating from a sum node i has a\nnon-negative weight wij. 
The value of a product node is the product of the values of its children.\nThe value of a sum node is $\\sum_{j \\in Ch(i)} w_{ij} v_j$, where Ch(i) are the children of i and v_j is the value of\nnode j. The value of an SPN S[x1, \u00afx1, . . . , xd, \u00afxd] is the value of its root.\n\nIf we could replace the exponential sum over variable states in the partition function with\nthe linear evaluation of the network, inference would be tractable. For example, the SPN\nin Figure 1 represents the joint probability of three Boolean variables P (X1, X2, X3) in the\nBayesian network X2 \u2190 X1 \u2192 X3 using six indicators S[x1, \u00afx1, x2, \u00afx2, x3, \u00afx3]. To compute\nP (X1 = true), we could sum over the joint states of X2 and X3, evaluating the network\na total of four times S[1, 0, 0, 1, 0, 1] + . . . + S[1, 0, 1, 0, 1, 0]. Instead, we set the indicators\nso that the network sums out both X2 and X3. An indicator setting of S[1, 0, 1, 1, 1, 1] computes\nthe sum over all states compatible with our evidence e = {X1 = true} and requires only one evaluation.\n\nFigure 1: SPN over Boolean variables X1, X2, X3\n\nHowever, not every SPN will have this property. If a linear evaluation of an SPN with indicators\nset to represent evidence equals the exponential sum over all variable states consistent with that\nevidence, the SPN is valid.\nDefinition 2. (Poon & Domingos, 2011) A sum-product network S is valid iff S(e) = \u03a6_S(e) for all\nevidence e.\n\nIn their paper, Poon and Domingos prove that there are two conditions sufficient for validity:\ncompleteness and consistency.\nDefinition 3. (Poon & Domingos, 2011) A sum-product network is complete iff all children of the\nsame sum node have the same scope.\nDefinition 4. (Poon & Domingos, 2011) A sum-product network is consistent iff no variable appears\nnegated in one child of a product node and non-negated in another.\nTheorem 1. 
(Poon & Domingos, 2011) A sum-product network is valid if it is complete and consis-\ntent.\n\nThe scope of a node is de\ufb01ned as the set of variables that have indicators among the node\u2019s de-\nscendants. To \u201cappear in a child\u201d means to be among that child\u2019s descendants.\nIf a sum node\nis incomplete, the SPN will undercount the true marginals. Since an incomplete sum node has\nscope larger than a child, that child will be non-zero for more than one state of the sum (e.g.\nif\nS[x1, \u00afx1, x2, \u00afx2] = (x1 + x2), S[1, 0, 1, 1] < S[1, 0, 1, 0] + S[1, 0, 0, 1]). If a product node is incon-\nsistent, the SPN will overcount the marginals as it will incorporate impossible states (e.g. x1 \u00d7 \u00afx1)\ninto its computation.\nPoon and Domingos show how to generatively train the parameters of an SPN. One method is to\ncompute the likelihood gradient and optimize with gradient descent (GD). They also show how\nto use expectation maximization (EM) by considering each sum node as the marginalization of a\nhidden variable [17]. They found that online EM using most probable explanation (MPE or \u201chard\u201d)\ninference worked the best for their image completion task.\nGradient diffusion is a key issue in training deep models. It is commonly observed in neural net-\nworks that when the gradient is propagated to lower layers it becomes less informative [3]. When\nevery node in the network takes fractional responsibility for the errors of a top level node, it be-\ncomes dif\ufb01cult to steer parameters out of local minima. Poon and Domingos also saw this effect\nwhen using gradient descent and EM to train SPNs. They found that online hard EM could provide\na sparse but strong learning signal to synchronize the efforts of upper and lower nodes. Note that\nhard training is not exclusive to EM. 
In the next section we show how to discriminatively train SPNs\nwith hard gradient descent.\n\n3 Discriminative Learning of SPNs\nWe de\ufb01ne an SPN S[y, h|x] that takes as input three disjoint sets of variables H, Y, and X (hidden,\nquery, and given). We denote the setting of all h indicator functions to 1 as S[y, 1|x], where the\nbold 1 is a vector. We do not sum over states of given variables X when discriminatively training\nSPNs. Given an instance, we treat X as constants. This means that one ignores X variables in the\nscope of a node when considering completeness and consistency. Since adding a constant as a child\nto a product node cannot make that product inconsistent, a variable x can be the child of any product\nnode in a valid SPN. To maintain completeness, x can only be the child of a sum node that has scope\noutside of Y or H.\n\nAlgorithm 1: LearnSPN\nInput: Set D of instances over variables X and label variables Y, a valid SPN S with initialized parameters.\nOutput: An SPN with learned weights\nrepeat\n\nforall the d \u2208 D do\n\nUpdateWeights(S, Inference(S,xd,yd))\n\nuntil convergence or early stopping condition;\n\n3\n\n\fThe parameters of an SPN can be learned using an online procedure as in Algorithm 1 as proposed\nby Poon and Domingos. The three dimensions of the algorithm are generative vs. discriminative,\nthe inference procedure, and the weight update. Poon and Domingos discussed generative gradient\ndescent with marginal inference as well as EM with marginal and MPE inference. In this section we\nwill derive discriminative gradient descent with marginal and MPE inference, where hard gradient\ndescent can also be used for generative training. 
EM is not typically used for discriminative training\nas it requires modification to lower bound the conditional likelihood [25] and there may not be a\nclosed form for the M-step.\n\n3.1 Discriminative Training with Marginal Inference\n\nA component of the gradient of the conditional log likelihood takes the form\n\n$$\\frac{\\partial}{\\partial w} \\log P(y|x) = \\frac{\\partial}{\\partial w} \\log \\sum_{h} \\Phi(Y = y, H = h|x) - \\frac{\\partial}{\\partial w} \\log \\sum_{y', h} \\Phi(Y = y', H = h|x)$$\n\n$$= \\frac{1}{S[y, 1|x]} \\frac{\\partial S[y, 1|x]}{\\partial w} - \\frac{1}{S[1, 1|x]} \\frac{\\partial S[1, 1|x]}{\\partial w}$$\n\nwhere the two summations are separate bottom-up evaluations of the SPN with indicators set as\nS[y, 1|x] and S[1, 1|x], respectively.\nThe partial derivatives of the SPN with respect to all weights can be computed with backpropagation,\ndetailed in Algorithm 2. After performing a bottom-up evaluation of the SPN, partial derivatives are\npassed from parent to child as follows from the chain rule and described in [15]. The form of\nbackpropagation presented takes time linear in the number of nodes in the SPN if product nodes\nhave a bounded number of children.\nOur gradient descent update then follows the direction of the partial derivative of the conditional\nlog likelihood with learning rate \u03b7: $\\Delta w = \\eta \\frac{\\partial}{\\partial w} \\log P(y|x)$. After each gradient step we optionally\nrenormalize the weights of a sum node so they sum to one. Empirically we have found this to\nproduce the best results. The second SPN evaluation that marginalizes H and Y can reuse computation\nfrom the first, for example, when Y is modeled by a root sum node. In this case the values of all\nnon-root nodes are equivalent between the two evaluations. 
For any architecture, one can memoize\nvalues of nodes that do not have a query variable indicator as a descendant.\n\nAlgorithm 2: BackpropSPN\nInput: A valid SPN S, where S_n denotes the value of node n after bottom-up evaluation.\nOutput: Partial derivatives of the SPN with respect to every node, $\\partial S / \\partial S_n$, and every weight, $\\partial S / \\partial w_{i,j}$\nInitialize all $\\partial S / \\partial S_n = 0$ except $\\partial S / \\partial S = 1$\nforall the n \u2208 S in top-down order do\n    if n is a sum node then\n        forall the j \u2208 Ch(n) do\n            $\\partial S / \\partial S_j \\leftarrow \\partial S / \\partial S_j + w_{n,j} \\cdot \\partial S / \\partial S_n$\n            $\\partial S / \\partial w_{n,j} \\leftarrow S_j \\cdot \\partial S / \\partial S_n$\n    else\n        forall the j \u2208 Ch(n) do\n            $\\partial S / \\partial S_j \\leftarrow \\partial S / \\partial S_j + \\partial S / \\partial S_n \\cdot \\prod_{k \\in Ch(n) \\setminus \\{j\\}} S_k$\n\n3.2 Discriminative Training with MPE Inference\n\nThere are several reasons why MPE inference is appealing for discriminatively training SPNs. As\ndiscussed above, hard inference was crucial for overcoming gradient diffusion when generatively\ntraining SPNs. For many applications the goal is to predict the most probable structure, and therefore\nit makes sense to use this also during training. Finally, it is common to approximate summations\nwith maximizations for reasons of speed or tractability. Though summation in SPNs is fast and\nexact, MPE inference is still faster. We derive discriminative gradient descent using MPE inference.\n\nFigure 2: Positive and negative terms in the hard gradient. The root node sums out the variable Y,\nthe two sum nodes on the left sum out the hidden variable H1, the two sum nodes on the right sum\nout H2, and a circled \u2018f\u2019 denotes an input variable Xi. Dashed lines indicate negative elements in\nthe gradient.\n\nWe define a max-product network (MPN) M [y, h|x] based on the max-product semiring. 
This\nnetwork compactly represents the maximizer polynomial $\\max_x \\Phi(x) \\Pi(x)$, which computes the\nMPE [15]. To convert an SPN to an MPN, we replace each sum node by a max node, where weights\non children are retained. The gradient of the conditional log likelihood with MPE inference is then\n\n$$\\frac{\\partial}{\\partial w} \\log \\tilde{P}(y|x) = \\frac{\\partial}{\\partial w} \\log \\max_{h} \\Phi(Y = y, H = h|x) - \\frac{\\partial}{\\partial w} \\log \\max_{y', h} \\Phi(Y = y', H = h|x)$$\n\nwhere the two maximizations are computed by M [y, 1|x] and M [1, 1|x]. MPE inference also\nconsists of a bottom-up evaluation followed by a top-down pass. Inference yields a branching path\nthrough the SPN called a complete subcircuit that includes an indicator (and therefore assignment)\nfor every variable [15]. Analogous to Viterbi decoding, the path starts at the root node and at each\nmax (formerly sum) node it only travels to the max-valued child. At product nodes, the path branches\nto all children. We define W as the multiset of weights traversed by this path (see footnote 1). The value of the\nMPN takes the form of a product $\\prod_{w_i \\in W} w_i^{c_i}$, where c_i is the number of times w_i appears in W.\nThe partial derivatives of the MPN with respect to all nodes and weights are computed by Algorithm\n2 modified to accommodate MPNs: (1) S becomes M, (2) when n is a sum node, the body of the\nforall loop is run once for j as the max-valued child.\nThe partial derivative of the logarithm of an MPN with respect to a weight takes the form\n\n$$\\frac{\\partial \\log M}{\\partial w_i} = \\frac{\\partial \\log M}{\\partial M} \\frac{\\partial M}{\\partial w_i} = \\frac{1}{M} \\frac{\\partial M}{\\partial w_i} = \\frac{c_i \\cdot w_i^{c_i - 1} \\prod_{w_j \\in W \\setminus \\{w_i\\}} w_j^{c_j}}{\\prod_{w_j \\in W} w_j^{c_j}} = \\frac{c_i}{w_i}$$\n\nThe gradient of the conditional log likelihood with MPE inference is therefore $\\Delta c_i / w_i$, where\n$\\Delta c_i = c'_i - c''_i$ is the difference between the number of times w_i is traversed by the two MPE\ninference paths in M [y, 1|x] and M [1, 1|x], respectively. The hard gradient update is then\n$\\Delta w_i = \\eta \\frac{\\partial}{\\partial w_i} \\log \\tilde{P}(y|x) = \\eta \\frac{\\Delta c_i}{w_i}$.\n\nThe hard gradient for a training instance (xd, yd) is illustrated in Figure 2. In the first two\nexpressions, the complete subcircuit traveled by each MPE inference is shown in bold. Product nodes do\nnot have weighted children, so they do not appear in the gradient, depicted in the last expression.\nWe can also easily add regularization to SPN training. An L2 weight penalty takes the familiar\nform of $-\\lambda ||w||^2$ and partial derivatives $-2 \\lambda w_i$ can be added to the gradient. With an appropriate\noptimization method, an L1 penalty could also be used for learning with marginal inference on dense\nSPN architectures. However, sparsity is not as important for SPNs as it is for Markov random fields,\nwhere a non-zero weight can have outsize impact on inference time; with SPNs inference is always\nlinear with respect to model size.\nA summary of the variations of Algorithm 1 is provided in Tables 1 and 2. The generative hard\ngradient can be used in place of online EM for datasets where it would be prohibitive to store\ninference results from past epochs. For architectures that have high fan-in sum nodes, soft inference\nmay be able to separate groups of modes faster than hard inference, which can only alter one child\nof a sum node at a time.\nWe observe the similarity between the updates of hard EM and hard gradient descent. 
In particular,\nif we reparameterize the SPN so that each child of a sum node is weighted by $w_i = e^{w'_i}$, the form of\nthe partial derivative of the log MPN becomes\n\n$$\\frac{\\partial \\log M}{\\partial w'_i} = \\frac{1}{M} \\frac{\\partial M}{\\partial w'_i} = \\frac{c_i \\prod_{w'_j \\in W'} e^{c_j \\cdot w'_j}}{\\prod_{w'_j \\in W'} e^{c_j \\cdot w'_j}} = c_i$$\n\nThis means that the hard gradient update for weights in logspace is $\\Delta w'_i = \\Delta c_i$, which resembles\nstructured perceptron [13].\n\nFootnote 1: A consistent SPN allows for MPE inference to reach the same indicator more than once in the same\nbranching path.\n\nTable 1: Inference procedures\n\nNode | Soft Inference | Hard Inference\nSum | $\\frac{\\partial S}{\\partial S_n} = \\sum_{k \\in Pa(n)} \\frac{\\partial S}{\\partial S_k} \\prod_{l \\in Ch(k) \\setminus \\{n\\}} S_l$ | $\\frac{\\partial M}{\\partial M_n} = \\sum_{k \\in Pa(n)} \\frac{\\partial M}{\\partial M_k} \\prod_{l \\in Ch(k) \\setminus \\{n\\}} M_l$\nProduct | $\\frac{\\partial S}{\\partial S_n} = \\sum_{k \\in Pa(n)} \\frac{\\partial S}{\\partial S_k} w_{kn}$ | $\\frac{\\partial M}{\\partial M_n} = \\sum_{k \\in Pa(n)} \\frac{\\partial M}{\\partial M_k} \\begin{cases} w_{kn} & : w_{kn} \\in W \\\\ 0 & : \\text{otherwise} \\end{cases}$\nWeight | $\\frac{\\partial S}{\\partial w_{ki}} = \\frac{\\partial S}{\\partial S_k} S_i$ | $\\frac{\\partial M}{\\partial w_{ki}} = \\frac{\\partial M}{\\partial M_k} M_i$\n\nTable 2: Weight updates\n\nUpdate | Soft Inference | Hard Inference\nGen. GD | $\\Delta w = \\eta \\frac{\\partial S[x, y]}{\\partial w}$ | $\\Delta w_i = \\eta \\frac{c_i}{w_i}$\nGen. EM | $P(H_k = i \\mid x, y) \\propto w_{ki} \\frac{\\partial S[x, y]}{\\partial S_k}$ | $P(H_k = i \\mid x, y) = \\begin{cases} 1 & : w_{ki} \\in W \\\\ 0 & : \\text{otherwise} \\end{cases}$\nDisc. GD | $\\Delta w = \\eta \\left( \\frac{1}{S[y, 1 \\mid x]} \\frac{\\partial S[y, 1 \\mid x]}{\\partial w} - \\frac{1}{S[1, 1 \\mid x]} \\frac{\\partial S[1, 1 \\mid x]}{\\partial w} \\right)$ | $\\Delta w_i = \\eta \\frac{\\Delta c_i}{w_i}$\n\n4 Experiments\n\nWe have applied discriminative training of SPNs to image classification benchmarks. 
CIFAR-10\nand STL-10 are standard datasets for deep networks and unsupervised feature learning. Both are\n10-class small image datasets. We achieve the best results to date on both tasks.\nWe follow the feature extraction pipeline of Coates et al. [10], which was also used recently to\nlearn pooling functions [20]. The procedure consists of extracting 4 \u00d7 10^5 6x6 pixel patches from\nthe training set images, ZCA whitening those patches [19], running k-means for 50 rounds, and\nthen normalizing the dictionary to have zero mean and unit variance. We then use the dictionary\nto extract K features at every 6x6 pixel site in the image (unit stride) with the \u201ctriangle\u201d encoding\nf_k(x) = max{0, \u00afz \u2212 z_k}, where z_k = ||x \u2212 c_k||_2, c_k is the k-th item in the dictionary, and \u00afz is the\naverage z_k. For each image of CIFAR-10, for example, this yields a 27 \u00d7 27 \u00d7 K feature vector that\nis finally downsampled by max-pooling to a G \u00d7 G \u00d7 K feature vector.\nWe experiment with a simple architecture that allows for discriminative learning of local\nstructure. This architecture cannot be generatively trained as it violates consistency over X.\nInspired by the successful star models in Felzenszwalb et al. [18], we construct a network\nwith C classes, P parts per class, and T mixture components per part. A part is a pattern\nof image patch features that can occur anywhere in the image (e.g. an arrangement of\npatches that defines a curve). Each part filter $\\vec{f}_{cpt}$ is of dimension W \u00d7 W \u00d7 K and is\ninitialized to $\\vec{0}$. The root of the SPN is a sum node with a child S_c for each class c in the dataset\nmultiplied by the indicator for that state of the label variable Y. S_c is a product over P nodes\nS_cp, where each S_cp is a sum node over T nodes S_cpt.\n\nFigure 3: SPN architecture for experiments. Hidden variable indicators omitted for legibility.\n\nThe hidden variables H represent the choice of cluster in the mixture over a part and its\nposition (S_cp and S_cpt, respectively). Finally, S_cpt sums over positions i, j in the image of the logistic\nfunction $e^{\\vec{x}_{ij} \\cdot \\vec{f}_{cpt}}$ where the given variable $\\vec{x}_{ij}$ is the same dimension as f and parts can overlap.\nNotice that the mixture S_cp models an additional level of spatial structure on top of the image patch\nfeatures learned by k-means. Coates and Ng [12] also learn higher-order structure, but whereas our\nmethod learns structure discriminatively in the context of a parts-based model, their unsupervised\nalgorithm greedily groups features based on correlation and is unable to learn mixtures. Compared\nwith the pooling functions in Jia et al. [20] that model independent translation of patch features,\nour architecture models how nearby features move together. Other deep probabilistic architectures\nshould be able to model high-level structure, but considering the difficulty in training these models\nwith approximate inference, it is hard to make full use of their representational power. Unlike the\nstar model of Felzenszwalb et al. [18] that learns filters over predefined HOG image features, our\nSPN learns on top of learned image features that can model color and detailed patterns.\nGenerative SPN architectures on the same features produce unsatisfactory results as generative\ntraining is led astray by the large number of features, very few of which differentiate labels. In the\ngenerative SPN paper [23], continuous variables are modeled with univariate Gaussians at the leaves\n(viewed as a sum node with infinite children but finite weight sum). 
With discriminative training, X\ncan be continuous because we always condition on it, which effectively folds it into the weights.\nAll networks are learned with stochastic gradient descent regularized by early stopping. We found\nthat using marginal inference for the root node and MPE inference for the rest of the network worked\nbest. This allows the SPN to continue learning the difference between classes even when it correctly\nclassifies a training instance. The fractions of the training set reserved for validation with CIFAR-10\nand STL-10 were 10% and 20%, respectively. Learning rates, P, and T were chosen based on\nvalidation set performance.\n\n4.1 Results on CIFAR-10\nCIFAR-10 consists of 32x32 pixel images: 5 \u00d7 10^4 for training and 10^4 for testing. We first compare\ndiscriminative SPNs with other methods as we vary the size of the dictionary K. The results are\nseen in Figure 4. To fairly compare with recent work [10, 20] we also set G = 4. In general,\nwe observe that SPNs can achieve higher performance using half as many features as the next best\napproach, the learned pooling function. We hypothesize that this is because the SPN architecture\nallows us to discriminatively train large moveable parts, image structure that cannot be captured by\nlarger dictionaries. In Jia et al. [20] the pooling functions blur individual features (i.e. a 6x6 pixel\ndictionary item), from which the classifier may have trouble inferring the coordination of image\nparts.\nWe then experimented with a finer grid and fewer dictionary items (G = 7, K = 400). Pooling\nfunctions destroy information, so it is better if less is done before learning. Finer grids are less\nfeasible for the method in Jia et al. [20] as the number of rectangular pooling functions grows as\nO(G^4). 
Our best test accuracy of 83.96% was achieved with W = 3, P = 200, and T = 2, chosen\nby validation set performance. This architecture achieves the highest published test accuracy on the\nCIFAR-10 dataset, remarkably using one fifth the number of features of the next best approach. We\ncompare top CIFAR-10 results in Table 3, highlighting the dictionary size of systems that use the\nfeature extraction from Coates et al. [10].\n\nFigure 4: Impact of dictionary size K with a 4x4 pooling grid (W =3) on CIFAR-10 test accuracy\n[Plot: Performance on CIFAR-10; x-axis: Dictionary Size (200, 400, 800, 1600, 4000); y-axis: Accuracy (64 to 84); series: Discriminative SPN; Learned Pooling, Jia et al.; K-means (tri.), white, Coates et al.; Auto-encoder, raw, Coates et al.; RBM, whitened, Coates et al.]\n\nTable 3: Test accuracies on CIFAR-10.\n\nMethod | Dictionary | Accuracy\nLogistic Regression [24] | - | 36.0%\nSVM [5] | - | 39.5%\nSIFT [5] | - | 65.6%\nmcRBM [24] | - | 68.3%\nmcRBM-DBN [24] | - | 71.0%\nConvolutional RBM [10] | - | 78.9%\nK-means (Triangle) [10] | 4000, 4x4 grid | 79.6%\nHKDES [4] | - | 80.0%\n3-Layer Learned RF [12] | 1600, 9x9 grid | 82.0%\nLearned Pooling [20] | 6000, 4x4 grid | 83.11%\nDiscriminative SPN | 400, 7x7 grid | 83.96%\n\nTable 4: Comparison of average test accuracies on all folds of STL-10.\n\nMethod | Accuracy (\u00b1\u03c3)\n1-layer Vector Quantization [11] | 54.9% (\u00b1 0.4%)\n1-layer Sparse Coding [11] | 59.0% (\u00b1 0.8%)\n3-layer Learned Receptive Field [12] | 60.1% (\u00b1 1.0%)\nDiscriminative SPN | 62.3% (\u00b1 1.0%)\n\n4.2 Results on STL-10\n\nSTL-10 has larger 96x96 pixel images and less labeled data (5,000 training and 8,000 test) than\nCIFAR-10 [10]. The training set is mapped to ten predefined folds of 1,000 images. We\nexperimented on the STL-10 dataset in a manner similar to CIFAR-10, ignoring the 10^5 items of unlabeled\ndata. Ten models were trained on the pre-specified folds, and test accuracy is reported as an\naverage. 
With K=1600, G=8, W =4, P =10, and T =3 we achieved 62.3% (\u00b1 1.0% standard deviation\namong folds), the highest published test accuracy as of writing. Notably, this includes approaches\nthat make use of the unlabeled training images. Like Coates and Ng [12], our architecture learns\nlocal relations among different feature maps. However, the SPN is able to discriminatively learn\nlatent mixtures, which can encode a more nuanced decision boundary than the linear classi\ufb01er used\nin their work. After we carried out our experiments, Bo et al. [6] reported a higher accuracy with\ntheir unsupervised features and a linear SVM. Just as with the features of Coates et al. [10], we\nanticipate that using an SPN instead of the SVM would be bene\ufb01cial by learning spatial structure\nthat the SVM cannot model.\n\n5 Conclusion\n\nSum-product networks are a new class of probabilistic model where inference remains tractable de-\nspite high treewidth and many hidden layers. This paper introduced the \ufb01rst algorithms for learning\nSPNs discriminatively, using a form of backpropagation to compute gradients. Discriminative train-\ning allows for a wider variety of SPN architectures than generative training, because completeness\nand consistency do not have to be maintained over evidence variables. We proposed both \u201csoft\u201d\nand \u201chard\u201d gradient algorithms, using marginal inference in the \u201csoft\u201d case and MPE inference in\nthe \u201chard\u201d case. The latter successfully combats the diffusion problem, allowing deep networks to\nbe learned. 
Experiments on image classi\ufb01cation benchmarks illustrate the power of discriminative\nSPNs.\nFuture research directions include applying other discriminative learning paradigms to SPNs (e.g.\nmax-margin methods), automatically learning SPN structure, and applying discriminative SPNs to\na variety of structured prediction problems.\nAcknowledgments: This research was partly funded by ARO grant W911NF-08-1-0242, AFRL\ncontract FA8750-09-C-0181, NSF grant IIS-0803481, and ONR grant N00014-12-1-0312. The\nviews and conclusions contained in this document are those of the authors and should not be inter-\npreted as necessarily representing the of\ufb01cial policies, either expressed or implied, of ARO, AFRL,\nNSF, ONR, or the United States Government.\n\n8\n\n\fReferences\n[1] M. Amer and S. Todorovic. Sum-product networks for modeling activities with stochastic structure.\n\nCVPR, 2012.\n\n[2] F. Bach and M.I. Jordan. Thin junction trees. Advances in Neural Information Processing Systems,\n\n14:569\u2013576, 2002.\n\n[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127,\n\n2009.\n\n[4] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Computer\n\nVision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1729\u20131736. IEEE, 2011.\n\n[5] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. Advances in Neural Information\n\nProcessing Systems, 2010.\n\n[6] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. ISER,\n\n2012.\n\n[7] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-speci\ufb01c independence in bayesian\nnetworks. In Proceedings of the Twelfth Conference on Uncertainty in Arti\ufb01cial Intelligence, pages 115\u2013\n123, 1996.\n\n[8] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. 
Artificial Intelligence, 172(6-7):772\u2013799, 2008.\n\n[9] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.\n\n[10] A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In aistats11. Society for Artificial Intelligence and Statistics, 2011.\n\n[11] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, volume 8, page 10, 2011.\n\n[12] A. Coates and A.Y. Ng. Selecting receptive fields in deep networks. NIPS, 2011.\n\n[13] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1\u20138, Philadelphia, PA, 2002. ACL.\n\n[14] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50:280\u2013305, 2003.\n\n[15] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.\n\n[16] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Proceedings of the 25th Conference on Neural Information Processing Systems, 2011.\n\n[17] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1\u201338, 1977.\n\n[18] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1\u20138. IEEE, 2008.\n\n[19] A. Hyv\u00e4rinen and E. Oja. Independent component analysis: algorithms and applications. 
Neural networks, 13(4-5):411\u2013430, 2000.\n\n[20] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.\n\n[21] A. Kulesza, F. Pereira, et al. Structured learning with approximate inference. Advances in Neural Information Processing Systems, 20:785\u2013792, 2007.\n\n[22] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282\u2013289, Williamstown, MA, 2001. Morgan Kaufmann.\n\n[23] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pages 337\u2013346, 2011.\n\n[24] M.A. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2551\u20132558. IEEE, 2010.\n\n[25] J. Saloj\u00e4rvi, K. Puolam\u00e4ki, and S. Kaski. Expectation maximization algorithms for conditional likelihoods. In Proceedings of the 22nd international conference on Machine learning, pages 752\u2013759. ACM, 2005.\n", "award": [], "sourceid": 4516, "authors": [{"given_name": "Robert", "family_name": "Gens", "institution": null}, {"given_name": "Pedro", "family_name": "Domingos", "institution": null}]}