{"title": "Learning Higher-Order Graph Structure with Features by Structure Penalty", "book": "Advances in Neural Information Processing Systems", "page_first": 253, "page_last": 261, "abstract": "In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y \u2019s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to mul- tivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure.", "full_text": "Learning Higher-Order Graph Structure with\n\nFeatures by Structure Penalty\n\nShilin Ding1\u2217, Grace Wahba1,2,3\u2217, and Xiaojin Zhu2\u2217\n\nDepartment of {1Statistics, 2Computer Sciences, 3Biostatistics and Medical Informatics}\n\nUniversity of Wisconsin-Madison, WI 53705\n\n{sding, wahba}@stat.wisc.edu, jerryzhu@cs.wisc.edu\n\nAbstract\n\nIn discrete undirected graphical models, the conditional independence of node\nlabels Y is speci\ufb01ed by the graph structure. We study the case where there is\nanother input random vector X (e.g. 
observed features) such that the distribution\nP (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y\u2019s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure.\n\n1 Introduction\n\nIn undirected graphical models (UGMs), a graph is defined as G = (V, E), where V = {1, \u00b7\u00b7\u00b7 , K} is the set of nodes and E \u2282 V \u00d7 V is the set of edges between the nodes. The graph structure specifies the conditional independence among nodes. Much prior work has focused on graphical model structure learning without conditioning on X. For instance, Meinshausen and B\u00fchlmann [1] and Peng et al. [2] studied sparse covariance estimation of Gaussian Markov Random Fields. The covariance matrix fully determines the dependence structure in the Gaussian distribution. But this is not the case for non-elliptical distributions, such as the discrete UGMs. Ravikumar et al. [3] and H\u00f6fling and Tibshirani [4] studied variable selection of Ising models based on the l1 penalty. Ising models are special cases of discrete UGMs with (usually) only pairwise interactions, and without features. We focus on discrete UGMs with both higher order interactions and features. 
It is important to note that the graph structure may change conditioned on different X\u2019s; thus our approach may lead to better estimates and interpretation.\n\nIn addressing the problem of structure learning with features, Liu et al. [5] assumed Gaussian distributed Y given X, and they partitioned the space of X into bins. Schmidt et al. [6] proposed a framework to jointly learn pairwise CRFs and parameters with block-l1 regularization. Bradley and Guestrin [7] learned a tree CRF that recovers a maximum spanning tree of a complete graph based on heuristic pairwise link scores. These methods utilize only pairwise information to scale to large graphs. The closest work is Schmidt and Murphy [8], which examined the higher-order graphical structure\n\n\u2217SD wishes to acknowledge the valuable comments from Stephen J. Wright and Sijian Wang. Research of SD and GW is supported in part by NIH Grant EY09946, NSF Grant DMS-0906818 and ONR Grant N0014-09-1-0655. Research of XZ is supported in part by NSF IIS-0953219, IIS-0916038.\n\nlearning problem without considering features. They used an active set method to learn higher order interactions in a greedy manner. Their model is over-parameterized, and the hierarchical assumption is sufficient but not necessary for conditional independence in the graph.\n\nTo the best of our knowledge, no previous work has addressed the issue of graph structure learning at all orders while conditioning on input features. Our contributions include a reparameterization of UGMs with binary outcomes into multivariate Bernoulli (MVB) models. The set of conditional log odds ratios in MVB models is complete to represent the effects of features on responses and their interactions at all levels. 
The sparsity in this set of functions is sufficient and necessary for the conditional independence structure of the graph: two nodes are conditionally independent iff the pairwise interaction is constant zero, and a higher order interaction among a subset of nodes means none of the variables is separable from the others in the joint distribution.\n\nTo obtain a sparse graph structure, we impose a Structure Lasso (SLasso) penalty on groups of functions with overlaps. SLasso can be viewed as group lasso with overlaps. Group lasso [9] leads to selection of variables in groups. Jacob et al. [10] considered the penalty on groups with arbitrary overlaps. Zhao et al. [11] set up the general framework for hierarchical variable selection with overlapping groups, which we adopt here for the functions. Our groups are designed to shrink higher order interactions, similar to the hierarchical inclusion restriction in Schmidt and Murphy [8]. We give a proximal linearization algorithm that efficiently learns the complete model. Global convergence is guaranteed [12]. We then propose a greedy search algorithm to scale our method up to large graphs, as the number of parameters grows exponentially.\n\n2 Conditional Independence in Discrete Undirected Graphical Models\n\nIn this section, we first discuss the relationship between the multivariate Bernoulli (MVB) model and the UGM whose nodes are binary, i.e. Yi = 0 or 1. At the end, we give the representation of the general discrete UGM where Yi takes values in {0, \u00b7\u00b7\u00b7 , m \u2212 1}. In UGMs, the distribution of multivariate discrete random variables Y1, . . . , YK given X is:\n\nP (Y1 = y1, . . . , YK = yK | X) = (1/Z(X)) \u220f_{C\u2208C} \u03a6C(yC; X)   (1)\n\nwhere Z(X) is the normalization factor. The distribution is factorized according to the cliques in the graph. A clique C \u2286 \u2126 = {1, . . . , K} is a set of nodes that are fully connected. 
\u03a6C(yC; X) is the potential function on C, indexed by yC = (yi)i\u2208C. This factorization follows from the Markov property: any two nodes not in a clique are conditionally independent given the others [13]. So C does not have to comply with the graph structure, as long as it is sufficient. For example, the most general choice for any given graph is C = {\u2126}. See Theorem 2.1 and Example 2.1 for details.\n\n(a) Graph 1  (b) Graph 2  (c) Graph 3  (d) Graph 4\n\nFigure 1: Graphical model examples.\n\nGiven the graph structure, the potential functions characterize the distribution on the graph. But if the graph is unknown in advance, estimating the potential functions on all possible cliques tends to be over-parameterized [8]. Furthermore, log \u03a6C(yC; X) = 0 is sufficient for the conditional independence among the nodes but not necessary (see Example 2.1). To avoid these problems, we introduce the MVB model that is equivalent to (1) with binary nodes, i.e. Yi = 0 or 1. The MVB distribution is:\n\nP (Y1 = y1, . . . , YK = yK | X = x) = exp{ \u2211_{\u03c9\u2208\u03a8K} y\u03c9 f \u03c9 \u2212 b(f ) } = exp{ y1 f 1(x) + \u00b7\u00b7\u00b7 + yK f K(x) + \u00b7\u00b7\u00b7 + y1y2 f 1,2(x) + \u00b7\u00b7\u00b7 + y1 \u00b7\u00b7\u00b7 yK f 1,...,K(x) \u2212 b(f ) }   (2)\n\nHere, we use the following notation. Let \u03a8K denote the power set of \u2126 = {1, . . . , K} with the empty set removed; its 2^K \u2212 1 elements index the f \u03c9\u2019s in (2). Let \u03c9 denote a set in \u03a8K, and define Y = (y1, \u00b7\u00b7\u00b7 , y\u03c9, \u00b7\u00b7\u00b7 , y\u2126) to be the augmented response with y\u03c9 = \u220f_{i\u2208\u03c9} yi. And f = (f 1, . . . , f \u03c9, . . . , f \u2126) is the vector of conditional log odds ratios [14]. We assume f \u03c9 is in a Reproducing Kernel Hilbert Space (RKHS) H\u03c9 with kernel K \u03c9 [15]. 
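As a concrete numerical check of the MVB distribution (2), the following minimal sketch (not from the paper; the f \u03c9 values are hypothetical constants standing in for f \u03c9(x) evaluated at a fixed x) enumerates all 2^K configurations to compute b(f ) and the resulting probabilities:

```python
import itertools
import math

def mvb_prob(y, f, K):
    """P(Y = y | x) under the MVB model (2); f maps each nonempty
    subset omega of {0,...,K-1} to the scalar f^omega(x)."""
    def score(yy):
        # sum over omega of y^omega * f^omega, where y^omega = prod_{i in omega} y_i
        return sum(val for w, val in f.items() if all(yy[i] for i in w))
    # b(f) = log Z(x): log-sum-exp over all 2^K configurations
    b = math.log(sum(math.exp(score(yy))
                     for yy in itertools.product([0, 1], repeat=K)))
    return math.exp(score(y) - b)

# hypothetical values: one main effect and one pairwise interaction, K = 3
f = {frozenset({0}): 1.0, frozenset({0, 1}): 0.5}
total = sum(mvb_prob(y, f, 3) for y in itertools.product([0, 1], repeat=3))
print(round(total, 6))  # probabilities sum to 1
```

Since no f \u03c9 here couples the third node to the others, its marginal is exactly 1/2, matching the conditional independence reading of Theorem 2.1.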
For example, in our simulation we choose f \u03c9 to be a B-spline (see the supplementary material). We focus on estimating the set of f \u03c9(x) with feature x, where the sparsity in the set specifies the graph structure.\nWe present the following lemma and theorem, which show the equivalence between the UGM and the MVB model:\nLemma 2.1. In an MVB model, define the odd-even partition of the power set of \u03c9 as: \u03a8\u03c9_odd = {\u03ba \u2286 \u03c9 | |\u03ba| = |\u03c9| \u2212 k, where k is odd}, and \u03a8\u03c9_even = {\u03ba \u2286 \u03c9 | |\u03ba| = |\u03c9| \u2212 k, where k is even}. Note |\u03a8\u03c9_odd| = |\u03a8\u03c9_even| = 2^{|\u03c9|\u22121}. The following property holds:\n\nf \u03c9 = log [ \u220f_{\u03ba\u2208\u03a8\u03c9_even} P (Yi = 1, i \u2208 \u03ba; Yj = 0, j \u2208 \u2126\\\u03ba | X) / \u220f_{\u03ba\u2208\u03a8\u03c9_odd} P (Yi = 1, i \u2208 \u03ba; Yj = 0, j \u2208 \u2126\\\u03ba | X) ],   b(f ) = log [ Z(x) / \u220f_{C\u2208C} \u03a6C(0; x) ]   (3)\n\nTheorem 2.1. A UGM of the general form (1) with binary nodes is equivalent to an MVB model of (2). In addition, the following are equivalent: 1) there is no |C|-order interaction in {Yi, i \u2208 C}; 2) there is no clique C \u2208 \u03a8K in the graph; 3) f \u03c9 = 0 for all \u03c9 such that C \u2286 \u03c9.\n\nA proof is given in the Appendix. It states that there is a clique C in the graph iff there is \u03c9 \u2287 C with f \u03c9 \u2260 0 in the MVB model. The advantage of modeling by MVB is that the sparsity in the f \u03c9\u2019s is sufficient and necessary for the conditional independence in the graph, thus fully specifying the graph structure. Specifically, Yi, Yj are conditionally independent iff f \u03c9 = 0 for all \u03c9 \u2287 {i, j}. This shows the interaction is non-zero iff all the nodes involved are not conditionally independent.\nExample 2.1. When K = 2, \u2126 = {1, 2}, C = {\u2126}, denote \u03a6\u2126(Y1 = 1, Y2 = 1; X) as \u03a611 for simplicity; then P (Y1 = 1, Y2 = 1 | X) = (1/Z)\u03a611. 
Define \u03a610, \u03a601, \u03a600 similarly; then the distribution with the UGM parameterization is determined. The relation between the UGM and the MVB model is\n\nf 1 = log(\u03a610/\u03a600),   f 2 = log(\u03a601/\u03a600),   f 1,2 = log( \u03a611 \u00b7 \u03a600 / (\u03a601 \u00b7 \u03a610) )\n\nNote, the independence between Y1 and Y2 implies: f 1,2 = 0, or \u03a611 \u00b7 \u03a600 = \u03a601 \u00b7 \u03a610. Therefore, f 1,2 being zero in the MVB model is sufficient and necessary for the conditional independence in the model. On the other hand, log \u03a6C = 0 is a sufficient condition but not a necessary one.\n\nThe distribution of a general discrete UGM where Yk \u2208 {0, \u00b7\u00b7\u00b7 , m \u2212 1} can be extended from (2).\nLemma 2.2. Let V = {1, . . . , m \u2212 1}, y\u03c9 = (yi)i\u2208\u03c9; then\n\nP (Y1 = y1, \u00b7\u00b7\u00b7 , YK = yK | X) = exp{ \u2211_{\u03c9\u2208\u03a8K} \u2211_{v\u2208V^{|\u03c9|}} I(y\u03c9 = v) f \u03c9_v \u2212 b(f ) }   (4)\n\nwhere I is an indicator function and V^n is the n-fold tensor product of V. Each f \u03c9 is a vector of length |V|^{|\u03c9|}.\n\n3 Structure Penalty\n\nIn many applications, the assumption is that the graph has very few large cliques. Similar to the hierarchical inclusion restriction in Schmidt and Murphy [8], we will include a higher order interaction only when all its subsets are included. Our model is very flexible in that f \u03c9(x) can be in an arbitrary RKHS.\nLet y(i) = (y1(i), . . . , yK(i)), x(i) = (x1(i), . . . , xp(i)) be the ith data point. There are |\u03a8K| = 2^K \u2212 1 functions in total. We first consider learning the full model when K is small, and later propose a greedy search algorithm to scale to large graphs. 
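Before turning to the penalty, the UGM-to-MVB conversion of Example 2.1 can be verified numerically; the potential values below are hypothetical, and the general formula (3) reduces here to the three log odds ratios of the example:

```python
import math

# hypothetical potentials Phi_{y1 y2} for the single clique C = {1, 2}
Phi = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 3.0, (1, 1): 1.5}
Z = sum(Phi.values())

# conditional log odds ratios from Example 2.1
f1 = math.log(Phi[1, 0] / Phi[0, 0])
f2 = math.log(Phi[0, 1] / Phi[0, 0])
f12 = math.log(Phi[1, 1] * Phi[0, 0] / (Phi[0, 1] * Phi[1, 0]))

# b(f) = log(Z / Phi_00): log-sum-exp over the four configurations
b = math.log(sum(math.exp(a * f1 + c * f2 + a * c * f12)
                 for a in (0, 1) for c in (0, 1)))

# the MVB parameterization reproduces every UGM probability Phi / Z
for (y1, y2), phi in Phi.items():
    p_mvb = math.exp(y1 * f1 + y2 * f2 + y1 * y2 * f12 - b)
    assert abs(p_mvb - phi / Z) < 1e-12
```

Setting \u03a611 \u00b7 \u03a600 = \u03a601 \u00b7 \u03a610 makes f12 vanish exactly, recovering the independence criterion stated in the example.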
The penalized log likelihood model is:\n\nmin I\u03bb(f ) = L(f ) + \u03bbJ(f ) = \u2211_{i=1}^{n} ( \u2212Y(i)^T f (x(i)) + b(f ) ) + \u03bbJ(f )   (5)\n\nwhere L(f ) is the negative log likelihood and J(\u00b7) is the structure penalty. The hierarchical assumption is that if there is no interaction on clique C, then all f \u03c9 should be zero for \u03c9 \u2287 C. The penalty is designed to shrink such f \u03c9 toward zero. We consider the Structure Lasso (SLasso) penalty guided by the lattice in Figure 2. The lattice T has 2^K \u2212 1 nodes: 1, . . . , \u03c9, . . . , \u2126. There is an edge from \u03c91 to \u03c92 if and only if \u03c91 \u2282 \u03c92 and |\u03c91| + 1 = |\u03c92|. Jenatton et al. [16] discussed how to define the groups to achieve different nonzero patterns.\n\nFigure 2: Hierarchical lattice for penalty\n\nLet Tv = {\u03c9 \u2208 \u03a8K | v \u2286 \u03c9} be the subgraph rooted at v in T , including all the descendants of v. Denote f Tv = (f \u03c9)\u03c9\u2208Tv. All the functions are categorized into groups with overlaps as (T1, . . . , T\u2126). The SLasso penalty on the group Tv is J(f Tv ) = pv \u221a( \u2211_{\u03c9\u2208Tv} \u2016f \u03c9\u2016^2_{H\u03c9} ), where pv is the weight for the penalty on Tv, empirically chosen as 1/|Tv|. Then, the objective is:\n\nmin_f I\u03bb(f ) = L(f ) + \u03bb \u2211_v pv \u221a( \u2211_{\u03c9\u2208Tv} \u2016f \u03c9\u2016^2_{H\u03c9} )   (6)\n\nThe following theorem shows that by minimizing the objective (6), f \u03c91 will enter the model before f \u03c92 if \u03c91 \u2282 \u03c92. That is to say, if f \u03c91 is zero, there will be no higher order interactions on \u03c92. It is an extension of Theorem 1 in Zhao et al. [11], and the proof is given in the Appendix.\nTheorem 3.1. Objective (6) is convex; thus the minimum is attainable. Let \u03c91, \u03c92 \u2208 \u03a8K and \u03c91 \u2282 \u03c92. 
If \u02c6f is the minimizer of (6) given the observations, that is, 0 \u2208 \u2202I\u03bb( \u02c6f ), the subgradient of I\u03bb at \u02c6f , then \u02c6f \u03c92 = 0 almost surely if \u02c6f \u03c91 = 0.\nExample 3.1. If K = 3, f = (f 1, f 2, f 3, f 1,2, f 1,3, f 2,3, f 1,2,3). The group at node 1 in Figure 2 is f T1 = (f 1, f 1,2, f 1,3, f 1,2,3) and J(f T1 ) = p1 \u221a( \u2016f 1\u2016^2 + \u2016f 1,2\u2016^2 + \u2016f 1,3\u2016^2 + \u2016f 1,2,3\u2016^2 ).\n\n4 Parameter Estimation\n\nIn this section, we discuss parameter estimation where the \u03c9th function space is linear, H\u03c9 = {1} \u2295 H\u03c9_1, for simplicity. {1} refers to the constant function space, and H\u03c9_1 is an RKHS with a linear kernel. The functions in H\u03c9 have the form f \u03c9(x) = c\u03c9_0 + \u2211_{j=1}^{p} c\u03c9_j xj. Its norm is \u2016f \u03c9\u2016_{H\u03c9} = \u2016c\u03c9\u2016, where \u2016\u00b7\u2016 stands for the Euclidean l2 norm. Here, we denote c\u03c9 = (c\u03c9_0, . . . , c\u03c9_p)^T \u2208 R^{p+1} as a vector of length p + 1, and c = (c\u03c9)\u03c9\u2208\u03a8K \u2208 R^{\u02dcp} is the concatenated vector of all parameters, of length \u02dcp = (p + 1) \u00b7 |\u03a8K|. Let cTv = (c\u03c9)\u03c9\u2208Tv be a (p + 1) \u00b7 |Tv| vector; then the objective (6) is now:\n\nmin_c I\u03bb(c) = L(c) + \u03bb \u2211_v pv \u2016cTv\u2016   (7)\n\n4.1 Estimating the complete model on small graphs\n\nMany applications do not involve a large number of responses, so it is desirable to learn the complete model when the graph is small for consistency reasons. 
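For concreteness, the overlapping groups Tv and the penalty in (7) can be computed directly for small K. The sketch below assumes linear f \u03c9 with coefficient vectors c\u03c9 and the weight pv = 1/|Tv| (one reading of the weight choice above); the coefficient values are toy numbers, not from the paper:

```python
import itertools
import math

K = 3
# Psi_K: all nonempty subsets of {1, ..., K}
Psi = [frozenset(s) for r in range(1, K + 1)
       for s in itertools.combinations(range(1, K + 1), r)]

def T(v):
    """T_v = {omega in Psi_K : v subseteq omega}, the subtree rooted at v."""
    return [w for w in Psi if v <= w]

def slasso_penalty(c, lam):
    """lam * sum_v p_v * ||c_{T_v}||_2 as in (7), with p_v = 1/|T_v|."""
    total = 0.0
    for v in Psi:
        Tv = T(v)
        norm = math.sqrt(sum(sum(x * x for x in c[w]) for w in Tv))
        total += norm / len(Tv)
    return lam * total

# toy coefficients: only the main effect c^{1} is nonzero
c = {w: [0.0, 0.0] for w in Psi}
c[frozenset({1})] = [1.0, -2.0]
# only the group T_{1} (size 4) has nonzero norm sqrt(5)
print(round(slasso_penalty(c, lam=0.5), 4))  # → 0.2795
```

Zeroing c^{1} would simultaneously zero the only group norm through which its descendants are felt, which is exactly the hierarchical shrinkage that Theorem 3.1 formalizes.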
We propose a method to optimize (7) of the complete model with all interaction levels by iteratively solving the following proximal linearization problem, as discussed in Wright [12]:\n\nmin_c Lk + \u2207Lk^T (c \u2212 ck) + (\u03b1k/2) \u2016c \u2212 ck\u2016^2 + \u03bbJ(c)   (8)\n\nwhere Lk = L(ck), and \u03b1k is a positive scalar chosen adaptively at the kth step. With slight abuse of notation, we denote ck as the value of c at the kth step. Algorithm 1 summarizes the framework for solving (7).\n\nAlgorithm 1 Proximal Linearization Algorithm\n\nInput: c0, \u03b10, \u03b6 > 1, tol > 0\nrepeat\n  Choose \u03b1k \u2208 [\u03b1min, \u03b1max]\n  Solve Eq (8) for dk = c \u2212 ck\n  while \u03b4k = I\u03bb(ck) \u2212 I\u03bb(ck + dk) < \u2016dk\u2016^3 do\n    // Insufficient decrease\n    Set \u03b1k = max(\u03b1min, \u03b6\u03b1k)\n    Solve Eq (8) for dk\n  end while\n  Set \u03b1k+1 = \u03b1k/\u03b6\n  Set ck+1 = ck + dk\nuntil \u03b4k < tol\n\nFollowing the analysis in Wright [12], we can ensure that the proximal linearization algorithm will converge for the negative log-likelihood loss function with the SLasso penalty.\n\nHowever, solving group lasso with overlaps is not trivial due to the non-smoothness at the singular point. In recent years, several papers have addressed this problem. Jacob et al. [10] duplicated the design matrix columns that appear in group overlaps, then solved the problem as group lasso without overlaps. Kim and Xing [17] reparameterized the group norm with additional dummy variables. They alternately optimized the model parameters and the dummy ones at each step. This is efficient for the quadratic loss function on Gaussian data, but might not scale well in our case. Instead, we solve (8) by its smooth and convex dual problem [18]. The details are in the supplementary material.\n\n4.2 Estimating large graphs\n\nThe above algorithm is efficient on small graphs (K < 20). It usually terminates within 20 iterations in our experiments. 
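A simplified sketch of the proximal linearization loop of Algorithm 1 follows. For illustration it uses a quadratic loss and non-overlapping groups, so the subproblem (8) splits into closed-form block soft-thresholding steps; the paper's actual subproblem has overlapping groups and is solved via its smooth dual [18]. All names and values here are illustrative:

```python
import numpy as np

def group_norms(c, groups):
    return sum(np.linalg.norm(c[g]) for g in groups)

def prox_group(z, t):
    # argmin_d 0.5*||d - z||^2 + t*||d||_2  (block soft-threshold)
    nz = np.linalg.norm(z)
    return z * max(0.0, 1.0 - t / nz) if nz > 0 else z

def prox_linear(loss, grad, groups, c0, lam, alpha=1.0, zeta=2.0,
                tol=1e-10, max_iter=500):
    c = c0.copy()
    for _ in range(max_iter):
        g = grad(c)
        while True:
            # solve the linearized subproblem (8); groupwise prox, groups disjoint here
            c_new = np.array(c)
            for idx in groups:
                c_new[idx] = prox_group(c[idx] - g[idx] / alpha, lam / alpha)
            d = c_new - c
            delta = (loss(c) + lam * group_norms(c, groups)
                     - loss(c_new) - lam * group_norms(c_new, groups))
            if delta >= np.linalg.norm(d) ** 3:  # sufficient-decrease test of Algorithm 1
                break
            alpha *= zeta  # insufficient decrease: larger alpha, shorter step
        c, alpha = c_new, max(alpha / zeta, 1e-4)
        if delta < tol:
            break
    return c

# toy problem: the weak group is shrunk exactly to zero
target = np.array([2.0, 2.0, 0.1, -0.1])
loss = lambda c: 0.5 * np.sum((c - target) ** 2)
grad = lambda c: c - target
groups = [np.array([0, 1]), np.array([2, 3])]
c_hat = prox_linear(loss, grad, groups, np.zeros(4), lam=0.5)
print(bool(np.all(c_hat[2:] == 0)))  # → True
```

The backtracking mirrors the inner while-loop of Algorithm 1: \u03b1 is increased until the decrease test passes, then relaxed after a successful step.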
However, the issue with estimating a complete model is the exponential number of f \u03c9\u2019s, and the same number of groups, involved in objective (7). It is intractable when the graph becomes large. The hierarchical assumption and the SLasso penalty lend themselves naturally to a greedy search algorithm:\n\n1. Start from the set of main effects, A0 = {f 1, \u00b7\u00b7\u00b7 , f K}.\n2. In step i, remove the nodes that are not in Ai from the lattice in Figure 2. Obtain a sparse estimation of the functions in Ai by Algorithm 1. Denote the resulting sparse set A\u2032i.\n3. Let Ai+1 = A\u2032i. Keep adding a higher order interaction into Ai+1 if all its subsets of interactions are included in A\u2032i, and also add this node into the lattice in Figure 2.\n\nIterate steps 2 and 3 until convergence. The algorithm is similar to the active set method in Schmidt and Murphy [8]. It uses multiple runs of Algorithm 1 to enforce the hierarchical assumption. It is not guaranteed to converge to the global optimum. Nonetheless, our empirical experiments show its ability to scale to large graphs.\n\n5 Experiments\n\n5.1 Toy Data\n\nIn the simulation, we create 6 toy graphs. The first four graphs are depicted in Figure 1. Graph 5 has 100 nodes, where the first 8 nodes have the same structure as in Figure 1(c) and the others are independent. Graph 6 also has 100 nodes, where the first 10 nodes have the same connections as in Figure 1(d) and the others are independent. We generate 100 datasets for each structure to evaluate the performance. The sample size of each dataset is 1000. Here is how the first data set is generated: the length of the feature vector, p, is set to 5 in our experiment, i.e. X = (X1, . . . , X5). 
Each f \u03c9(x) = c\u03c9_0 + \u2211_{j=1}^{5} g\u03c9_j(xj), where g\u03c9_j(xj) = \u2211_{k=1}^{D} c\u03c9_{jk} Bk(xj), is spanned by the B-spline basis functions {Bk(\u00b7)}k=1,\u00b7\u00b7\u00b7,D (see the supplementary material), where D is chosen to be 5. The true set of the model parameters, c\u03c9_{jk}, is uniformly sampled from {\u22125, \u22124, \u00b7\u00b7\u00b7 , 5}. We set the intercepts c\u03c9_0 in main effects to 1, and those in second or higher order interactions to 2. The features, Xj, are i.i.d. uniform on [\u22121, 1]. Then, Y is sampled according to the probability in equation (2).\nWe use GACV (generalized approximate cross validation) and BGACV (B-type GACV) [19] to choose the regularization parameter \u03bb for the complete model (graphs 1-4). We call these variants of SLasso Complete-GACV and Complete-BGACV. We use AIC for the greedy search (Greedy-AIC) in graphs 5 and 6 due to computational considerations. The range of \u03bb is chosen according to Koh et al. [20]. The details of the tuning methods are discussed in the supplementary material. 
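The sampling step above (drawing Y from (2) given x) can be sketched for small K by exact enumeration. For simplicity, the sketch uses linear f \u03c9 in place of the B-spline expansion, with made-up coefficients:

```python
import itertools
import math
import random

def sample_y(x, f, K, rng):
    """Draw Y from the MVB model (2) by enumerating all 2^K configurations.
    f[w] is a coefficient vector; f^w(x) is linear in x for illustration."""
    configs = list(itertools.product([0, 1], repeat=K))
    def score(y):
        # sum over omega of y^omega * f^omega(x)
        return sum(sum(cj * xj for cj, xj in zip(f[w], x))
                   for w in f if all(y[i] for i in w))
    weights = [math.exp(score(y)) for y in configs]  # unnormalized; Z cancels
    return rng.choices(configs, weights=weights)[0]

rng = random.Random(0)
# made-up linear coefficients: two main effects and one pairwise interaction
f = {frozenset({0}): [1.0, 0.0],
     frozenset({1}): [0.0, 1.0],
     frozenset({0, 1}): [2.0, -2.0]}
x = [rng.uniform(-1, 1) for _ in range(2)]
y = sample_y(x, f, K=3, rng=rng)
print(y in set(itertools.product([0, 1], repeat=3)))  # → True
```

Enumeration is exact but only feasible for small K; for the 100-node graphs 5 and 6 one would need an approximate sampler instead.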
The R package, BMN, is used as a baseline [4].\n\nTable 1: Number of true positive and false positive functions\n\nGraph  Method          f 1,2  f 1,3  f 2,3  f 3,4  f 1,2,3  f 5,7,8  f 5,6,7,8    FP\n1      BMN                60     76     70     60        0        -          -   162\n1      Complete-GACV     100    100    100     94       84        -          -   136\n1      Complete-BGACV     86     83     83     72       14        -          -    11\n2      BMN                44     50     38     58        0        -          -   412\n2      Complete-GACV     100     99    100     99       83        -          -   341\n2      Complete-BGACV     88     91     88     78       33        -          -    64\n3      BMN                72     64     60     60        0        0          0   830\n3      Complete-GACV      91     87     81     92       62       71         33   412\n3      Complete-BGACV     36     22     23     93        0       39          0   162\n4      BMN                48     34     37     29        0        0          -   774\n4      Complete-GACV      92     98     94     90       54       45          -   693\n4      Complete-BGACV     68     68     71     62        0        0          -   144\n5      BMN                38     28     26     22        0        0          0  9476\n5      Greedy-AIC         99     99     98     97       22       21          0  1997\n6      BMN                28     26     14     26        0        0          -  9672\n6      Greedy-AIC        100    100    100     99       24       15          -  3458\n\nIn Table 1, we count, for each function f \u03c9, the number of runs out of 100 in which f \u03c9 is recovered (\u2016c\u03c9\u2016 \u2260 0). If a recovered function is in the true model, it is considered a true positive; otherwise, a false positive. The main effects are always detected correctly, and thus are not listed in the table. SLasso is more effective compared to BMN, which only considers pairwise interactions.\n\nIn Figure 3, we show the learning results in terms of true positive rate (TPR) as the sample size increases from 100 to 1000. The experimental setting is the same as before. The TPRs improve with increasing sample size. GACV achieves better TPR, but higher FPR, compared to BGACV. Our method outperforms BMN in all six graphs.\n\n5.2 Case Study: Census Bureau County Data\n\nWe use the county data from the U.S. Census Bureau1 to validate our method. We remove the counties that have missing values and obtain 2668 entries in total. The outcomes of this study are summarized in Table 2. 
\u201cVote\u201d [21] is coded as 1 if the Republican candidate won in the 2004 presidential election. To dichotomize the remaining outcomes, the national mean is selected as a threshold. The data is standardized to mean 0 and variance 1. The following features are included: housing unit change in percent from 2000-2006, percent of ethnic groups, percent foreign born, percent people over 65, percent people under 18, percent people with a high school education, percent people with a bachelor\u2019s degree; birth rate, death rate, and per capita government expenditure in dollars. By adjusting \u03bb, we observe new interactions enter the model. The graph structure of \u03bb = 0.1559 is\n\n1http://www.census.gov/statab/www/ccdb.html\n\n[Figure 3 panels: (a) Graph 1 (5%), (b) Graph 2 (5%), (c) Graph 3 (1%), (d) Graph 4 (0.5%), (e) Graph 5 (< 10^\u221220), (f) Graph 6 (< 10^\u221220); each panel plots True Positive Rate against Sample Size for GACV, BGACV, and BMN (AIC and BMN for Graphs 5 and 6).]\n\nFigure 3: The True 
Positive Rate (TPR) of graph structure learning methods with increasing sample size. The percentage in the bracket is the upper bound of the False Positive Rate (FPR) in each experiment. BMN always has larger FPR compared to SLasso.\n\nTable 2: Selected response variables\n\nResponse  Description                                        Positive%\nVote      2004 votes for Republican presidential candidate       81.11\nPoverty   Poverty Rate                                           52.70\nVCrime    Violent Crime Rate, e.g. murder, robbery               23.09\nPCrime    Property Crime Rate, e.g. burglary                      6.82\nURate     Unemployment Rate                                      51.35\nPChange   Population change in percent from 2000 to 2006         64.96\n\nshown in Figure 4(a). The results of BMN (the tuning parameter is 0.015) are in Figure 4(b). The unemployment rate plays an important role as a hub, as discovered by SLasso but not by BMN.\n\n(a) SLasso-Complete  (b) BMN\n\nFigure 4: Interactions of response variables in the Census Bureau data. The first number on the edge is the order at which the link is recovered. The numbers in brackets are the function norm on the clique and the absolute value of the elements in the concentration matrix, respectively. We note SLasso discovers at the 7th step two third-order interactions, which are displayed by two circles in (a).\n\nWe analyze the link between \u201cVote\u201d and \u201cPChange\u201d. Though the marginal correlation between them (without X) is only 0.0389, which is the second lowest absolute pairwise correlation, the link is recovered first by SLasso. It has been suggested that there is indeed a connection2. This shows that after taking features into account, the dependence structure of response variables may change and hidden relations could be discovered. The main factors in this case are \u201cpercentage of housing unit change\u201d (X1) and \u201cpopulation percentage of people over 65\u201d (X2). 
The part of the\n\ufb01tted model shown below suggests that as housing units increase, the counties are more likely to\nhave both positive results for \u201cVote\u201d and \u201cPChange\u201d. But this tendency will be counteracted by the\nincrease of people over 65: the responses are less likely to take both positive values.\n\n\u02c6f V ote = 0.2913 \u00b7 X1 + 0.3475 \u00b7 X2 + \u00b7 \u00b7 \u00b7\n\u02c6f P Change = 1.4726 \u00b7 X1 \u2212 0.3709 \u00b7 X2 + \u00b7 \u00b7 \u00b7\n\u02c6f V ote,P Change = 0.1358 \u00b7 X1 \u2212 0.0458 \u00b7 X2 + \u00b7 \u00b7 \u00b7\n\n6 Conclusions\n\nOur SLasso method can learn the graph structure that is speci\ufb01ed by the conditional log odds ratios\nconditioned on input features X, which allows the graphical model depending on features. The\nmodeling interprets well, since f \u03c9 = 0 iff there is no such clique. An ef\ufb01cient algorithm is given\nto estimate the complete model. A greedy approach is applied when the graph is large. SLasso\ncan be extended to model a general discrete UGM, where Yk takes value in {0, . . . , m \u2212 1}. Also,\nthere exist rich selections of the function forms, which makes the model more \ufb02exible and powerful,\nthough modi\ufb01cation is needed in solving the proximal subproblem for non-parametric families.\n\nA Proof\n\nA.1 Proof of Theorem 2.1\n\nProof. Given UGM (1), the corresponding parameterization in MVB model is shown in (3) of\nLemma 2.1. Conversely, given the MVB model of (2), the cliques can be determined by the nonzero\nf \u03c9: clique C exists if C = \u03c9 and f \u03c9 6= 0. Then the maximal cliques can be inferred from the\ngraph structure. And suppose they are C1, . . . , Cm. Let \u03c9i = Ci, for i = 1, . . . , m, and \u03ba1 = \u2205,\n\u03bai = Ci \u2229 (Ci\u22121 \u222a \u00b7 \u00b7 \u00b7 \u222a C1), i = 2, . . . , m. 
Then the parameterization is:\n\n\u03a6Ci(yCi; x) = exp( S\u03c9i(y; x) \u2212 S\u03bai(y; x) ),   and Z(x) = exp(b(f ))   (9)\n\nwhere S\u03c9(y; x) = \u2211_{\u03ba\u2286\u03c9} y\u03ba f \u03ba(x). Thus, UGM (1) with binary nodes is equivalent to MVB (2).\n\nIn the latter part of the theorem, 1 \u21d2 2 and 3 \u21d2 1 follow naturally from the Markov property of graphical models. To show 2 \u21d2 3, let y\u03c9_C = (y\u03c9_i)i\u2208C be a realization of yC such that y\u03c9_i = 1 if i \u2208 \u03c9 and y\u03c9_i = 0 otherwise. Notice that whenever \u03ba \u2229 C = \u03ba\u2032 \u2229 C, we have y\u03ba_C = y\u03ba\u2032_C. For any possible v = \u03ba \u2229 C, every \u03ba\u2032 \u2208 {\u03ba | \u03ba = v \u222a u, s.t. u \u2286 \u03c9 \u2212 v} will satisfy the condition \u03ba\u2032 \u2229 C = v. There are 2^{|\u03c9\u2212v|} such \u03ba\u2032 in total, due to the choice of u. Also, they appear in the numerator and denominator of equation (3) equally. So, for any C \u2208 C,\n\n\u220f_{\u03ba\u2208\u03a8\u03c9_even} \u03a6C(y\u03ba_C; x) = \u220f_{\u03ba\u2208\u03a8\u03c9_odd} \u03a6C(y\u03ba_C; x)   (10)\n\nIt follows that f \u03c9 = 0 by (3).\n\nA.2 Proof of Theorem 3.1\n\nProof. We give the proof for the linear case. The convexity of I\u03bb is easy to check, since L and the J(f Tv ) are all convex in c. Suppose there is some \u03c92 \u2283 \u03c91 s.t. \u02c6c\u03c92 \u2260 0 and \u02c6c\u03c91 = 0. By the groups constructed through Figure 2, \u2016\u02c6cTv\u2016 = \u2016(\u02c6c\u03c9)v\u2286\u03c9\u2016 \u2260 0 for all v \u2286 \u03c91. 
So the partial derivative of the objective (7) with respect to c\u03c91 at \u02c6c\u03c91 is\n\n\u2202L/\u2202c\u03c91 |_{c\u03c91=\u02c6c\u03c91} + \u03bb \u2211_{v\u2286\u03c91} pv \u02c6c\u03c91/\u2016\u02c6cTv\u2016 = 0   (11)\n\nThus, the probability of {\u02c6c\u03c92 \u2260 0} equals the probability of {\u2202L/\u2202c\u03c91 |_{c\u03c91=\u02c6c\u03c91} = 0}, which is 0.\n\n2http://www.ipsos-mori.com/researchpublications/researcharchive/2545/Analysis-Population-change-turnout-the-election.aspx\n\nReferences\n\n[1] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436\u20131462, 2006.\n\n[2] J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735\u2013746, 2009.\n\n[3] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 38(3):1287\u20131319, 2010.\n\n[4] H. H\u00f6fling and R. Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. The Journal of Machine Learning Research, 10:883\u2013906, 2009.\n\n[5] Han Liu, Xi Chen, John Lafferty, and Larry Wasserman. Graph-valued regression. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1423\u20131431. 2010.\n\n[6] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20138, 2008.\n\n[7] J.K. Bradley and C. Guestrin. Learning tree conditional random fields. In Proceedings of the 27th International Conference on Machine Learning, pages 127\u2013134, 2010.\n\n[8] M. 
Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials.\nIn Proceedings of the International Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), 2010.\n[9] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the\n\nRoyal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n\n[10] L. Jacob, G. Obozinski, and J.P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the\n\n26th Annual International Conference on Machine Learning, pages 433\u2013440, 2009.\n\n[11] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical\n\nvariable selection. Annals of Statistics, 37(6A):3468\u20133497, 2009.\n\n[12] S.J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical report,\n\nDepartment of Computer Science, University of Wisconsin-Madison, 2010.\n\n[13] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference.\n\nFoundations and Trends R(cid:13) in Machine Learning, 1:1\u2013305, 2008.\n\n[14] F. Gao, G. Wahba, R. Klein, and B. Klein. Smoothing Spline ANOVA for multivariate Bernoulli ob-\nservations, with application to ophthalmology data. Journal of the American Statistical Association,\n96(453):127, 2001.\n\n[15] G. Wahba. Spline Models for Observational Data. Society for Industrial Mathematics, 1990.\n[16] R. Jenatton, J.Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms.\n\narXiv:0904.3523, 2009.\n\n[17] S. Kim and E.P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity.\n\nIn\nProceedings of 27th International Conference on Machine Learning, pages 543\u2013550, Haifa, Israel, 2010.\n\n[18] J. Liu and J. Ye. Fast overlapping group lasso. arXiv:1009.0306v1, 2010.\n[19] Xiwen Ma. 
Penalized Regression in Reproducing Kernel Hilbert Spaces With Randomized Covariate\n\nData. PhD thesis, Department of Statistics, University of Wisconsin-Madison, 2010.\n\n[20] K. Koh, S.J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression.\n\nJournal of Machine learning research, 8(8):1519\u20131555, 2007.\n\n[21] R.M. Scammon, A.V. McGillivray, and R. Cook. America Votes 26: 2003-2004, Election Returns By\n\nState. CQ Press, 2005.\n\n9\n\n\f", "award": [], "sourceid": 195, "authors": [{"given_name": "Shilin", "family_name": "Ding", "institution": null}, {"given_name": "Grace", "family_name": "Wahba", "institution": null}, {"given_name": "Jerry", "family_name": "Zhu", "institution": null}]}