{"title": "A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation", "book": "Advances in Neural Information Processing Systems", "page_first": 1250, "page_last": 1258, "abstract": "A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.", "full_text": "A Convex Formulation for Learning Scale-Free\n\nNetworks via Submodular Relaxation\n\nAaron J. Defazio\n\nNICTA/Australian National University\n\nCanberra, ACT, Australia\n\naaron.defazio@anu.edu.au\n\nTiberio S. Caetano\n\nNICTA/ANU/University of Sydney\n\nCanberra and Sydney, Australia\n\ntiberio.caetano@nicta.com.au\n\nAbstract\n\nA key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. 
We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.\n\nIntroduction\n\nStructure learning for graphical models is a problem that arises in many contexts. In applied statistics, undirected graphical models can be used as a tool for understanding the underlying conditional independence relations between variables in a dataset. For example, in bioinformatics Gaussian graphical models are fitted to data resulting from micro-array experiments, where the fitted graph can be interpreted as a gene expression network [9].\nIn the context of Gaussian models, the structure learning problem is known as covariance selection [8]. The most common approach is the application of sparsity inducing regularization to the maximum likelihood objective. There is a significant body of literature, more than 30 papers by our count, on various methods of optimizing the L1 regularized covariance selection objective alone (see the recent review by Scheinberg and Ma [17]).\nRecent research has seen the development of structured sparsity, where more complex prior knowledge about a sparsity pattern can be encoded. Examples include group sparsity [22], where parameters are linked so that they are regularized in groups. More complex sparsity patterns, such as region shape constraints in the case of pixels in an image [13], or hierarchical constraints [12], have also been explored.\nIn this paper, we study the problem of recovering the structure of a Gaussian graphical model under the assumption that the graph recovered should be scale-free. Many real-world networks are known a priori to be scale-free, and therefore enforcing that knowledge through a prior seems a natural idea. Recent work has offered an approach to deal with this problem which results in a non-convex formulation [14]. Here we present a convex formulation. 
We show that scale-free networks can be induced by enforcing submodular priors on the network's degree distribution, and then using their convex envelope (the Lovász extension) as a convex relaxation [2]. The resulting relaxed prior has an interesting non-differentiable structure, which poses challenges to optimization. We outline a few options for solving the optimisation problem via proximal operators [3], in particular an efficient dual decomposition method. Experiments on both synthetic data produced by scale-free network models and a real bioinformatics dataset suggest that the convex relaxation is not weak: we can infer scale-free networks with similar or superior accuracy to that in [14].\n\n1 Combinatorial Objective\n\nConsider an undirected graph with edge set E and node set V , where n is the number of nodes. We denote the degree of node v as dE(v), and the complete graph with n nodes as Kn. We are concerned with placing priors on the degree distributions of graphs such as (V, E). By degree distribution, we mean the bag of degrees {dE(v) | v ∈ V }.\nA natural prior on degree distributions can be formed from the family of exponential random graphs [21]. Exponential random graph (ERG) models assign a probability to each n node graph using an exponential family model. The probability of each graph depends on a small set of sufficient statistics; in our case we only consider the degree statistics. An ERG distribution with degree parametrization takes the form:\n\np(G = (V, E); h) = (1/Z(h)) exp[ −Σ_{v∈V} h(dE(v)) ],   (1)\n\nThe degree weighting function h : Z+ → R encodes the preference for each particular degree. 
The function Z is chosen so that the distribution is correctly normalized over n node graphs.\nA number of choices for h are reasonable; a geometric series h(i) ∝ 1 − α^i with α ∈ (0, 1) has been proposed by Snijders et al. [20] and has been widely adopted. However, for encouraging scale-free graphs we require a more rapidly increasing sequence. It is instructive to observe that, under the strong assumption that each node's degree is independent of the rest, h grows logarithmically. To see this, take a scale-free model with scale α; the joint distribution takes the form:\n\np(G = (V, E); ε, α) = (1/Z(ε, α)) Π_{v∈V} (dE(v) + ε)^{−α},\n\nwhere ε > 0 is added to prevent infinite weights. Putting this into ERG form gives the weight sequence h(i) = α log(i + ε). We will consider this and other functions h in Section 4. We intend to perform maximum a posteriori (MAP) estimation of a graph structure using such a distribution as a prior, so the object of our attention is the negative log-posterior, which we denote F:\n\nF(E) = Σ_{v∈V} h(dE(v)) + const.   (2)\n\nSo far we have defined a function on edge sets only; however, in practice we want to optimize over a weighted graph, which is intractable when using discontinuous functions such as F. 
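The combinatorial objective in equation (2) is easy to evaluate directly. A minimal sketch (our own illustrative graph representation, not the paper's code) compares a hub-centred star against a chain under the log weighting h(i) = α log(i + ε):

```python
import math

# Hedged sketch: evaluating F(E) = sum_v h(d_E(v)) for the log degree
# weighting h(i) = alpha * log(i + eps) discussed above. The edge-list
# representation and the two toy graphs are our own choices.

def degree(edges, v):
    """Degree of node v in an undirected edge list."""
    return sum(1 for (a, b) in edges if v in (a, b))

def F(edges, nodes, h):
    """Negative log-prior F(E) = sum over nodes of h(degree), up to a constant."""
    return sum(h(degree(edges, v)) for v in nodes)

alpha, eps = 1.0, 1.0
h = lambda i: alpha * math.log(i + eps)   # concave, non-decreasing, h(0) = 0

nodes = [0, 1, 2, 3]
star = [(0, 1), (0, 2), (0, 3)]           # hub node 0: degrees (3, 1, 1, 1)
chain = [(0, 1), (1, 2), (2, 3)]          # path: degrees (1, 2, 2, 1)

# A concave h charges the long-tailed (star) degree sequence less than the
# more uniform chain, so the star receives higher prior probability.
print(F(star, nodes, h), F(chain, nodes, h))
```

Both graphs have three edges; only the concavity of h separates them, which is the preferential-attachment bias described above.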
We now consider the properties of h that lead to a convex relaxation of F.\n\n2 Submodularity\n\nA set function F : 2^E → R on E is a non-decreasing submodular function if for all A ⊂ B ⊂ E and x ∈ E\B the following conditions hold:\n\nF(A ∪ {x}) − F(A) ≥ F(B ∪ {x}) − F(B)   (submodularity)\nF(A) ≤ F(B)   (non-decreasing)\n\nThe first condition can be interpreted as a diminishing returns condition: adding x to a set A increases F by more than adding it to a larger set B, if B contains A.\nWe now consider a set of conditions that can be placed on h so that F is submodular.\nProposition 1. Denote h as tractable if h is non-decreasing, concave and h(0) = 0. For tractable h, F is a non-decreasing submodular function.\n\nProof. First note that the degree function is a set cardinality function, and hence modular. A concave transformation of a modular function is submodular [1], and the sum of submodular functions is submodular.\n\nThe concavity restriction we impose on h is the key ingredient that allows us to use submodularity to enforce a prior for scale-free networks; any prior favouring long-tailed degree distributions must place a lower weight on new edges joining highly connected nodes than on those joining other nodes. As far as we are aware, this is a novel way of mathematically modelling the ‘preferential attachment’ rule [4] that gives rise to scale-free networks: through non-decreasing submodular functions on the degree distribution.\nLet X denote a symmetric matrix of edge weights. A natural convex relaxation of F would be the convex envelope of F(Supp(X)) under some restricted domain. For tractable h, we have by construction that F satisfies the conditions of Proposition 1 in [2], so that the convex envelope of F(Supp(X)) on the L∞ ball is precisely the Lovász extension evaluated on |X|. 
The Lovász extension for our function is easy to determine as it is a sum of “functions of cardinality”, which are considered in [2]. Below is the result from [2] adapted to our problem.\nProposition 2. Let Xi,(j) be the weight of the jth edge connected to i, under a decreasing ordering by absolute value (i.e. |Xi,(0)| ≥ |Xi,(1)| ≥ ... ≥ |Xi,(n−1)|). The notation (i) maps from sorted order to the natural ordering, with the diagonal not included. Then the convex envelope of F for tractable h over the L∞ norm unit ball is:\n\nΩ(X) = Σ_{i=1}^{n} Σ_{k=0}^{n−2} (h(k + 1) − h(k)) |Xi,(k)|.\n\nThis function is piecewise linear and convex.\n\nThe form of Ω is quite intuitive. It behaves like an L1 norm with an additional weight on each edge that depends on how the edge ranks with respect to the other edges of its neighbouring nodes.\n\n3 Optimization\n\nWe are interested in using Ω as a prior, for optimizations of the form\n\nminimize_X f(X) = g(X) + αΩ(X),\n\nfor convex functions g and prior strength parameters α ∈ R+, over symmetric X. We will focus on the simplest structure learning problem that occurs in graphical model training, that of Gaussian models, in which case we have\n\ng(X) = ⟨X, C⟩ − log det X,\n\nwhere C is the observed covariance matrix of our data. The support of X will then be the set of edges in the undirected graphical model together with the node precisions. This function is a rescaling of the maximum likelihood objective. In order for the resulting X to define a normalizable distribution, X must be restricted to the cone of positive definite matrices. This is not a problem in practice as g(X) is infinite on the boundary of the PSD cone, and hence the constraint can be handled by restricting optimization steps to the interior of the cone. 
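The penalty Ω from Proposition 2 can be evaluated directly with a per-row sort. A minimal sketch (a straightforward reading of the formula, not the paper's implementation):

```python
import numpy as np

# Hedged sketch of Omega(X): for each row, sort the off-diagonal entries by
# decreasing absolute value and weight the k-th largest by h(k+1) - h(k).
# For concave h these weights decrease with k, so the largest entries of a
# row are penalized most, and extra edges at a hub node cost little.

def omega(X, h):
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        off = np.abs(np.delete(X[i], i))   # off-diagonal magnitudes of row i
        off.sort()
        off = off[::-1]                    # decreasing order
        weights = np.array([h(k + 1) - h(k) for k in range(n - 1)])
        total += weights @ off
    return total

h = lambda i: np.log(i + 1.0)              # tractable: concave, h(0) = 0
X = np.array([[1.0, 0.5, -0.2],
              [0.5, 1.0, 0.0],
              [-0.2, 0.0, 1.0]])
print(omega(X, h))
```

As a sanity check, choosing the modular weighting h(i) = i makes every weight equal to one, and Ω reduces to the L1 norm of the off-diagonal entries, matching the "behaves like an L1 norm" description above.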
In fact X can be shown to be in a strictly smaller cone, X* ⪰ aI, for a derivable from C [15]. This restricted domain is useful as g(X) has Lipschitz continuous gradients over X ⪰ aI but not over all positive definite matrices [18].\nThere are a number of possible algorithms that can be applied for optimizing a convex non-differentiable objective such as f. Bach [2] suggests two approaches to optimizing functions involving submodular relaxation priors: a subgradient approach and a proximal approach.\nSubgradient methods are the simplest class of methods for optimizing non-smooth convex functions. They provide a good baseline for comparison with other methods. For our objective, a subgradient is simple to evaluate at any point, due to the piecewise continuous nature of Ω(X). Unfortunately (primal) subgradient methods for our problem will not return sparse solutions except in the limit of convergence. They will instead give intermediate values that oscillate around their limiting values.\nAn alternative is the use of proximal methods [2]. Proximal methods exhibit superior convergence in comparison to subgradient methods, and produce sparse solutions. Proximal methods rely on solving a simpler optimization problem, known as the proximal operator, at each iteration:\n\narg min_X [ αΩ(X) + (1/2)‖X − Z‖₂² ],\n\nwhere Z is a variable that varies at each iteration. For many problems of interest, the proximal operator can be evaluated using a closed form solution. For non-decreasing submodular relaxations, the proximal operator can be evaluated by solving a submodular minimization on a related (not necessarily non-decreasing) submodular function [2].\nBach [2] considers several example problems where the proximal operator can be evaluated using fast graph cut methods. 
For the class of functions we consider, graph-cut methods are not applicable. Generic submodular minimization algorithms could be as slow as O(n^12) for an n-vertex graph, which is clearly impractical [11]. We will instead propose a dual decomposition method for solving this proximal operator problem in Section 3.2.\nFor solving our optimisation problem, instead of using the standard proximal method (sometimes known as ISTA), which involves a gradient step followed by the proximal operator, we propose to use the alternating direction method of multipliers (ADMM), which has shown good results when applied to the standard L1 regularized covariance selection problem [18]. Next we show how to apply ADMM to our problem.\n\n3.1 Alternating direction method of multipliers\n\nThe alternating direction method of multipliers (ADMM, Boyd et al. [6]) is one approach to optimizing our objective that has a number of advantages over the basic proximal method. Let U be the matrix of dual variables for the decoupled problem:\n\nminimize_X g(X) + αΩ(Y),   s.t. X = Y.\n\nFollowing the presentation of the algorithm in Boyd et al. [6], given the values Y^(l) and U^(l) from iteration l, with U^(0) = 0_n and Y^(0) = I_n, the ADMM updates for iteration l + 1 are:\n\nX^(l+1) = arg min_X [ ⟨X, C⟩ − log det X + (ρ/2)||X − Y^(l) + U^(l)||₂² ]\nY^(l+1) = arg min_Y [ αΩ(Y) + (ρ/2)||X^(l+1) − Y + U^(l)||₂² ]\nU^(l+1) = U^(l) + X^(l+1) − Y^(l+1),\n\nwhere ρ > 0 is a fixed step-size parameter (we used ρ = 0.5). The advantage of this form is that both the X and Y updates are a proximal operation. It turns out that the proximal operator for g (i.e. the X^(l+1) update) actually has a simple solution [18] that can be computed by taking an eigenvalue decomposition Q^T Λ Q = ρ(Y − U) − C, where Λ = diag(λ1, . . . 
, λn) and updating the eigenvalues using the formula\n\nλ'_i := (λ_i + √(λ_i² + 4ρ)) / (2ρ)\n\nto give X = Q^T Λ' Q. The stopping criterion we used was ||X^(l+1) − Y^(l+1)|| < ε and ||Y^(l+1) − Y^(l)|| < ε. In practice the ADMM method is one of the fastest methods for L1 regularized covariance selection. Scheinberg et al. [18] show that convergence is guaranteed if additional cone restrictions are placed on the minimization with respect to X, and small enough step sizes are used. For our degree prior regularizer, the difficulty is in computing the proximal operator for Ω, as the rest of the algorithm is identical to that presented in Boyd et al. [6]. We now show how we solve the problem of computing the proximal operator for Ω.\n\n3.2 Proximal operator using dual decomposition\n\nHere we describe the optimisation algorithm that we effectively use for computing the proximal operator. The regularizer Ω has a quite complicated structure due to the interplay between the terms involving the two end points for each edge. We can decouple these terms using the dual decomposition technique, by writing the proximal operation for a given Z = Y − U as:\n\nminimize_X (α/ρ) Σ_i Σ_k (h(k + 1) − h(k)) |Xi,(k)| + (1/2)||X − Z||₂²,   s.t. X = X^T.\n\nThe only difference so far is that we have made the symmetry constraint explicit. Taking the dual gives a formulation where the upper and lower triangle are treated as separate variables. The dual variable matrix V corresponds to the Lagrange multipliers of the symmetry constraint, which for notational convenience we store in an anti-symmetric matrix. 
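The eigenvalue-based X-update described in Section 3.1 is compact enough to sketch in full. A minimal version (numpy's `eigh` convention A = Q diag(λ) Qᵀ, so we apply Q on the left; the random covariance is a stand-in):

```python
import numpy as np

# Hedged sketch of the ADMM X-update for g(X) = <X,C> - log det X:
# eigendecompose rho*(Y - U) - C and map each eigenvalue through
# lambda' = (lambda + sqrt(lambda^2 + 4*rho)) / (2*rho).

def x_update(C, Y, U, rho):
    lam, Q = np.linalg.eigh(rho * (Y - U) - C)
    lam_new = (lam + np.sqrt(lam ** 2 + 4.0 * rho)) / (2.0 * rho)
    return Q @ np.diag(lam_new) @ Q.T

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = A @ A.T / 4.0 + np.eye(4)        # stand-in covariance matrix
Y, U, rho = np.eye(4), np.zeros((4, 4)), 0.5

X = x_update(C, Y, U, rho)
# The mapped eigenvalues solve rho*l'^2 - l*l' - 1 = 0 (positive root), so X
# satisfies the stationarity condition C - X^{-1} + rho*(X - Y + U) = 0 and
# is positive definite by construction.
print(np.linalg.eigvalsh(X).min())
```

Checking the stationarity residual ρX − X⁻¹ − (ρ(Y − U) − C) numerically is a cheap way to validate any implementation of this step.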
The dual decomposition method is given in Algorithm 1.\n\nAlgorithm 1 Dual decomposition main\n\ninput: matrix Z, constants α, ρ\ninput: step-size 0 < η < 1\ninitialize: X = Z\ninitialize: V = 0_n\nrepeat\n  for l = 0 until n − 1 do\n    X_l∗ = solveSubproblem(Z_l∗, V_l∗)   # Algorithm 2\n  end for\n  V = V + η(X − X^T)\nuntil ||X − X^T|| < 10^−6\nX = (1/2)(X + X^T)   # symmetrize\nround: any |Xij| < 10^−15 to 0\nreturn X\n\nWe use the notation X_i∗ to denote the ith row of X. Since this is a dual method, the primal variables X are not feasible (i.e. symmetric) until convergence. Essentially we have decomposed the original problem, so that now we only need to solve the proximal operation for each node in isolation, namely the subproblems:\n\n∀i.  X^(l+1)_i∗ = arg min_x (α/ρ) Σ_k (h(k + 1) − h(k)) |x_(k)| + ||x − Z_i∗ + V^(l)_i∗||₂².   (3)\n\nNote that the dual variable has been integrated into the quadratic term by completing the square. As the diagonal elements of X are not included in the sort ordering, they will be minimized by Xii = Zii, for all i. Each subproblem is strongly convex, as each consists of convex terms plus a positive quadratic term. This implies that the dual problem is differentiable (as the subdifferential contains only one subgradient), hence the V update is actually gradient ascent. Since a fixed step size is used, and the dual is Lipschitz continuous, for sufficiently small step-size convergence is guaranteed. In practice we used η = 0.9 for all our tests.\nThis dual decomposition subproblem can also be interpreted as just a step within the ADMM framework. If applied in a standard way, only one dual variable update would be performed before another expensive eigenvalue decomposition step. 
Since each iteration of the dual decomposition is much faster than the eigenvalue decomposition, it makes more sense to treat it as a separate problem, as we propose here. It also ensures that the eigenvalue decomposition is only performed on symmetric matrices.\nEach subproblem in our decomposition is still non-trivial. The subproblems do have a closed form solution, involving a sort and several passes over the node's edges, as described in Algorithm 2.\nProposition 3. Algorithm 2 solves the subproblem in equation 3.\nProof: See Appendix 1 in the supplementary material. The main subtlety is the grouping together of elements induced at the non-differentiable points. If multiple edges connected to the same node have the same absolute value, their subdifferentials become the same, and they behave as a single point whose weight is the average. To handle this grouping, we use a disjoint-set data structure, where each xj is either in a singleton set, or grouped in a set with other elements whose absolute value is the same.\n\n4 Alternative degree priors\n\nUnder the restrictions on h detailed in Proposition 1, several other choices seem reasonable. 
Algorithm 2 Dual decomposition subproblem (solveSubproblem)\n\ninput: vectors z, v\ninitialize: disjoint-set data structure with set membership function γ\nw = z − v   # w gives the sort order\nu = 0_n\nbuild: sorted-to-original position function µ under descending absolute value order of w, excluding the diagonal\nfor k = 0 until n − 1 do\n  j = µ(k)\n  u_j = |w_j| − (α/ρ)(h(k + 1) − h(k))\n  γ(j).value = u_j\n  r = k\n  while r > 1 and γ(µ(r)).value ≥ γ(µ(r − 1)).value do\n    join: the sets containing µ(r) and µ(r − 1)\n    γ(µ(r)).value = (1/|γ(µ(r))|) Σ_{i∈γ(µ(r))} u_i\n    set: r to the first element of γ(µ(r)) by the sort ordering\n  end while\nend for\nfor i = 1 to n do\n  x_i = γ(i).value\n  if x_i < 0 then\n    x_i = 0   # negative values imply shrinkage to 0\n  end if\n  if w_i < 0 then\n    x_i = −x_i   # correct orthant\n  end if\nend for\nreturn x\n\nThe scale-free prior can be smoothed somewhat by the addition of a linear term, giving\n\nh_{ε,β}(i) = log(i + ε) + βi,\n\nwhere β controls the strength of the smoothing. A slower diminishing choice would be a square-root function such as\n\nh_β(i) = (i + 1)^{1/2} − 1 + βi.\n\nThis requires the linear term in order to correspond to a normalizable prior.\nIdeally we would choose h so that the expected degree distribution under the ERG model matches the particular form we wish to encourage. Finding such an h for a particular graph size and degree distribution amounts to maximum likelihood parameter learning, which for ERG models is a hard learning problem. The most common approach is to use sampling based inference. 
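Whether a candidate h is "tractable" in the sense of Proposition 1 can be checked numerically on the integers. A minimal sketch (the discrete concavity test via second differences is our own choice; we fix ε = 1 in the smoothed log prior so that h(0) = 0):

```python
import numpy as np

# Hedged numeric check of Proposition 1's conditions: h(0) = 0,
# non-decreasing first differences >= 0, and concavity (non-increasing
# first differences) on the integers 0..n-1.

def is_tractable(h, n=50):
    vals = np.array([h(i) for i in range(n)])
    diffs = np.diff(vals)
    return (abs(vals[0]) < 1e-12
            and np.all(diffs >= 0)
            and np.all(np.diff(diffs) <= 1e-12))

beta = 0.1
h_log = lambda i: np.log(i + 1.0) + beta * i          # smoothed log prior, eps = 1
h_root = lambda i: np.sqrt(i + 1.0) - 1.0 + beta * i  # square-root prior

print(is_tractable(h_log), is_tractable(h_root))
```

A convex weighting such as h(i) = i² fails the concavity test, which is the property that would otherwise penalize hub formation rather than encourage it.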
Approaches based on Markov chain Monte Carlo techniques have been applied widely to ERG models [19] and are therefore applicable to our model.\n\n5 Related Work\n\nThe covariance selection problem has recently been addressed by Liu and Ihler [14] using reweighted L1 regularization. They minimize the following objective:\n\nf(X) = ⟨X, C⟩ − log det X + α Σ_{v∈V} log(‖X_¬v‖ + ε) + β Σ_v |Xvv|.\n\nThe regularizer is split into an off-diagonal term which is designed to encourage sparsity in the edge parameters, and a more traditional diagonal term. Essentially they use ‖X_¬v‖ as the continuous counterpart of node v's degree. The biggest difficulty with this objective is the log term, which makes f highly non-convex. This can be contrasted to our approach, where we start with essentially the same combinatorial prior, but we use an alternative, convex relaxation.\nThe reweighted L1 [7] aspect refers to the method of optimization applied. A double loop method is used, in the same class as EM methods and difference of convex programming, where each L1 inner problem gives a monotonically improving lower bound on the true solution.\n\nFigure 1: ROC curves for BA model (left) and fixed degree distribution model (right)\n\nFigure 2: Reconstruction of a gene association network using L1 (left), submodular relaxation (middle), and reweighted L1 (right) methods\n\n6 Experiments\n\nReconstruction of synthetic networks. We performed a comparison against the reweighted L1 method of Liu and Ihler [14], and a standard L1 regularized method, both implemented using ADMM for optimization. Although Liu and Ihler [14] use the glasso [10] method for the inner loop, ADMM will give identical results, and is usually faster [18]. Graphs with 60 nodes were generated using both the Barabási–Albert model [4] and a predefined degree distribution model sampled using the method from Bayati et al. [5] implemented in the NetworkX software package. Both methods generate scale-free graphs; the BA model exhibits a scale parameter of 3.0, whereas we fixed the scale parameter at 2.0 for the other model. To define a valid Gaussian model, edge weights of Xij = −0.2 were assigned, and the node weights were set at Xii = 0.5 − Σ_{i≠j} Xij so as to make the resulting precision matrix diagonally dominant. The resulting Gaussian graphical model was sampled 500 times. The covariance matrix of these samples was formed, then normalized to have diagonal uniformly 1.0. We tested with the two h sequences described in Section 4. The parameters for the degree weight sequences were chosen by grid search on random instances separate from those we tested on. The resulting ROC curves for the Hamming reconstruction loss are shown in Figure 1. Results were averaged over 30 randomly generated graphs for each figure.\nWe can see from the plots that our method with the square-root weighting presents results superior to those from Liu and Ihler [14] for these datasets. This is encouraging, particularly since our formulation is convex while the one from Liu and Ihler [14] isn't. Interestingly, the log based weights give very similar but not identical results to the reweighting scheme, which also uses a log term. The only case where it gives inferior reconstructions is when it is forced to give a sparser reconstruction than the original graph.\nReconstruction of a gene activation network. A common application of sparse covariance selection is the estimation of gene association networks from experimental data. A covariance matrix of gene co-activations from a number of independent micro-array experiments is typically formed, on which a number of methods, including sparse covariance selection, can be applied. 
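The synthetic set-up for the first experiment can be sketched end to end. A minimal version (the tiny preferential-attachment sampler below is our own stand-in for the NetworkX generators the paper uses; the −0.2 edge weights and the diagonally dominant diagonal follow the description above):

```python
import numpy as np

# Hedged sketch: generate a small Barabasi-Albert-style graph, set edge
# weights to -0.2, and set X_ii = 0.5 - sum_j X_ij so the precision matrix
# is diagonally dominant, hence positive definite; then sample from it.

def ba_graph(n, m, rng):
    """Each new node attaches to m distinct targets drawn proportional to degree."""
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]  # seed clique
    targets = [v for e in edges for v in e]
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(targets[rng.integers(len(targets))])
        for u in chosen:
            edges.append((u, v))
            targets += [u, v]
    return edges

rng = np.random.default_rng(1)
n = 20
edges = ba_graph(n, 2, rng)

X = np.zeros((n, n))
for (u, v) in edges:
    X[u, v] = X[v, u] = -0.2
np.fill_diagonal(X, 0.5 - X.sum(axis=1))     # diagonal dominance

# Sample the Gaussian graphical model; the covariance is the inverse precision.
samples = rng.multivariate_normal(np.zeros(n), np.linalg.inv(X), size=500)
print(np.linalg.eigvalsh(X).min() > 0)
```

The empirical covariance of `samples` (normalized to unit diagonal) is then the input C to the structure learning methods compared in Figure 1.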
Sparse estimation is key for a consistent reconstruction due to the small number of experiments performed. Many biological networks are conjectured to be scale-free, and additionally ERG modelling techniques are known to produce good results on biological networks [16]. So we consider micro-array datasets a natural test-bed for our method. We ran our method and the L1 reconstruction method on the first 500 genes from the GDS1429 dataset (http://www.ncbi.nlm.nih.gov/gds/1429), which contains 69 samples for 8565 genes. The parameters for both methods were tuned to produce a network with near to 50 edges for visualization purposes. The major connected component for each is shown in Figure 2.\nWhile these networks are too small for valid statistical analysis of the degree distribution, the submodular relaxation method produces a network with structure that is commonly seen in scale-free networks. The star subgraph centered around gene 60 is more clearly defined in the submodular relaxation reconstruction, and the tight cluster of genes on the right is less clustered in the L1 reconstruction. The reweighted L1 method produced a quite different reconstruction, with greater clustering.\n\nFigure 3: Comparison of proximal operators\n\nRuntime comparison: different proximal operator methods. We performed a comparison against two other methods for computing the proximal operator: subgradient descent and the minimum norm point (MNP) algorithm. The MNP algorithm is a submodular minimization method that can be adapted for computing the proximal operator [2]. We took the input parameters from the last invocation of the proximal operator in the BA test, at a prior strength of 0.7. 
We then plotted the convergence rate of each of the methods, shown in Figure 3. As the tests are on randomly generated graphs, we present only a representative example. It is clear from this and similar tests that we performed that the subgradient descent method converges too slowly to be of practical applicability for this problem. Subgradient methods can be a good choice when only a low accuracy solution is required; for convergence of ADMM the error in the proximal operator needs to be smaller than what can be obtained by the subgradient method. The MNP method also converges slowly for this problem; however, it achieves a low but usable accuracy quickly enough that it could be used in practice. The dual decomposition method achieves a much better rate of convergence, converging quickly enough to be of use even for strong accuracy requirements.\nThe time for individual iterations of each of the methods was 0.65 ms for subgradient descent, 0.82 ms for dual decomposition and 15 ms for the MNP method. The speed difference is small between a subgradient iteration and a dual decomposition iteration, as both are dominated by the cost of a sort operation. The cost of an MNP iteration is dominated by two least squares solves, whose running time in the worst case is proportional to the square of the current iteration number. Overall, it is clear that our dual decomposition method is significantly more efficient.\nRuntime comparison: submodular relaxation against other approaches. The running time of the three methods we tested is highly dependent on implementation details, so the following speed comparison should be taken as a rough guide. For a sparse reconstruction of a BA model graph with 100 vertices and 200 edges, the average running time to a 10^−4 error reconstruction, over 10 random graphs, was 16 seconds for the reweighted L1 method and 5.0 seconds for the submodular relaxation method. 
This accuracy level was chosen so that the active edge set for both methods had stabilized between iterations. For comparison, the standard L1 method was significantly faster, taking only 0.72 seconds on average.\n\nConclusion\n\nWe have presented a new prior for graph reconstruction, which enforces the recovery of scale-free networks. This prior falls within the growing class of structured sparsity methods. Unlike previous approaches to regularizing the degree distribution, our proposed prior is convex, making training tractable and convergence predictable. Our method can be directly applied in contexts where sparse covariance selection is currently used, where it may improve the reconstruction quality.\n\nAcknowledgements\n\nNICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.\n\nReferences\n\n[1] Francis Bach. Convex analysis and optimization with submodular functions: a tutorial. Technical report, INRIA, 2010.\n[2] Francis Bach. Structured sparsity-inducing norms through submodular functions. NIPS, 2010.\n[3] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012.\n[4] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.\n[5] Mohsen Bayati, Jeong Han Kim, and Amin Saberi. A sequential algorithm for generating random graphs. Algorithmica, 58, 2009.\n[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 2011.\n[7] Emmanuel J. Candes, Michael B. 
Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 2008.\n[8] A. P. Dempster. Covariance selection. Biometrics, 28:157–175, 1972.\n[9] Adrian Dobra, Chris Hans, Beatrix Jones, Joseph R Nevins, and Mike West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 2004.\n[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.\n[11] Satoru Fujishige. Submodular Functions and Optimization. Elsevier, 2005.\n[12] Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, and Francis Bach. Proximal methods for sparse hierarchical dictionary learning. ICML, 2010.\n[13] Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Structured sparse principal component analysis. AISTATS, 2010.\n[14] Qiang Liu and Alexander Ihler. Learning scale free networks by reweighted l1 regularization. AISTATS, 2011.\n[15] Zhaosong Lu. Smooth optimization approach for sparse covariance selection. SIAM J. Optim., 2009.\n[16] Zachary M. Saul and Vladimir Filkov. Exploring biological network structure using exponential random graph models. Bioinformatics, 2007.\n[17] Katya Scheinberg and Shiqian Ma. Optimization for Machine Learning, chapter 17: optimization methods for sparse inverse covariance selection. MIT Press, 2011.\n[18] Katya Scheinberg, Shiqian Ma, and Donald Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In NIPS, 2010.\n[19] T. Snijders. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 2002.\n[20] Tom A.B. Snijders, Philippa E. Pattison, and Mark S. Handcock. New specifications for exponential random graph models. Technical report, University of Washington, 2004.\n[21] Alan Terry. Exponential random graphs. 
Master\u2019s thesis, University of York, 2005.\n[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 2007.\n", "award": [], "sourceid": 615, "authors": [{"given_name": "Aaron", "family_name": "Defazio", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}