{"title": "Graph-Valued Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1423, "page_last": 1431, "abstract": "Undirected graphical models encode in a graph $G$ the dependency structure of a random vector $Y$. In many applications, it is of interest to model $Y$ given another random vector $X$ as input. We refer to the problem of estimating the graph $G(x)$ of $Y$ conditioned on $X=x$ as ``graph-valued regression''. In this paper, we propose a semiparametric method for estimating $G(x)$ that builds a tree on the $X$ space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method ``Graph-optimized CART'', or Go-CART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data.", "full_text": "Graph-Valued Regression\n\nHan Liu, Xi Chen, John Lafferty, Larry Wasserman\n\nCarnegie Mellon University, Pittsburgh, PA 15213\n\nAbstract\n\nUndirected graphical models encode in a graph G the dependency structure of a random vector Y. In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as \u201cgraph-valued regression\u201d. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method \u201cGraph-optimized CART\u201d, or Go-CART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. 
We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data.\n\n1 Introduction\n\nLet Y be a p-dimensional random vector with distribution P. A common way to study the structure of P is to construct the undirected graph G = (V, E), where the vertex set V corresponds to the p components of the vector Y. The edge set E is a subset of the pairs of vertices, where an edge between $Y_j$ and $Y_k$ is absent if and only if $Y_j$ is conditionally independent of $Y_k$ given all the other variables. Suppose now that Y and X are both random vectors, and let $P(\cdot \mid X)$ denote the conditional distribution of Y given X. In a typical regression problem, we are interested in the conditional mean $\mu(x) = \mathbb{E}(Y \mid X = x)$. But if Y is multivariate, we may also be interested in how the structure of $P(\cdot \mid X)$ varies as a function of X. In particular, let G(x) be the undirected graph corresponding to $P(\cdot \mid X = x)$. We refer to the problem of estimating G(x) as graph-valued regression.\n\nLet $\mathcal{G} = \{G(x) : x \in \mathcal{X}\}$ be a set of graphs indexed by $x \in \mathcal{X}$, where $\mathcal{X}$ is the domain of X. Then $\mathcal{G}$ induces a partition of $\mathcal{X}$, denoted $\mathcal{X}_1, \ldots, \mathcal{X}_m$, where $x_1$ and $x_2$ lie in the same partition element if and only if $G(x_1) = G(x_2)$. Graph-valued regression is thus the problem of estimating the partition and estimating the graph within each partition element.\n\nWe present three different partition-based graph estimators: two that use global optimization, and one based on a greedy splitting procedure. One of the optimization-based schemes uses penalized empirical risk minimization; the other uses held-out risk minimization. 
As we show, both methods enjoy strong theoretical properties under relatively weak assumptions; in particular, we establish oracle inequalities on the excess risk of the estimators, and tree partition consistency (under stronger assumptions) in Section 4. While the optimization-based estimators are attractive, they do not scale well computationally when the input dimension is large. An alternative is to adapt the greedy algorithms of classical CART, as we describe in Section 3. In Section 5 we present experimental results on both synthetic data and a meteorological dataset, demonstrating how graph-valued regression can be an effective tool for analyzing high-dimensional data with covariates.\n\n2 Graph-Valued Regression\n\nLet $y_1, \ldots, y_n$ be a random sample of vectors from $P$, where each $y_i \in \mathbb{R}^p$. We are interested in the case where $p$ is large and, in fact, may diverge with $n$ asymptotically. One way to estimate $G$ from the sample is the graphical lasso or glasso [13, 5, 1], where one assumes that $P$ is Gaussian with mean $\mu$ and covariance matrix $\Sigma$. Missing edges in the graph correspond to zero elements in the precision matrix $\Omega = \Sigma^{-1}$ [12, 4, 7]. A sparse estimate of $\Omega$ is obtained by solving\n\n$$\hat{\Omega} = \arg\min_{\Omega \succ 0} \left\{ \mathrm{tr}(S\Omega) - \log|\Omega| + \lambda \|\Omega\|_1 \right\} \quad (1)$$\n\nwhere $\Omega$ is positive definite, $S$ is the sample covariance matrix, and $\|\Omega\|_1 = \sum_{j,k} |\Omega_{jk}|$ is the elementwise $\ell_1$-norm of $\Omega$. A fast algorithm for finding $\hat{\Omega}$ was given by Friedman et al. [5], which involves estimating a single row (and column) of $\Omega$ in each iteration by solving a lasso regression. The theoretical properties of $\hat{\Omega}$ have been studied by Rothman et al. [10] and Ravikumar et al. [9]. In practice, it seems that the glasso yields reasonable graph estimators even if Y is not Gaussian; however, proving conditions under which this happens is an open problem.\n\nWe briefly mention three different strategies for estimating G(x), the graph of Y conditioned on X = x, each of which builds upon the glasso.\n\nParametric Estimators. Assume that $Z = (X, Y)$ is jointly multivariate Gaussian with covariance matrix $\Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{pmatrix}$. We can estimate $\Sigma_X$, $\Sigma_Y$, and $\Sigma_{XY}$ by their corresponding sample quantities $\hat{\Sigma}_X$, $\hat{\Sigma}_Y$, and $\hat{\Sigma}_{XY}$; the marginal precision matrix of $X$, denoted $\Omega_X$, can be estimated using the glasso. The conditional distribution of Y given X = x is obtained by standard Gaussian formulas. In particular, the conditional covariance matrix of $Y \mid X$ is $\hat{\Sigma}_{Y|X} = \hat{\Sigma}_Y - \hat{\Sigma}_{YX}\hat{\Omega}_X\hat{\Sigma}_{XY}$, and a sparse estimate of $\hat{\Omega}_{Y|X}$ can be obtained by directly plugging $\hat{\Sigma}_{Y|X}$ into the glasso. However, the estimated graph does not vary with different values of X.\n\nKernel Smoothing Estimators. We assume that Y given X is Gaussian, but without making any assumption about the marginal distribution of X. Thus $Y \mid X = x \sim N(\mu(x), \Sigma(x))$. 
Under the assumption that both $\mu(x)$ and $\Sigma(x)$ are smooth functions of $x$, we estimate them via kernel smoothing:\n\n$$\hat{\mu}(x) = \frac{\sum_{i=1}^n K\left(\|x - x_i\|/h\right) y_i}{\sum_{i=1}^n K\left(\|x - x_i\|/h\right)}, \qquad \hat{\Sigma}(x) = \frac{\sum_{i=1}^n K\left(\|x - x_i\|/h\right) (y_i - \hat{\mu}(x))(y_i - \hat{\mu}(x))^T}{\sum_{i=1}^n K\left(\|x - x_i\|/h\right)}$$\n\nwhere $K$ is a kernel (e.g. the probability density function of the standard Gaussian distribution), $\|\cdot\|$ is the Euclidean norm, and $h > 0$ is a bandwidth. Now we apply the glasso in (1) with $S = \hat{\Sigma}(x)$ to obtain an estimate of G(x). This method is appealing because it is simple and very similar to nonparametric regression smoothing; the method was analyzed for one-dimensional X in [14]. However, while it is easy to estimate G(x) at any given x, it requires global smoothness of the mean and covariance functions.\n\nPartition Estimators. In this approach, we partition $\mathcal{X}$ into finitely many connected regions $\mathcal{X}_1, \ldots, \mathcal{X}_m$. Within each $\mathcal{X}_j$, we apply the glasso to get an estimated graph $\hat{G}_j$. We then take $\hat{G}(x) = \hat{G}_j$ for all $x \in \mathcal{X}_j$. To find the partition, we appeal to the idea used in CART (classification and regression trees) [3]. We take the partition elements to be recursively defined hyperrectangles. As is well known, we can then represent the partition by a tree, where each leaf node corresponds to a single partition element. In CART, the leaves are associated with the means within each partition element; in our case, there will be an estimated undirected graph for each leaf node. We refer to this method as Graph-optimized CART, or Go-CART. 
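As a concrete illustration of the kernel smoothing strategy above, the following is a minimal numpy sketch (our own illustration, not code from the paper): it computes the kernel-weighted estimates of $\mu(x)$ and $\Sigma(x)$, and evaluates the glasso objective of (1) at a candidate precision matrix. A real estimator would then minimize that objective over sparse positive definite matrices with a graphical lasso solver; the function names here are ours.

```python
import numpy as np

def kernel_mean_cov(x, X, Y, h):
    """Kernel-weighted estimates of mu(x) and Sigma(x) using a Gaussian kernel.
    X: (n, d) covariates, Y: (n, p) responses, h: bandwidth."""
    d2 = ((X - x) ** 2).sum(axis=1)        # squared distances ||x - x_i||^2
    w = np.exp(-0.5 * d2 / h ** 2)         # K(||x - x_i|| / h), up to constants
    w = w / w.sum()                        # normalized kernel weights
    mu = w @ Y                             # weighted mean
    R = Y - mu                             # residuals about the local mean
    Sigma = (w[:, None] * R).T @ R         # weighted covariance
    return mu, Sigma

def glasso_objective(S, Omega, lam):
    """Objective of (1): tr(S Omega) - log|Omega| + lam * ||Omega||_1,
    with the elementwise l1 norm (diagonal included, for simplicity)."""
    _, logdet = np.linalg.slogdet(Omega)
    return np.trace(S @ Omega) - logdet + lam * np.abs(Omega).sum()
```

Note that the weights depend on the query point x, so the implied graph can change with x, in contrast to the parametric estimator.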
The remainder of this paper is devoted to the details of this method.\n\n3 Graph-Optimized CART\n\nLet $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}^p$ be two random vectors, and let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be $n$ i.i.d. samples from the joint distribution of $(X, Y)$. The domains of $X$ and $Y$ are denoted by $\mathcal{X}$ and $\mathcal{Y}$ respectively; for simplicity we take $\mathcal{X} = [0,1]^d$. We assume that\n\n$$Y \mid X = x \sim N_p(\mu(x), \Sigma(x))$$\n\nwhere $\mu : \mathbb{R}^d \to \mathbb{R}^p$ is a vector-valued mean function and $\Sigma : \mathbb{R}^d \to \mathbb{R}^{p \times p}$ is a matrix-valued covariance function. We also assume that for each $x$, $\Omega(x) = \Sigma(x)^{-1}$ is a sparse matrix, i.e., many elements of $\Omega(x)$ are zero. In addition, $\Omega(x)$ may also be a sparse function of $x$, i.e., $\Omega(x) = \Omega(x_R)$ for some $R \subset \{1, \ldots, d\}$ with cardinality $|R| \ll d$. The task of graph-valued regression is to find a sparse inverse covariance $\hat{\Omega}(x)$ to estimate $\Omega(x)$ for any $x \in \mathcal{X}$; in some situations the graph of $\Omega(x)$ is of greater interest than the entries of $\Omega(x)$ themselves.\n\nGo-CART is a partition-based conditional graph estimator. We partition $\mathcal{X}$ into finitely many connected regions $\mathcal{X}_1, \ldots, \mathcal{X}_m$, and within each $\mathcal{X}_j$ we apply the glasso to estimate a graph $\hat{G}_j$. We then take $\hat{G}(x) = \hat{G}_j$ for all $x \in \mathcal{X}_j$. To find the partition, we restrict ourselves to dyadic splits, as studied by [11, 2]. The primary reason for such a choice is the computational and theoretical tractability of dyadic partition based estimators.\n\nLet $\mathcal{T}$ denote the set of dyadic partitioning trees (DPTs) defined over $\mathcal{X} = [0,1]^d$, where each DPT $T \in \mathcal{T}$ is constructed by recursively dividing $\mathcal{X}$ by means of axis-orthogonal dyadic splits. Each node of a DPT corresponds to a hyperrectangle in $[0,1]^d$. If a node is associated to the hyperrectangle $A = \prod_{l=1}^d [a_l, b_l]$, then after being dyadically split along dimension $k$, the two children are associated with the sub-hyperrectangles $A_L^{(k)} = \{x \in A : x_k \le (a_k + b_k)/2\}$ and $A_R^{(k)} = A \setminus A_L^{(k)}$. Given a DPT $T$, we denote by $\Pi(T) = \{\mathcal{X}_1, \ldots, \mathcal{X}_{m_T}\}$ the partition of $\mathcal{X}$ induced by the leaf nodes of $T$. For a dyadic integer $N = 2^K$, we define $\mathcal{T}_N$ to be the collection of all DPTs such that no partition element has a side length smaller than $2^{-K}$. Let $I(\cdot)$ denote the indicator function. We denote by $\mu_T(x)$ and $\Omega_T(x)$ the piecewise constant mean and precision functions associated with $T$:\n\n$$\mu_T(x) = \sum_{j=1}^{m_T} \mu_{\mathcal{X}_j} I(x \in \mathcal{X}_j), \qquad \Omega_T(x) = \sum_{j=1}^{m_T} \Omega_{\mathcal{X}_j} I(x \in \mathcal{X}_j).$$\n\nThe population and empirical risks are\n\n$$R(T, \mu_T, \Omega_T) = \mathbb{E}\left[ \sum_{j=1}^{m_T} \left( \mathrm{tr}\left( \Omega_{\mathcal{X}_j} (Y - \mu_{\mathcal{X}_j})(Y - \mu_{\mathcal{X}_j})^T \right) - \log|\Omega_{\mathcal{X}_j}| \right) I(X \in \mathcal{X}_j) \right] \quad (2)$$\n\n$$\hat{R}(T, \mu_T, \Omega_T) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{m_T} \left( \mathrm{tr}\left( \Omega_{\mathcal{X}_j} (y_i - \mu_{\mathcal{X}_j})(y_i - \mu_{\mathcal{X}_j})^T \right) - \log|\Omega_{\mathcal{X}_j}| \right) I(x_i \in \mathcal{X}_j). \quad (3)$$\n\nLet $[[T]] > 0$ denote a prefix code over all DPTs $T \in \mathcal{T}_N$ satisfying $\sum_{T \in \mathcal{T}_N} 2^{-[[T]]} \le 1$. One such prefix code $[[T]]$ is proposed in [11], and takes the form $[[T]] = 3|\Pi(T)| - 1 + (|\Pi(T)| - 1)\log d/\log 2$. A simple upper bound for $[[T]]$ is\n\n$$[[T]] \le (3 + \log d/\log 2)\,|\Pi(T)|. \quad (4)$$\n\nOur analysis will assume that the conditional means and precision matrices are bounded in the $\|\cdot\|_\infty$ and $\|\cdot\|_1$ norms; specifically, we suppose there is a positive constant $B$ and a sequence $L_{1,n}, \ldots, L_{m_T,n}$, where each $L_{j,n} \in \mathbb{R}_+$ is a function of the sample size $n$, and we define the domains of each $\mu_{\mathcal{X}_j}$ and $\Omega_{\mathcal{X}_j}$ as\n\n$$\mathcal{M}_j = \{\mu \in \mathbb{R}^p : \|\mu\|_\infty \le B\}, \qquad \Lambda_j = \{\Omega \in \mathbb{R}^{p \times p} : \Omega \text{ positive definite, symmetric, } \|\Omega\|_1 \le L_{j,n}\}. \quad (5)$$\n\nWith this notation in place, we can now define two estimators.\n\nDefinition 1. The penalized empirical risk minimization Go-CART estimator is defined as\n\n$$\left( \hat{T}, \{\hat{\mu}_{\hat{\mathcal{X}}_j}\}, \{\hat{\Omega}_{\hat{\mathcal{X}}_j}\} \right) = \mathop{\mathrm{argmin}}_{T \in \mathcal{T}_N,\, \mu_{\mathcal{X}_j} \in \mathcal{M}_j,\, \Omega_{\mathcal{X}_j} \in \Lambda_j} \left\{ \hat{R}(T, \mu_T, \Omega_T) + \mathrm{pen}(T) \right\}$$\n\nwhere $\hat{R}$ is defined in (3) and $\mathrm{pen}(T) = \gamma_n \cdot m_T \sqrt{\frac{[[T]]\log 2 + 2\log(np)}{n}}$.\n\nEmpirically, we may always set the dyadic integer $N$ to be a reasonably large value; the regularization parameter $\gamma_n$ is responsible for selecting a suitable DPT $T \in \mathcal{T}_N$.\n\nWe also formulate an estimator that minimizes held-out risk. Practically, we split the data into two partitions: $D_1 = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\}$ for training and $D_2 = \{(x'_1, y'_1), \ldots, (x'_{n_2}, y'_{n_2})\}$ for validation, with $n_1 + n_2 = n$. The held-out negative log-likelihood risk is then given by\n\n$$\hat{R}_{\mathrm{out}}(T, \mu_T, \Omega_T) = \frac{1}{n_2} \sum_{i=1}^{n_2} \sum_{j=1}^{m_T} \left( \mathrm{tr}\left( \Omega_{\mathcal{X}_j} (y'_i - \mu_{\mathcal{X}_j})(y'_i - \mu_{\mathcal{X}_j})^T \right) - \log|\Omega_{\mathcal{X}_j}| \right) I(x'_i \in \mathcal{X}_j). \quad (6)$$\n\nDefinition 2. For each DPT $T$ define\n\n$$\left( \hat{\mu}_T, \hat{\Omega}_T \right) = \mathop{\mathrm{argmin}}_{\mu_{\mathcal{X}_j} \in \mathcal{M}_j,\, \Omega_{\mathcal{X}_j} \in \Lambda_j} \hat{R}(T, \mu_T, \Omega_T)$$\n\nwhere $\hat{R}$ is defined in (3) but only evaluated on $D_1$. The held-out risk minimization Go-CART estimator is\n\n$$\hat{T} = \mathop{\mathrm{argmin}}_{T \in \mathcal{T}_N} \hat{R}_{\mathrm{out}}(T, \hat{\mu}_T, \hat{\Omega}_T)$$\n\nwhere $\hat{R}_{\mathrm{out}}$ is defined in (6) but only evaluated on $D_2$.\n\nThe above procedures require us to find an optimal dyadic partitioning tree within $\mathcal{T}_N$. 
Although dynamic programming can be applied, as in [2], the computation does not scale to large input dimensions $d$. We now propose a simple yet effective greedy algorithm to find an approximate solution $(\hat{T}, \hat{\mu}_{\hat{T}}, \hat{\Omega}_{\hat{T}})$. We focus on the held-out risk minimization form of Definition 2, due to its superior empirical performance; but note that our greedy approach is generic and can easily be adapted to the penalized empirical risk minimization form.\n\nFirst, consider the simple case in which we are given a dyadic tree structure $T$ that induces a partition $\Pi(T) = \{\mathcal{X}_1, \ldots, \mathcal{X}_{m_T}\}$ on $\mathcal{X}$. For any partition element $\mathcal{X}_j$, we estimate the sample mean using $D_1$:\n\n$$\hat{\mu}_{\mathcal{X}_j} = \frac{1}{\sum_{i=1}^{n_1} I(x_i \in \mathcal{X}_j)} \sum_{i=1}^{n_1} y_i \cdot I(x_i \in \mathcal{X}_j).$$\n\nThe glasso is then used to estimate a sparse precision matrix $\hat{\Omega}_{\mathcal{X}_j}$. More precisely, let $\hat{\Sigma}_{\mathcal{X}_j}$ be the sample covariance matrix for the partition element $\mathcal{X}_j$, given by\n\n$$\hat{\Sigma}_{\mathcal{X}_j} = \frac{1}{\sum_{i=1}^{n_1} I(x_i \in \mathcal{X}_j)} \sum_{i=1}^{n_1} (y_i - \hat{\mu}_{\mathcal{X}_j})(y_i - \hat{\mu}_{\mathcal{X}_j})^T \cdot I(x_i \in \mathcal{X}_j).$$\n\nThe estimator $\hat{\Omega}_{\mathcal{X}_j}$ is obtained by solving $\hat{\Omega}_{\mathcal{X}_j} = \arg\min_{\Omega \succ 0} \{\mathrm{tr}(\hat{\Sigma}_{\mathcal{X}_j}\Omega) - \log|\Omega| + \lambda_j\|\Omega\|_1\}$, where $\lambda_j$ is in one-to-one correspondence with $L_{j,n}$ in (5). In practice, we run the full regularization path of the glasso, from large $\lambda_j$, which yields a very sparse graph, to small $\lambda_j$, and select the graph that minimizes the held-out negative log-likelihood risk. To further improve the model selection performance, we refit the parameters of the precision matrix after the graph has been selected. That is, to reduce the bias of the glasso, we first estimate the sparse precision matrix using $\ell_1$-regularization, and then we refit the Gaussian model without $\ell_1$-regularization, but enforcing the sparsity pattern obtained in the first step.\n\nThe natural, standard greedy procedure starts from the coarsest partition $\mathcal{X} = [0,1]^d$ and then computes the decrease in held-out risk obtained by dyadically splitting each hyperrectangle $A$ along each dimension $k \in \{1, \ldots, d\}$. The dimension $k^*$ that results in the largest decrease in held-out risk is selected, where the change in risk is given by\n\n$$\Delta\hat{R}^{(k)}_{\mathrm{out}}(A, \hat{\mu}_A, \hat{\Omega}_A) = \hat{R}_{\mathrm{out}}(A, \hat{\mu}_A, \hat{\Omega}_A) - \hat{R}_{\mathrm{out}}(A_L^{(k)}, \hat{\mu}_{A_L^{(k)}}, \hat{\Omega}_{A_L^{(k)}}) - \hat{R}_{\mathrm{out}}(A_R^{(k)}, \hat{\mu}_{A_R^{(k)}}, \hat{\Omega}_{A_R^{(k)}}).$$\n\nIf splitting any dimension $k$ of $A$ leads to an increase in the held-out risk, the element $A$ is no longer split and hence becomes a partition element of $\Pi(T)$. The details and pseudocode are provided in the supplementary materials.\n\nThis greedy partitioning method parallels the classical algorithms for classification and regression that have been used in statistical learning for decades. However, the strength of the procedures given in Definitions 1 and 2 is that they lend themselves to a theoretical analysis under relatively weak assumptions, as we show in the following section. 
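The held-out criterion driving these greedy splits can be sketched as follows; this is a simplified illustration under our own naming, with the partition represented as a leaf-lookup function rather than a tree:

```python
import numpy as np

def heldout_risk(X2, Y2, leaf_of, mus, Omegas):
    """Held-out negative log-likelihood risk as in (6): the average over
    held-out points of tr(Omega_j (y - mu_j)(y - mu_j)^T) - log|Omega_j|,
    where j = leaf_of(x) indexes the partition element containing x."""
    total = 0.0
    for x, y in zip(X2, Y2):
        j = leaf_of(x)
        r = y - mus[j]
        _, logdet = np.linalg.slogdet(Omegas[j])
        # tr(Omega r r^T) equals the quadratic form r^T Omega r
        total += r @ Omegas[j] @ r - logdet
    return total / len(Y2)
```

A candidate split along dimension k is then accepted when the parent's held-out risk exceeds the sum of the two children's risks, i.e. when the change in risk is positive.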
The theoretical properties of greedy Go-CART are left to future work.\n\n4 Theoretical Properties\n\nWe define the oracle risk $R^*$ over $\mathcal{T}_N$ as\n\n$$R^* = R(T^*, \mu^*_{T^*}, \Omega^*_{T^*}) = \inf_{T \in \mathcal{T}_N,\, \mu_{\mathcal{X}_j} \in \mathcal{M}_j,\, \Omega_{\mathcal{X}_j} \in \Lambda_j} R(T, \mu_T, \Omega_T).$$\n\nNote that $T^*$, $\mu^*_{T^*}$, and $\Omega^*_{T^*}$ might not be unique, since the finest partition always achieves the oracle risk. To obtain oracle inequalities, we make the following two technical assumptions.\n\nAssumption 1. Let $T \in \mathcal{T}_N$ be an arbitrary DPT which induces a partition $\Pi(T) = \{\mathcal{X}_1, \ldots, \mathcal{X}_{m_T}\}$ on $\mathcal{X}$. We assume that there exists a constant $B$ such that\n\n$$\max_{1 \le j \le m_T} \|\mu_{\mathcal{X}_j}\|_\infty \le B \quad \text{and} \quad \max_{1 \le j \le m_T} \sup_{\Omega \in \Lambda_j} \log|\Omega| \le L_n,$$\n\nwhere $\Lambda_j$ is defined in (5) and $L_n = \max_{1 \le j \le m_T} L_{j,n}$, with $L_{j,n}$ the same as in (5). We also assume that $L_n = o(\sqrt{n})$.\n\nAssumption 2. Let $Y = (Y_1, \ldots, Y_p)^T \in \mathbb{R}^p$. For any $A \subset \mathcal{X}$, we define\n\n$$Z_{k\ell}(A) = Y_k Y_\ell \cdot I(X \in A) - \mathbb{E}(Y_k Y_\ell \cdot I(X \in A)), \qquad Z_j(A) = Y_j \cdot I(X \in A) - \mathbb{E}(Y_j \cdot I(X \in A)).$$\n\nWe assume there exist constants $M_1$, $M_2$, $v_1$, and $v_2$ such that\n\n$$\sup_{k,\ell,A} \mathbb{E}|Z_{k\ell}(A)|^m \le \frac{m!\, M_1^{m-2} v_1}{2} \quad \text{and} \quad \sup_{j,A} \mathbb{E}|Z_j(A)|^m \le \frac{m!\, M_2^{m-2} v_2}{2}$$\n\nfor all $m \ge 2$.\n\nTheorem 1. Let $T \in \mathcal{T}_N$ be a DPT that induces a partition $\Pi(T) = \{\mathcal{X}_1, \ldots, \mathcal{X}_{m_T}\}$ on $\mathcal{X}$. For any $\delta \in (0, 1/4)$, let $(\hat{T}, \hat{\mu}_{\hat{T}}, \hat{\Omega}_{\hat{T}})$ be the estimator obtained using the penalized empirical risk minimization Go-CART of Definition 1, with a penalty term pen(T) of the form\n\n$$\mathrm{pen}(T) = (C_1 + 1)\, L_n m_T \sqrt{\frac{[[T]]\log 2 + 2\log p + \log(48/\delta)}{n}}$$\n\nwhere $C_1 = 8\sqrt{2v_1} + 8\sqrt{v_2} + 8B\sqrt{v_1} + B^2$. Then for sufficiently large $n$, the excess risk inequality\n\n$$R(\hat{T}, \hat{\mu}_{\hat{T}}, \hat{\Omega}_{\hat{T}}) - R^* \le \inf_{T \in \mathcal{T}_N} \left\{ 2\,\mathrm{pen}(T) + \inf_{\mu_{\mathcal{X}_j} \in \mathcal{M}_j,\, \Omega_{\mathcal{X}_j} \in \Lambda_j} \left( R(T, \mu_T, \Omega_T) - R^* \right) \right\}$$\n\nholds with probability at least $1 - \delta$.\n\nA similar oracle inequality holds when using the held-out risk minimization Go-CART.\n\nTheorem 2. Let $T \in \mathcal{T}_N$ be a DPT which induces a partition $\Pi(T) = \{\mathcal{X}_1, \ldots, \mathcal{X}_{m_T}\}$ on $\mathcal{X}$. We define $\phi_n(T)$ to be a function of $n$ and $T$ such that\n\n$$\phi_n(T) = (C_2 + \sqrt{2})\, L_n m_T \sqrt{\frac{[[T]]\log 2 + 2\log p + \log(384/\delta)}{n_2}}$$\n\nwhere $C_2 = 8\sqrt{2v_2} + 8B\sqrt{2B^2}$ and $L_n = \max_{1 \le j \le m_T} L_{j,n}$. Partition the data into $D_1 = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\}$ and $D_2 = \{(x'_1, y'_1), \ldots, (x'_{n_2}, y'_{n_2})\}$ with sizes $n_1 = n_2 = n/2$. Let $(\hat{T}, \hat{\mu}_{\hat{T}}, \hat{\Omega}_{\hat{T}})$ be the estimator constructed using the held-out risk minimization criterion of Definition 2. Then, for sufficiently large $n$, the excess risk inequality\n\n$$R(\hat{T}, \hat{\mu}_{\hat{T}}, \hat{\Omega}_{\hat{T}}) - R^* \le \inf_{T \in \mathcal{T}_N} \left\{ 3\phi_n(T) + \inf_{\mu_{\mathcal{X}_j} \in \mathcal{M}_j,\, \Omega_{\mathcal{X}_j} \in \Lambda_j} \left( R(T, \mu_T, \Omega_T) - R^* \right) \right\} + \phi_n(\hat{T})$$\n\nholds with probability at least $1 - \delta$.\n\nNote that in contrast to the statement in Theorem 1, Theorem 2 results in a stochastic upper bound due to the extra $\phi_n(\hat{T})$ term, which depends on the complexity of the final estimate $\hat{T}$. Due to space limitations, the proofs of both theorems are detailed in the supplementary materials.\n\nWe now temporarily make the strong assumption that the model is correct, so that Y given X is conditionally Gaussian, with a partition structure that is given by a dyadic tree. We show that with high probability, the true dyadic partition structure can be correctly recovered.\n\nAssumption 3. The true model is\n\n$$Y \mid X = x \sim N_p(\mu^*_{T^*}(x), \Omega^*_{T^*}(x))$$\n\nwhere $T^* \in \mathcal{T}_N$ is a DPT with induced partition $\Pi(T^*) = \{\mathcal{X}^*_j\}_{j=1}^{m_{T^*}}$ and\n\n$$\mu^*_{T^*}(x) = \sum_{j=1}^{m_{T^*}} \mu^*_j I(x \in \mathcal{X}^*_j), \qquad \Omega^*_{T^*}(x) = \sum_{j=1}^{m_{T^*}} \Omega^*_j I(x \in \mathcal{X}^*_j).$$\n\nUnder this assumption, clearly\n\n$$R(T^*, \mu^*_{T^*}, \Omega^*_{T^*}) = \inf_{T \in \mathcal{T}_N,\, (\mu_T, \Omega_T) \in \mathcal{M}_T} R(T, \mu_T, \Omega_T), \quad (7)$$\n\nwhere $\mathcal{M}_T$ is given by\n\n$$\mathcal{M}_T = \left\{ (\mu, \Omega) : \mu(x) = \sum_{j=1}^{m_T} \mu_{\mathcal{X}_j} I(x \in \mathcal{X}_j),\ \Omega(x) = \sum_{j=1}^{m_T} \Omega_{\mathcal{X}_j} I(x \in \mathcal{X}_j),\ \mu_{\mathcal{X}_j} \in \mathcal{M}_j,\ \Omega_{\mathcal{X}_j} \in \Lambda_j \right\}.$$\n\nLet $T_1$ and $T_2$ be two DPTs. If $\Pi(T_1)$ can be obtained by further splitting the hyperrectangles within $\Pi(T_2)$, we say $\Pi(T_2) \subset \Pi(T_1)$. We then have the following definitions:\n\nDefinition 3. 
A tree estimation procedure $\hat{T}$ is tree partition consistent in case\n\n$$P\left( \Pi(T^*) \subset \Pi(\hat{T}) \right) \to 1 \quad \text{as } n \to \infty.$$\n\nNote that the estimated partition may be finer than the true partition. Establishing a tree partition consistency result requires further technical assumptions. The following assumption specifies that for arbitrary adjacent subregions of the true dyadic partition, either the means or the variances should be sufficiently different. Without such an assumption, of course, it is impossible to detect the boundaries of the true partition.\n\nAssumption 4. Let $\mathcal{X}^*_i$ and $\mathcal{X}^*_j$ be adjacent partition elements of $T^*$, so that they have a common parent node within $T^*$. Let $\Sigma^*_{\mathcal{X}^*_i} = (\Omega^*_{\mathcal{X}^*_i})^{-1}$. We assume there exist positive constants $c_1, c_2, c_3, c_4$ such that either $\|\mu^*_{\mathcal{X}^*_i} - \mu^*_{\mathcal{X}^*_j}\|_2^2 \ge c_3$ or\n\n$$2\log\left| \frac{\Sigma^*_{\mathcal{X}^*_i} + \Sigma^*_{\mathcal{X}^*_j}}{2} \right| - \log|\Sigma^*_{\mathcal{X}^*_i}| - \log|\Sigma^*_{\mathcal{X}^*_j}| \ge c_4.$$\n\nWe also assume\n\n$$\rho_{\min}(\Omega^*_{\mathcal{X}^*_j}) \ge c_1, \quad \forall j = 1, \ldots, m_{T^*},$$\n\nwhere $\rho_{\min}(\cdot)$ denotes the smallest eigenvalue. Furthermore, for any $T \in \mathcal{T}_N$ and any $A \in \Pi(T)$, we have $P(X \in A) \ge c_2$.\n\nTheorem 3. Under the above assumptions, we have\n\n$$\inf_{T \in \mathcal{T}_N,\, \Pi(T^*) \not\subset \Pi(T)}\ \inf_{(\mu_T, \Omega_T) \in \mathcal{M}_T} R(T, \mu_T, \Omega_T) - R(T^*, \mu^*_{T^*}, \Omega^*_{T^*}) > \min\left\{ \frac{c_1 c_2 c_3}{2},\ c_2 c_4 \right\}$$\n\nwhere $c_1, c_2, c_3, c_4$ are defined in Assumption 4. 
Moreover, the Go-CART estimator in both the penalized risk minimization and held-out risk minimization forms is tree partition consistent.\n\nThis result shows that, with high probability, we obtain a finer partition than $T^*$; the assumptions do not, however, control the size of the resulting partition. The proof of this result appears in the supplementary material.\n\n5 Experiments\n\nWe now present the performance of the greedy partitioning algorithm of Section 3 on both synthetic data and a real meteorological dataset. In the experiments, we always set the dyadic integer $N = 2^{10}$ to ensure that we can obtain fine-grained partitions of the input space $\mathcal{X}$.\n\n5.1 Synthetic Data\n\nWe generate $n$ data points $x_1, \ldots, x_n \in \mathbb{R}^d$ with $n = 10{,}000$ and $d = 10$, uniformly distributed on the unit hypercube $[0,1]^d$. We split the square $[0,1]^2$ defined by the first two dimensions of the unit hypercube into 22 subregions, as shown in Figure 1 (b).\n\nFigure 1: Analysis of synthetic data. (a) Estimated dyadic tree structure; (b) ground-truth partition: the horizontal axis corresponds to the first dimension, denoted X1, and the vertical axis to the second dimension, denoted X2; the bottom left point corresponds to [0, 0] and the upper right point to [1, 1]; this is also the induced partition on $[0,1]^2$, and the number labeled on each subregion corresponds to the leaf node ID of the tree in (a); (c) the held-out negative log-likelihood risk for each split, where the order of the splits corresponds to the ID of the tree node (from small to large).\n\nFor the $t$-th subregion, where $1 \le t \le 22$, we generate an Erd\u0151s\u2013R\u00e9nyi random graph $G^t = (V^t, E^t)$ with $p = 20$ vertices, $|E^t| = 10$ edges, and maximum node degree four. Based on $G^t$, we generate the inverse covariance matrix $\Omega^t$ according to $\Omega^t_{i,j} = I(i = j) + 0.245 \cdot I((i,j) \in E^t)$, where the value 0.245 guarantees the positive definiteness of $\Omega^t$ when the maximum node degree is 4. 
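This construction of the synthetic precision matrices can be reproduced with a short sketch (our own code, with a hypothetical edge set for illustration); strict diagonal dominance, since 4 × 0.245 < 1, is what guarantees positive definiteness at maximum degree 4:

```python
import numpy as np

def make_precision(p, edges, a=0.245):
    """Omega_ij = I(i = j) + a * I((i, j) in E); with max node degree 4 and
    a = 0.245, every row is strictly diagonally dominant, so Omega > 0."""
    Omega = np.eye(p)
    for i, j in edges:
        Omega[i, j] = Omega[j, i] = a   # symmetric off-diagonal entries
    return Omega
```

Responses for a subregion are then drawn from the multivariate Gaussian with covariance equal to the inverse of this matrix.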
For each data point $x_i$ in the $t$-th subregion, we sample a 20-dimensional response vector $y_i$ from the multivariate Gaussian distribution $N_{20}(0, (\Omega^t)^{-1})$. We also create an equally-sized held-out dataset in the same manner, based on $\{\Omega^t\}_{t=1}^{22}$.\n\nThe learned dyadic tree structure and its induced partition are presented in Figure 1, along with the estimated graphs for several nodes. We conduct 100 Monte Carlo simulations and find that in 82 out of 100 runs our algorithm perfectly recovers the ground-truth partition on the X1-X2 plane, and it never wrongly splits any of the irrelevant dimensions X3 through X10. Moreover, the estimated graphs exhibit interesting patterns. Even though the graphs within each subregion are sparse, the estimated graph obtained by pooling all the data together is highly dense. As the greedy algorithm proceeds, the estimated graphs become sparser and sparser; however, for the immediate parents of the leaf nodes, the graphs become denser again. For the 82 simulations in which we correctly identify the tree structure, we list the graph estimation performance for subregions 28, 29, 13, 14, 5, and 6 in terms of precision, recall, and F1-score in Table 1.\n\nTable 1: The graph estimation performance over different subregions. Mean values over 100 runs (standard deviation in parentheses).\n\n            region 28      region 29      region 13      region 14      region 5       region 6\nPrecision   0.8327 (0.15)  0.8429 (0.15)  0.9853 (0.04)  0.9821 (0.05)  0.9906 (0.04)  0.9899 (0.05)\nRecall      0.7890 (0.16)  0.7990 (0.18)  1.0000 (0.00)  1.0000 (0.00)  1.0000 (0.00)  1.0000 (0.00)\nF1-score    0.7880 (0.11)  0.7923 (0.12)  0.9921 (0.02)  0.9904 (0.03)  0.9949 (0.02)  0.9913 (0.02)\n\nWe see that for the larger subregions (e.g. 13, 14, 5, 6) it is easier to obtain good recovery performance, while good recovery for very small regions (e.g. 28, 29) is more challenging. 
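The scores in Table 1 are standard edge-recovery metrics; here is a minimal sketch (our own helper, not from the paper) of how such scores can be computed from a true and an estimated edge set:

```python
def edge_prf(true_edges, est_edges):
    """Precision, recall, and F1-score of an estimated edge set against the
    truth, treating edges as unordered pairs of vertices."""
    T = {frozenset(e) for e in true_edges}
    E = {frozenset(e) for e in est_edges}
    tp = len(T & E)                                   # correctly recovered edges
    precision = tp / len(E) if E else 0.0             # fraction of estimated edges that are real
    recall = tp / len(T) if T else 0.0                # fraction of real edges recovered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```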
We also plot the held-out risk in subplot (c) of Figure 1. As can be seen, the first few splits lead to the most significant decreases in the held-out risk, and the risk curve as a whole exhibits diminishing returns: correctly splitting the large rectangle leads to a significant decrease in the risk, whereas splitting the middle rectangles does not reduce the risk as much. We also conducted simulations where the true conditional covariance matrix is a continuous function of x; these are presented in the supplementary materials.

Figure 2: Analysis of climate data. 
(a) Learned partitions for the 100 locations, projected onto the US map, with the estimated graphs for subregions 3, 10, and 33; (b) estimated graph with data pooled from all 100 locations; (c) the re-scaled partition pattern induced by the learned dyadic tree structure.

5.2 Climate Data Analysis

In this section, we apply Go-CART to a meteorology dataset collected in a manner similar to [8]. The data contain monthly observations of 15 different meteorological factors from 1990 to 2002. We use the data from 1990 to 1995 as the training data and the data from 1996 to 2002 as the held-out validation data. The observations span 100 locations in the US between latitudes 30.475 and 47.975 and longitudes -119.75 and -82.25. The 15 meteorological factors measured for each month include levels of CO2, CH4, H2, and CO, average temperature (TMP), diurnal temperature range (DTR), minimum temperature (TMN), maximum temperature (TMX), precipitation (PRE), vapor (VAP), cloud cover (CLD), wet days (WET), frost days (FRS), global solar radiation (GLO), and direct solar radiation (DIR).

As a baseline, we estimate a sparse graph on the data pooled from all 100 locations, using the glasso algorithm; the estimated graph is shown in Figure 2 (b). It is seen that the greenhouse gas factor CO2 is isolated from all the other factors. This apparently contradicts basic domain knowledge, according to which CO2 should be correlated with the solar radiation factors (including GLO and DIR), as described in the IPCC report [6], one of the most authoritative reports in the field of meteorology. The reason for the missing edges in the pooled data may be that positive correlations at one location are canceled by negative correlations at other locations.

Treating the longitude and latitude of each site as a two-dimensional covariate X, and the measurements of the p = 15 meteorological factors as the response Y, we estimate a dyadic tree structure using the greedy algorithm.
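As a rough sketch of the pooled baseline above, the snippet below fits a graphical lasso to pooled observations and reads a graph off the estimated precision matrix. It uses scikit-learn's GraphicalLasso; the synthetic data, sample size, and regularization level are illustrative assumptions (the real station data are not reproduced here), and in practice the regularization parameter would be tuned on the held-out years.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Stand-in for the pooled data: n monthly observations of the p = 15
# meteorological factors (synthetic placeholder, not the real dataset).
n, p = 600, 15
pooled = rng.standard_normal((n, p))

# Graphical lasso on the pooled observations; larger alpha yields a
# sparser estimated graph.
model = GraphicalLasso(alpha=0.2).fit(pooled)
precision = model.precision_

# An edge (j, k) is present in the estimated graph when the (j, k)
# off-diagonal entry of the precision matrix is nonzero (up to tolerance).
edges = {(j, k) for j in range(p) for k in range(j + 1, p)
         if abs(precision[j, k]) > 1e-8}
```

Pooling all locations into a single fit is exactly what can cancel location-specific positive and negative correlations, which motivates estimating a separate graph per partition element instead.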
The result is a partition with 66 subregions, shown in Figure 2. The graphs for subregions 3 and 10 (corresponding to coastal California and Arizona) are shown in subplot (a) of Figure 2. The graphs for these two adjacent subregions are quite similar, suggesting spatial smoothness of the learned graphs. Moreover, in both graphs CO2 is connected to the solar radiation factor GLO through CH4. In contrast, for subregion 33, which corresponds to the northern part of Arizona, the estimated graph is quite different. In general, we find that the graphs for locations along the coasts are sparser than those for locations in the interior.

Such observations, which require validation and interpretation by domain experts, illustrate the potential of graph-valued regression as a useful tool for high-dimensional data analysis.

References

[1] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485-516, March 2008.

[2] G. Blanchard, C. Schäfer, Y. Rozenholc, and K.-R. Müller. Optimal dyadic decision trees. Machine Learning, 66(2-3):209-241, 2007.

[3] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. Wadsworth, 1984.

[4] D. Edwards. Introduction to Graphical Modelling. Springer-Verlag, 1995.

[5] J. H. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2007.

[6] IPCC. Climate Change 2007: The Physical Science Basis. IPCC Fourth Assessment Report.

[7] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.

[8] A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In ACM SIGKDD, 2009.

[9] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. Model selection in Gaussian graphical models: High-dimensional consistency of l1-regularized MLE. In Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA, 2009.

[10] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008.

[11] C. Scott and R. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 52(4):1335-1353, 2006.

[12] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.

[13] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

[14] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 78(4), 2010.