{"title": "Semiparametric Differential Graph Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1072, "abstract": "In many cases of network analysis, it is more attractive to study how a network varies under  different conditions than an individual static network. We propose a novel graphical model, namely Latent Differential Graph Model, where the networks under two different conditions are represented by two semiparametric elliptical distributions respectively, and the variation of these two networks (i.e., differential graph) is characterized by the difference between their latent precision matrices. We propose an estimator for the differential graph based on quasi likelihood maximization with nonconvex regularization. We show that our estimator attains a faster statistical rate in parameter estimation than the state-of-the-art methods, and enjoys oracle property under mild conditions. Thorough experiments on both synthetic and real world data support our theory.", "full_text": "Semiparametric Differential Graph Models\n\nPan Xu\n\nUniversity of Virginia\npx3ds@virginia.edu\n\nQuanquan Gu\n\nUniversity of Virginia\nqg5w@virginia.edu\n\nAbstract\n\nIn many cases of network analysis, it is more attractive to study how a network\nvaries under different conditions than an individual static network. We propose\na novel graphical model, namely Latent Differential Graph Model, where the\nnetworks under two different conditions are represented by two semiparametric\nelliptical distributions respectively, and the variation of these two networks (i.e.,\ndifferential graph) is characterized by the difference between their latent precision\nmatrices. We propose an estimator for the differential graph based on quasi like-\nlihood maximization with nonconvex regularization. 
We show that our estimator\nattains a faster statistical rate in parameter estimation than the state-of-the-art meth-\nods, and enjoys the oracle property under mild conditions. Thorough experiments\non both synthetic and real world data support our theory.\n\n1\n\nIntroduction\n\nNetwork analysis has been widely used in various \ufb01elds to characterize the interdependencies between\na group of variables, such as molecular entities including RNAs and proteins in genetic networks\n[3]. Networks are often modeled as graphical models. For instance, in gene regulatory network,\nthe gene expressions are often assumed to be jointly Gaussian. A Gaussian graphical model [18] is\nthen employed by representing different genes as nodes and the regulation between genes as edges\nin the graph. In particular, two genes are conditionally independent given the others if and only\nif the corresponding entry of the precision matrix of the multivariate normal distribution is zero.\nNevertheless, the Gaussian distribution assumption, is too restrictive in practice. For example, the\ngene expression values from high-throughput method, even after being normalized, do not follow a\nnormal distribution [19, 26]. This leads to the inaccuracy in describing the dependency relationships\namong genes. In order to address this problem, various semiparametric Gaussian graphical models\n[21, 20] are proposed to relax the Gaussian distribution assumption.\nOn the other hand, it is well-known that the interactions in many types of networks can change under\nvarious environmental and experimental conditions [1]. Take the genetic networks for example, two\ngenes may be positively conditionally dependent under some conditions but negatively conditionally\ndependent under others. Therefore, in many cases, more attention is attracted not by a particular\nindividual network but rather by whether and how the network varies with genetic and environmental\nalterations [6, 15]. 
This gives rise to differential networking analysis, which has emerged as an\nimportant method in differential expression analysis of gene regulatory networks [9, 28].\nIn this paper, in order to conduct differential network analysis, we propose a Latent Differential Graph\nModel (LDGM), where the networks under two different conditions are represented by two transellip-\ntical distributions [20], i.e., T Ed(\u2303\u21e4X,\u21e0 ; f1, . . . , fd) and T Ed(\u2303\u21e4Y ,\u21e0 ; g1, . . . , gd) respectively. Here\nT Ed(\u2303\u21e4X,\u21e0 ; f1, . . . , fd) denotes a d-dimensional transelliptical distribution with latent correlation\nmatrix \u2303\u21e4X 2 Rd\u21e5d, and will be de\ufb01ned in detail in Section 3. More speci\ufb01cally, the connectivity\nof the individual network is encoded by the latent precision matrix (e.g., \u21e5\u21e4X = (\u2303\u21e4X)1) of the\ncorresponding transelliptical distribution, such that [\u21e5\u21e4X]jk 6= 0 if and only if there is an edge\nbetween the j-th node and the k-th node in the network. And the differential graph is de\ufb01ned as\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fthe difference between the two latent precision matrices \u21e4 = \u21e5\u21e4Y  \u21e5\u21e4X. Our goal is to estimate\n\u21e4 based on observations sampled from T Ed(\u2303\u21e4X,\u21e0 ; f1, . . . , fd) and T Ed(\u2303\u21e4Y ,\u21e0 ; g1, . . . , gd). A\nsimple procedure is estimating \u21e5\u21e4X and \u21e5\u21e4Y separately, followed by calculating their difference.\nHowever, it requires estimating 2d2 parameters (i.e., \u21e5\u21e4X and \u21e5\u21e4Y ), while our ultimate goal is only\nestimating d2 parameters (i.e., \u21e4). In order to overcome this problem, we assume that the difference\nof the two latent precision matrices, i.e., \u21e4 is sparse and propose to directly estimate it by quasi\nlikelihood maximization with nonconvex penalty. 
The nonconvex penalty is introduced in order to correct the intrinsic estimation bias incurred by convex penalties [10, 36]. We prove that, when the true differential graph is s-sparse, our estimator attains an O(\sqrt{s_1/n} + \sqrt{s_2 \log d/n}) convergence rate in terms of Frobenius norm, which is faster than the estimation error bound O(\sqrt{s \log d/n}) of the \ell_{1,1}-penalty based estimator in [38]. Here n is the sample size, s_1 is the number of entries in \Delta^* with large magnitude, s_2 is the number of entries with small magnitude, and s = s_1 + s_2. We show that our method enjoys the oracle property under a very mild condition. Thorough numerical experiments on both synthetic and real-world data back up our theory.\nThe remainder of this paper is organized as follows: we review the related work in Section 2. We introduce the proposed model and the nonconvex penalty in Section 3, as well as the proposed estimator. In Section 4, we present our main theory for estimation in semiparametric differential graph models. Experiments on both synthetic and real world data are provided in Section 5. Section 6 concludes with discussion.\nNotation For x = (x_1, . . . , x_d)^\top \in R^d and 0 < q < \infty, we define the \ell_0, \ell_q and \ell_\infty vector norms as\n\n\|x\|_0 = \sum_{i=1}^d 1(x_i \neq 0), \quad \|x\|_q = \Big(\sum_{i=1}^d |x_i|^q\Big)^{1/q}, \quad \|x\|_\infty = \max_{1 \le i \le d} |x_i|,\n\nwhere 1(\cdot) is the indicator function. For A = (A_{ij}) \in R^{d \times d}, we define the matrix \ell_{0,0}, \ell_{1,1}, \ell_{\infty,\infty} and Frobenius norms as\n\n\|A\|_{0,0} = \sum_{i,j=1}^d 1(A_{ij} \neq 0), \quad \|A\|_{1,1} = \sum_{i,j=1}^d |A_{ij}|, \quad \|A\|_{\infty,\infty} = \max_{1 \le i,j \le d} |A_{ij}|, \quad \|A\|_F = \sqrt{\sum_{i,j} |A_{ij}|^2}.\n\nThe induced norm for a matrix is defined as \|A\|_q = \max_{\|x\|_q = 1} \|Ax\|_q, for 0 < q < \infty. For a set of tuples S, A_S denotes the set of numbers [A_{(jk)}]_{(jk) \in S}, and vec(S) is the vectorized index set of S.\n\n2 Related Work\nThere exist several lines of research for differential network analysis. 
One natural procedure is to\nestimate the two networks (i.e., two precision matrices) respectively by existing estimators such as\ngraphical Lasso [12] and node-wise regression [25]. Another family of methods jointly estimates\nthe two networks by assuming that they share common structural patterns and therefore uses joint\nlikelihood maximization with group lasso penalty or group bridge penalty [7, 8, 14]. Based on the\nestimated precision matrices, the differential graph can be obtained by calculating their difference.\nHowever, both of these two types of methods suffer from the drawback that they need to estimate\ntwice the number of parameters, and hence require roughly doubled observations to ensure the\nestimation accuracy. In order to address this drawback, some methods are proposed to estimate the\ndifference of matrices directly [38, 35, 22, 11]. For example, [38] proposed a Dantzig selector type\nestimator for estimating the difference of the precision matrices directly. [35] proposed a D-Trace\nloss [37] based estimator for the difference of the precision matrices. Compared with [38, 35], our\nestimator is advantageous in the following aspects: (1) our model relaxes the Gaussian assumption by\nrepresenting each network as a transelliptical distribution, while [38, 35] are restricted to Gaussian\ndistribution. Thus, our model is more general and robust; and (2) by employing nonconvex penalty,\nour estimator achieves a sharper statistical rate than theirs. Rather than the Gaussian graphical model\nor its semiparametric extension, [22, 11] studied the estimation of change in the dependency structure\nbetween two high dimensional Ising models.\n3 Semiparametric Differential Graph Models\nIn this section, we will \ufb01rst review the transelliptical distribution and present our semiparametric\ndifferential graph model. 
Then we will present the estimator for differential graph, followed by the\nintroduction to nonconvex penalty.\n3.1 Transelliptical Distribution\nTo brie\ufb02y review the transelliptical distribution, we begin with the de\ufb01nition of elliptical distribution.\n\n2\n\n\fDe\ufb01nition 3.1 (Elliptical distribution). Let \u00b5 2 Rd and \u2303\u21e4 2 Rd\u21e5d with rank(\u2303\u21e4) = q \uf8ff d. A\nrandom vector X 2 Rd follows an elliptical distribution, denoted by ECd(\u00b5, \u2303\u21e4,\u21e0 ), if it can be\nrepresented as X = \u00b5 + \u21e0AU, where A is a deterministic matrix satisfying A>A = \u2303\u21e4, U is a\nrandom vector uniformly distributed on the unit sphere in Rq, and \u21e0 ? U is a random variable.\nMotivated by the extension from Gaussian distribution to nonparanormal distribution [21], [20] pro-\nposed a semiparametric extension of elliptical distribution, which is called transelliptical distribution.\nDe\ufb01nition 3.2 (Transelliptical distribution). A random vector X = (X1, X2, . . . , Xd)> 2 Rd\nis transelliptical, denoted by T Ed(\u2303\u21e4,\u21e0 ; f1, . . . , fd), if there exists a set of monotone univariate\nfunctions f1, . . . , fd and a nonnegative random variable \u21e0, such that (f1(X1), . . . , fd(Xd))> follows\nan elliptical distribution ECd(0, \u2303\u21e4,\u21e0 ).\n\n3.2 Kendall\u2019s tau Statistic\nIn semiparametric setting, the Pearson\u2019s sample covariance matrix can be inconsistent in esti-\nmating \u2303\u21e4. Given n independent observations X1, ..., Xn, where Xi = (Xi1, ..., Xid)> \u21e0\nT Ed(\u2303\u21e4,\u21e0 ; f1, . . . , fd), [20] proposed a rank-based estimator, the Kendall\u2019s tau statistic, to es-\ntimate \u2303\u21e4, due to its invariance under monotonic marginal transformations. 
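As a concrete illustration of Definitions 3.1 and 3.2 above, one can sample from an elliptical distribution via the stochastic representation X = \mu + \xi A U, and from a transelliptical distribution by pushing the latent elliptical coordinates through the inverse marginal transformations. The following is a minimal sketch (not from the paper); the function names are ours, and the choice \xi \sim \chi_d, which makes the latent vector exactly Gaussian, is one convenient special case:

```python
import numpy as np

def sample_elliptical(n, mu, Sigma, xi_sampler, rng):
    """Definition 3.1: X = mu + xi * A U, with A^T A = Sigma and U uniform on the sphere."""
    d = len(mu)
    A = np.linalg.cholesky(Sigma).T                 # A^T A = Sigma
    U = rng.standard_normal((n, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # rows uniform on the unit sphere
    xi = xi_sampler(n, rng)                         # nonnegative radial variable, xi independent of U
    return mu + xi[:, None] * (U @ A)

def sample_transelliptical(n, Sigma, f_inverses, rng):
    """Definition 3.2: (f_1(X_1), ..., f_d(X_d)) is elliptical, so we return X_j = f_j^{-1}(Z_j)."""
    d = Sigma.shape[0]
    # xi ~ chi_d recovers a latent N(0, Sigma) vector (the nonparanormal special case)
    chi_d = lambda m, g: np.sqrt(g.chisquare(df=d, size=m))
    Z = sample_elliptical(n, np.zeros(d), Sigma, chi_d, rng)
    return np.column_stack([f_inv(Z[:, j]) for j, f_inv in enumerate(f_inverses)])
```

Because the marginal transformations are monotone, rank-based quantities computed from X coincide with those of the latent elliptical vector, which is exactly why the Kendall's tau statistic below is a natural estimator here.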
The Kendall's tau estimator is defined as\n\n\widehat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sign}\big[(X_{ij} - X_{i'j})(X_{ik} - X_{i'k})\big]. \quad (3.1)\n\nIt has been shown that \widehat{\tau}_{jk} is an unbiased estimator of \tau_{jk} = 2/\pi \arcsin(\Sigma^*_{jk}) [20], and the correlation matrix \Sigma^* can be estimated by \widehat{\Sigma} = [\widehat{\Sigma}_{jk}] \in R^{d \times d}, where\n\n\widehat{\Sigma}_{jk} = \sin\big(\frac{\pi}{2} \widehat{\tau}_{jk}\big). \quad (3.2)\n\nWe use T^* to denote the matrix with entries \tau_{jk} and \widehat{T} the matrix with entries \widehat{\tau}_{jk}, for j, k = 1, . . . , d.\n\n3.3 Latent Differential Graph Models and the Estimator\nNow we are ready to formulate our differential graph model. Assume that d-dimensional random vectors X and Y satisfy X \sim TE_d(\Sigma^*_X, \xi; f_1, . . . , f_d) and Y \sim TE_d(\Sigma^*_Y, \xi; g_1, . . . , g_d). The differential graph is defined to be the difference of the two latent precision matrices,\n\n\Delta^* = \Theta^*_Y - \Theta^*_X, \quad (3.3)\n\nwhere \Theta^*_X = (\Sigma^*_X)^{-1} and \Theta^*_Y = (\Sigma^*_Y)^{-1}. It immediately implies\n\n\Sigma^*_X \Delta^* \Sigma^*_Y - (\Sigma^*_X - \Sigma^*_Y) = 0, \quad \text{and} \quad \Sigma^*_Y \Delta^* \Sigma^*_X - (\Sigma^*_X - \Sigma^*_Y) = 0. \quad (3.4)\n\nGiven i.i.d. copies X_1, . . . , X_{n_X} of X, and i.i.d. copies Y_1, . . . , Y_{n_Y} of Y, without loss of generality we assume n_X = n_Y = n, and we denote the Kendall's tau correlation matrices defined in (3.2) by \widehat{\Sigma}_X and \widehat{\Sigma}_Y. Following (3.4), a reasonable procedure for estimating \Delta^* is to solve the following equation for \Delta:\n\n\frac{1}{2} \widehat{\Sigma}_X \Delta \widehat{\Sigma}_Y + \frac{1}{2} \widehat{\Sigma}_Y \Delta \widehat{\Sigma}_X - (\widehat{\Sigma}_X - \widehat{\Sigma}_Y) = 0, \quad (3.5)\n\nwhere we add up the two equations in (3.4) and replace the latent population correlation matrices \Sigma^*_X, \Sigma^*_Y with the Kendall's tau estimators \widehat{\Sigma}_X, \widehat{\Sigma}_Y. 
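The rank-based correlation estimate in (3.1) and (3.2) can be sketched directly; the following O(n^2 d^2) implementation favors clarity over speed, and the function names are ours:

```python
import numpy as np

def kendall_tau_matrix(X):
    """Pairwise Kendall's tau statistics (3.1), averaged over all n(n-1)/2 sample pairs."""
    n, d = X.shape
    iu = np.triu_indices(n, k=1)
    # sign(X_ij - X_i'j) for every pair i < i', one variable at a time
    D = [np.sign(X[:, None, j] - X[None, :, j])[iu] for j in range(d)]
    T = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            # 2/(n(n-1)) * sum of sign products == mean over the pairs
            T[j, k] = T[k, j] = (D[j] * D[k]).mean()
    return T

def latent_correlation(X):
    """Plug-in estimate (3.2) of the latent correlation: sin(pi/2 * tau_hat_jk)."""
    S = np.sin(np.pi / 2 * kendall_tau_matrix(X))
    np.fill_diagonal(S, 1.0)
    return S
```

Since Kendall's tau depends only on the signs of pairwise differences, applying any strictly monotone transformation to a column leaves the estimate unchanged, which is the invariance property exploited in the semiparametric setting.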
Note that (3.5) is a Z-estimator [30], which can be translated into an M-estimator by noticing that 1/2 \widehat{\Sigma}_X \Delta \widehat{\Sigma}_Y + 1/2 \widehat{\Sigma}_Y \Delta \widehat{\Sigma}_X - (\widehat{\Sigma}_X - \widehat{\Sigma}_Y) can be seen as the score function of the following quasi log-likelihood function\n\n\ell(\Delta) = \frac{1}{2} \mathrm{tr}(\widehat{\Sigma}_Y \Delta \widehat{\Sigma}_X \Delta) - \mathrm{tr}\big(\Delta(\widehat{\Sigma}_X - \widehat{\Sigma}_Y)\big). \quad (3.6)\n\nLet S = supp(\Delta^*). In this paper, we assume that \Delta^* is sparse, i.e., |S| \le s with s > 0. Based on (3.6), we propose to estimate \Delta^* by the following M-estimator with nonconvex penalty\n\n\widehat{\Delta} = \mathrm{argmin}_{\Delta \in R^{d \times d}} \frac{1}{2} \mathrm{tr}(\widehat{\Sigma}_Y \Delta \widehat{\Sigma}_X \Delta) - \mathrm{tr}\big(\Delta(\widehat{\Sigma}_X - \widehat{\Sigma}_Y)\big) + G_\lambda(\Delta), \quad (3.7)\n\nwhere \lambda > 0 is a regularization parameter and G_\lambda is a decomposable nonconvex penalty function, i.e., G_\lambda(\Delta) = \sum_{j,k=1}^d g_\lambda(\Delta_{jk}), such as the smoothly clipped absolute deviation (SCAD) penalty [10] or the minimax concave penalty (MCP) [36]. The key property of the nonconvex penalty is that it can avoid over-penalization when the magnitude is very large. It has been shown in [10, 36, 33] that the nonconvex penalty is able to alleviate the estimation bias and attain a refined statistical rate of convergence. The nonconvex penalty g_\lambda(\beta) can be further decomposed as the sum of the \ell_1 penalty and a concave component h_\lambda(\beta), i.e., g_\lambda(\beta) = \lambda|\beta| + h_\lambda(\beta). Take the MCP penalty for example, where \lambda > 0 is the regularization parameter and b > 0 is a fixed parameter. The corresponding g_\lambda(\beta) and h_\lambda(\beta) are defined as\n\ng_\lambda(\beta) = \lambda \int_0^{|\beta|} \Big(1 - \frac{z}{\lambda b}\Big)_+ dz, \quad \text{for any } \beta \in R, \quad \text{and}\n\nh_\lambda(\beta) = -\frac{\beta^2}{2b} 1(|\beta| \le b\lambda) + \Big(\frac{b\lambda^2}{2} - \lambda|\beta|\Big) 1(|\beta| > b\lambda).\n\nIn Section 4, we will show that the above family of nonconvex penalties satisfies certain common regularity conditions on g_\lambda(\beta) as well as its concave component h_\lambda(\beta).\nWe will show in the next section that when the parameters of the nonconvex penalty are appropriately chosen, (3.7) is an unconstrained convex optimization problem. 
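For concreteness, the MCP penalty, its decomposition g_\lambda(\beta) = \lambda|\beta| + h_\lambda(\beta), and the score of the quasi log-likelihood (3.6) can be written down directly. This is a sketch under the definitions above (the function names are ours):

```python
import numpy as np

def mcp(beta, lam, b):
    """Closed form of g_lambda: lam*|beta| - beta^2/(2b) on |beta| <= b*lam, else b*lam^2/2."""
    a = np.abs(beta)
    return np.where(a <= b * lam, lam * a - a ** 2 / (2 * b), b * lam ** 2 / 2)

def mcp_concave_part(beta, lam, b):
    """The concave component h_lambda = g_lambda - lam*|beta|."""
    a = np.abs(beta)
    return np.where(a <= b * lam, -a ** 2 / (2 * b), b * lam ** 2 / 2 - lam * a)

def quasi_likelihood_grad(Delta, SX, SY):
    """Score of the quasi log-likelihood (3.6); setting it to zero gives equation (3.5)."""
    return 0.5 * (SX @ Delta @ SY + SY @ Delta @ SX) - (SX - SY)
```

Note that g_\lambda flattens at the constant value b\lambda^2/2 once |\beta| \ge b\lambda, so its derivative vanishes for large magnitudes; this is the over-penalization avoidance described above, and it is what a proximal gradient scheme on (3.7) exploits.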
Thus it can be solved by proximal gradient descent [4] very efficiently. In addition, it is easy to check that the estimator \widehat{\Delta} from (3.7) is symmetric, so it does not need the symmetrizing process adopted in [38], which can undermine the estimation accuracy.\n4 Main Theory\nIn this section, we present our main theory. Let S = supp(\Delta^*) be the support of the true differential graph. We introduce the following oracle estimator of \Delta^*:\n\n\widehat{\Delta}_O = \mathrm{argmin}_{supp(\Delta) \subseteq S} \ell(\Delta), \quad (4.1)\n\nwhere \ell(\Delta) = 1/2 \mathrm{tr}(\widehat{\Sigma}_Y \Delta \widehat{\Sigma}_X \Delta) - \mathrm{tr}(\Delta(\widehat{\Sigma}_X - \widehat{\Sigma}_Y)). The oracle estimator \widehat{\Delta}_O is not a practical estimator, since we do not know the true support in practice. An estimator is said to have the oracle property if it is identical to the oracle estimator \widehat{\Delta}_O under certain conditions. We will show that our estimator enjoys the oracle property under a mild condition.\nWe first lay out some assumptions that are required throughout our analysis.\nAssumption 4.1. There exist constants \kappa_1, \kappa_2 > 0 such that \kappa_1 \le \lambda_{min}(\Sigma^*_X) \le \lambda_{max}(\Sigma^*_X) \le 1/\kappa_1 and \kappa_2 \le \lambda_{min}(\Sigma^*_Y) \le \lambda_{max}(\Sigma^*_Y) \le 1/\kappa_2. The true covariance matrices have bounded \ell_\infty norm, i.e., \|\Sigma^*_X\|_\infty \le \sigma_X, \|\Sigma^*_Y\|_\infty \le \sigma_Y, where \sigma_X, \sigma_Y > 0 are constants. And the true precision matrices have bounded matrix \ell_1 norm, i.e., \|\Theta^*_X\|_1 \le \theta_X and \|\Theta^*_Y\|_1 \le \theta_Y, where \theta_X, \theta_Y > 0 are constants.\nThe first part of Assumption 4.1 requires that the smallest eigenvalues of the correlation matrices \Sigma^*_X, \Sigma^*_Y are bounded below from zero, and their largest eigenvalues are finite. This assumption is commonly imposed in the literature for the analysis of graphical models [21, 27].\nAssumption 4.2. 
The true difference matrix \Delta^* = (\Sigma^*_Y)^{-1} - (\Sigma^*_X)^{-1} has s nonzero entries, i.e., \|\Delta^*\|_{0,0} \le s, and has bounded \ell_{1,1} norm, i.e., \|\Delta^*\|_{1,1} \le M, where M > 0 does not depend on d.\nAssumption 4.2 requires the differential graph to be sparse. This is reasonable in differential network analysis, where the networks only vary slightly under different conditions.\nThe next assumption collects regularity conditions on the nonconvex penalty g_\lambda(\beta). Recall that g_\lambda(\beta) can be written as g_\lambda(\beta) = \lambda|\beta| + h_\lambda(\beta).\nAssumption 4.3. g_\lambda(\beta) and its concave component h_\lambda(\beta) satisfy:\n\n(a) There exists a constant \nu such that g'_\lambda(\beta) = 0 for |\beta| \ge \nu > 0.\n(b) There exists a constant \zeta \ge 0 such that h_\lambda(\beta) + \zeta/2 \cdot \beta^2 is convex.\n(c) h_\lambda(\beta) and h'_\lambda(\beta) pass through the origin, i.e., h_\lambda(0) = h'_\lambda(0) = 0.\n(d) h'_\lambda(\beta) is bounded, i.e., |h'_\lambda(\beta)| \le \lambda for any \beta.\n\nSimilar assumptions have been made in [23, 33]. Note that condition (b) in Assumption 4.3 is weaker than the smoothness condition in [33], since here it does not require h_\lambda(\beta) to be twice differentiable. Assumption 4.3 holds for a variety of nonconvex penalty functions including MCP and SCAD. In particular, the MCP penalty satisfies Assumption 4.3 with \nu = b\lambda and \zeta = 1/b. Furthermore, according to condition (b), if \zeta is smaller than the modulus of restricted strong convexity of \ell(\Delta), then (3.7) becomes a convex optimization problem even though G_\lambda(\Delta) is nonconvex. For MCP, this can be achieved by choosing a sufficiently large b such that \zeta is small enough.\nNow we are ready to present our main theory. We first show that under a large magnitude condition on the nonzero entries of the true differential graph \Delta^*, our estimator attains a faster convergence rate, which matches the minimax rate in the classical regime.\nTheorem 4.4. 
Suppose Assumptions 4.1 and 4.2 hold, and the nonconvex penalty G_\lambda(\Delta) satisfies the conditions in Assumption 4.3. If the nonzero entries of \Delta^* satisfy min_{(j,k) \in S} |\Delta^*_{jk}| \ge \nu + C \theta_X^2 \theta_Y^2 \sigma_X \sigma_Y M \sqrt{\log s / n}, then for the estimator \widehat{\Delta} in (3.7) with regularization parameter \lambda = 2CM \sqrt{\log d / n} and \zeta \le \kappa_1 \kappa_2 / 2, we have that\n\n\|\widehat{\Delta} - \Delta^*\|_{\infty,\infty} \le 2\sqrt{10\pi} \theta_X^2 \theta_Y^2 \sigma_X \sigma_Y M \sqrt{\log s / n}\n\nholds with probability at least 1 - 2/s. Furthermore, we have that\n\n\|\widehat{\Delta} - \Delta^*\|_F \le \frac{C_1 M}{\kappa_1 \kappa_2} \sqrt{\frac{s}{n}}\n\nholds with probability at least 1 - 3/s, where C_1 is an absolute constant.\nRemark 4.5. Theorem 4.4 suggests that under the large magnitude assumption, the statistical rate of our estimator is O(\sqrt{s/n}) in terms of Frobenius norm. This is faster than the rate O(\sqrt{s \log d / n}) in [38], which matches the minimax lower bound for sparse differential graph estimation. Note that our faster rate is not contradictory to the minimax lower bound, because we restrict ourselves to a smaller class of differential graphs, where the magnitude of the nonzero entries is sufficiently large.\nWe further show that our estimator achieves the oracle property under mild conditions.\nTheorem 4.6. Under the same conditions as Theorem 4.4, for the estimator \widehat{\Delta} in (3.7) and the oracle estimator \widehat{\Delta}_O in (4.1), we have with probability at least 1 - 3/s that \widehat{\Delta} = \widehat{\Delta}_O, which further implies supp(\widehat{\Delta}) = supp(\widehat{\Delta}_O) = supp(\Delta^*).\nTheorem 4.6 suggests that our estimator is identical to the oracle estimator in (4.1) with high probability, when the nonzero entries in \Delta^* satisfy min_{(j,k) \in S} |\Delta^*_{jk}| \ge \nu + C \theta_X^2 \theta_Y^2 \sigma_X \sigma_Y M \sqrt{\log s / n}. This condition is optimal up to the logarithmic factor \sqrt{\log s}.\nNow we turn to the general case when the nonzero entries of \Delta^* have both large and small magnitudes. Define S^c = \{(j, k) : j, k = 1, . . . , d\} \setminus S, S_1 = \{(j, k) \in S : |\Delta^*_{jk}| > \nu\}, and S_2 = \{(j, k) \in S : |\Delta^*_{jk}| \le \nu\}. 
Denote |S_1| = s_1 and |S_2| = s_2. Clearly, we have s = s_1 + s_2.\nTheorem 4.7. Suppose Assumptions 4.1 and 4.2 hold, and the nonconvex penalty G_\lambda(\Delta) satisfies the conditions in Assumption 4.3. For the estimator in (3.7) with regularization parameter \lambda = 2CM \sqrt{\log d / n} and \zeta \le \kappa_1 \kappa_2 / 4, we have that\n\n\|\widehat{\Delta} - \Delta^*\|_F \le \frac{16\sqrt{3\pi} M}{\kappa_1 \kappa_2} \sqrt{\frac{s_1}{n}} + \frac{10\pi M C}{\kappa_1 \kappa_2} \sqrt{\frac{s_2 \log d}{n}}\n\nholds with probability at least 1 - 3/s_1, where C is an absolute constant.\nRemark 4.8. Theorem 4.7 indicates that when the large magnitude condition does not hold, our estimator is still able to attain a faster rate. Specifically, for those nonzero entries of \Delta^* with large magnitude, the estimation error bound in terms of Frobenius norm is O(\sqrt{s_1/n}), which is the same as the bound in Theorem 4.4. For those nonzero entries of \Delta^* with small magnitude, the estimation error is O(\sqrt{s_2 \log d / n}), which matches the convergence rate in [38]. Overall, our estimator attains a refined rate of convergence O(\sqrt{s_1/n} + \sqrt{s_2 \log d / n}), which is faster than that of [38]. In particular, if s_2 = 0, the refined convergence rate in Theorem 4.7 reduces to the faster rate in Theorem 4.4.\n\n5 Experiments\nIn this section, we test our method on both synthetic and real world data. We conducted experiments for our estimator using both the SCAD and MCP penalties. We did not find any significant difference in the results, and thus we only report the results of our estimator with the MCP penalty. To choose the tuning parameters \lambda and b, we adopt 5-fold cross-validation. Denoting our estimator with the MCP penalty by LDGM-MCP, we compare it with the following methods: (1) SepGlasso: estimating the latent precision matrices separately using graphical Lasso and Kendall's tau correlation matrices [20], followed by calculating their difference; (2) DPM: directly estimating the differential precision matrix [38]. 
In addition, we also test the differential graph model with \ell_{1,1} penalty, denoted LDGM-L1. Note that LDGM-L1 is a special case of our method, since the \ell_{1,1} norm penalty is a special case of the MCP penalty when b = \infty. The LDGM-MCP and LDGM-L1 estimators are obtained by the proximal gradient descent algorithm [4]. The implementation of the DPM estimator is obtained from the authors' website, and the SepGlasso estimator is implemented with graphical Lasso.\n\n5.1 Simulations\nWe first show the results on synthetic data. Since the transelliptical distribution includes the Gaussian distribution, it is natural to show that our approach also works well for the latter. We consider the dimension settings n = 100, d = 100 and n = 200, d = 400 respectively. Specifically, data are generated as follows: (1) For the Gaussian distribution, we generate data \{X_i\}_{i=1}^n \sim N(0, \Sigma^*_X) and \{Y_i\}_{i=1}^n \sim N(0, \Sigma^*_Y), with precision matrices (\Sigma^*_X)^{-1} and (\Sigma^*_Y)^{-1} generated by the huge package^1. (2) For the transelliptical distribution, we consider the following generating scheme: \{X_i\}_{i=1}^n \sim TE_d(\Sigma^*_X, \xi; f_1, . . . , f_d) and \{Y_i\}_{i=1}^n \sim TE_d(\Sigma^*_Y, \xi; g_1, . . . , g_d), where \xi \sim \chi_d, f_1^{-1}(\cdot) = . . . = f_d^{-1}(\cdot) = sign(\cdot)|\cdot|^3 and g_1^{-1}(\cdot) = . . . = g_d^{-1}(\cdot) = sign(\cdot)|\cdot|^{1/2}. The latent precision matrices (\Sigma^*_X)^{-1} and (\Sigma^*_Y)^{-1} are generated in the same way as for the Gaussian data. For both Gaussian and transelliptical differential graph models, we consider two settings for the individual graph structures: (1) both (\Sigma^*_X)^{-1} and (\Sigma^*_Y)^{-1} have \"random\" structures; (2) (\Sigma^*_X)^{-1} has a \"band\" structure, and (\Sigma^*_Y)^{-1} has a \"random\" structure.\n\nGiven an estimator \widehat{\Delta}, we define the true positive and true negative rates of \widehat{\Delta} as\n\nTP = \frac{\sum_{j,k=1}^d 1(\widehat{\Delta}_{jk} \neq 0 \text{ and } \Delta^*_{jk} \neq 0)}{\sum_{j,k=1}^d 1(\Delta^*_{jk} \neq 0)}, \quad TN = \frac{\sum_{j,k=1}^d 1(\widehat{\Delta}_{jk} = 0 \text{ and } \Delta^*_{jk} = 0)}{\sum_{j,k=1}^d 1(\Delta^*_{jk} = 0)}.\n\nThe receiver operating characteristic (ROC) curves for the transelliptical differential graph models are shown in Figure 1, which reports the performance of the different methods on support recovery. The ROC curves were plotted by averaging the results over 10 repetitions. From Figure 1 we can see that our estimator (LDGM-MCP) outperforms the other methods in all settings. In addition, LDGM-L1, as a special case of our estimator, also performs better than DPM and SepGlasso, although it is inferior to LDGM-MCP, because the MCP penalty can correct the bias in the estimation and achieve a faster rate of convergence. Note that SepGlasso's performance is poor, since it depends heavily on the sparsity of both individual graphs. When n > 100, the DPM method failed to output a solution within one day, and thus no result is presented; this computational burden is also noted in their paper. We use the Frobenius norm \|\widehat{\Delta} - \Delta^*\|_F and infinity norm \|\widehat{\Delta} - \Delta^*\|_{\infty,\infty} of the estimation errors to evaluate the performance of the different methods in estimation. The results averaged over 10 replicates for the transelliptical differential graph are summarized in Tables 1 and 2 respectively. Our estimator also achieves smaller error than the other baselines in all settings. 
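The support-recovery rates defined above are straightforward to compute; a minimal sketch (the function name is ours):

```python
import numpy as np

def support_rates(Delta_hat, Delta_star):
    """True positive / true negative rates of the estimated support of the differential graph."""
    est = Delta_hat != 0
    true = Delta_star != 0
    tp = (est & true).sum() / true.sum()        # recovered edges over true edges
    tn = (~est & ~true).sum() / (~true).sum()   # recovered zeros over true zeros
    return tp, tn
```

Sweeping the regularization parameter \lambda and plotting TP against 1 - TN traces out ROC curves of the kind reported in Figure 1.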
Due to the space limit, we defer the experiment results for the Gaussian differential graph model to the appendix.\n\n^1 Available on http://cran.r-project.org/web/packages/huge\n\n[Figure 1: ROC curves (TP versus 1 - TN) for the four methods in four panels: (a) Setting 1: n=100, d=100; (b) Setting 2: n=100, d=100; (c) Setting 1: n=200, d=400; (d) Setting 2: n=200, d=400.]\n\nFigure 1: ROC curves for transelliptical differential graph models of all the 4 methods. There are two settings of graph structure. Note that DPM is not scalable to d = 400.\n\nTable 1: Comparisons of estimation errors in Frobenius norm \|\widehat{\Delta} - \Delta^*\|_F for transelliptical differential graph models. N/A means the algorithm did not output a solution within one day.\n\nMethods | n=100, d=100, Setting 1 | n=100, d=100, Setting 2 | n=200, d=400, Setting 1 | n=200, d=400, Setting 2\nSepGlasso | 13.5730±0.6376 | 25.6664±0.6967 | 22.1760±0.3839 | 39.9847±0.1856\nDPM | 12.7219±0.3704 | 23.0548±0.2669 | N/A | N/A\nLDGM-L1 | 12.0738±0.4955 | 22.3748±0.6643 | 20.6537±0.3778 | 31.7630±0.0715\nLDGM-MCP | 11.2831±0.3919 | 19.6154±0.5106 | 20.1071±0.4303 | 28.8676±0.1425\n\nTable 2: Comparisons of estimation errors in infinity norm \|\widehat{\Delta} - \Delta^*\|_{\infty,\infty} for transelliptical differential graph models. N/A means the algorithm did not output a solution within one day.\n\nMethods | n=100, d=100, Setting 1 | n=100, d=100, Setting 2 | n=200, d=400, Setting 1 | n=200, d=400, Setting 2\nSepGlasso | 2.7483±0.0575 | 8.0522±0.1423 | 2.1409±0.0906 | 6.0108±0.1925\nDPM | 2.3138±0.0681 | 6.3250±0.0560 | N/A | N/A\nLDGM-L1 | 2.2193±0.0850 | 6.0716±0.1150 | 1.8876±0.0907 | 5.1858±0.0218\nLDGM-MCP | 1.7010±0.0149 | 4.6522±0.1337 | 1.7339±0.0061 | 4.0133±0.0521\n\n5.2 Experiments on Real World Data\nWe applied our approach to the same gene expression data used in [38], which were collected from patients with stage III or IV ovarian cancer. [29] identified six molecular subtypes of ovarian cancer in these data, labeled C1 through C6. In particular, the C1 subtype was found to have much shorter survival times, and was characterized by differential expression of genes associated with stromal and immune cell types. In this experiment, we investigated whether the C1 subtype is also associated with genetic differential networks. The subjects were divided into two groups: Group 1, with n1 = 78 patients, containing the C1 subtype, and Group 2, with n2 = 113 patients, containing the C2 through C6 subtypes. We analyzed two pathways from the KEGG pathway database [16, 17] respectively. In each pathway, we applied the different methods to determine whether there is any difference in the conditional dependency relationships of the gene expression levels between the aforementioned Group 1 and Group 2. Two genes are connected in the differential network if their conditional dependency relationship given the others changed in either magnitude or sign. In order to obtain a clear view of the differential graph, we only plotted genes whose conditional dependency with others changed between the two groups. 
To interpret the results, the genes associated with more edges in the differential networks were considered to be more important.\nFigure 2 shows the estimation results for the differential graph of the TGF-\beta pathway, where the number of genes d = 80 is greater than n1, the sample size of Group 1. LDGM-MCP identified two important genes, COMP and THBS2, both of which have been suggested to be related to resistance to platinum-based chemotherapy in epithelial ovarian cancer by [24]. LDGM-L1 suggested that COMP\n\n[Figure 2: estimated differential networks over genes in the TGF-\beta pathway (nodes include COMP, THBS1, THBS2, DCN, INHBA, BMP4, BMP7, BMPR1B, SMAD7, PITX2, CDKN2B, and ID1-ID4); panels: (a) SepGlasso, (b) DPM, (c) LDGM-L1, (d) LDGM-MCP.]\n\nFigure 2: Estimates of the differential networks between Group 1 and Group 2. Dataset: KEGG 04350, TGF-\beta pathway.\n\n[Figure 3: estimated differential networks over genes in the Apoptosis pathway (nodes include TNFSF10, BIRC3, FAS, ENDOG, AIFM1, PIK3R1, PRKAR2B, CSF2RB, IL1B, IL1R1, and TP53); panels: (a) SepGlasso, (b) DPM, (c) LDGM-L1, (d) LDGM-MCP.]\n\nFigure 3: Estimates of the differential networks between Group 1 and Group 2. 
Dataset: KEGG 04210, Apoptosis pathway.\n\nwas important, and DPM also suggested COMP and THBS2. Separate estimation (SepGlasso) gave a relatively dense network, which made it hard to say which genes are more important.\nFigure 3 shows the results for the Apoptosis pathway, where the number of genes d = 87 is also greater than n1. LDGM-MCP indicated that TNFSF10 and BIRC3 were the most important. Indeed, both TNFSF10 and BIRC3 have been widely studied for use as therapeutic targets in cancer [5, 32]. LDGM-L1 and DPM also suggested that TNFSF10 and BIRC3 were important. The results of LDGM-MCP, LDGM-L1 and DPM are comparable. In order to overcome the nonsparsity issue encountered in the TGF-\beta experiment, the SepGlasso estimator was thresholded more aggressively than the other methods. However, it still performed poorly and identified the wrong gene CSF2RB.\n6 Conclusions\nIn this paper, we propose a semiparametric differential graph model and an estimator for the differential graph based on quasi likelihood maximization. We employ a nonconvex penalty in our estimator, which results in a faster rate for parameter estimation than existing methods. We also prove that the proposed estimator achieves the oracle property under a mild condition. Experiments on both synthetic and real world data further support our theory.\nAcknowledgments We would like to thank the anonymous reviewers for their helpful comments. Research was supported by NSF grant III-1618948.\n\nReferences\n[1] BANDYOPADHYAY, S., MEHTA, M., KUO, D., ET AL. (2010). Rewiring of genetic networks in response to DNA damage. Science 330 1385–1389.\n[2] BARBER, R. F. and KOLAR, M. (2015). ROCKET: Robust confidence intervals via Kendall's tau for transelliptical graphical models. arXiv preprint arXiv:1502.07641.\n[3] BASSO, K., MARGOLIN, A. A., STOLOVITZKY, G., KLEIN, U., DALLA-FAVERA, R. and CALIFANO, A. (2005). Reverse engineering of regulatory networks in human B cells. 
Nature Genetics 37 382–390.

[4] BECK, A. and TEBOULLE, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 183–202.

[5] BELLAIL, A. C., QI, L., MULLIGAN, P. ET AL. (2009). TRAIL agonists on clinical trials for cancer therapy: the promises and the challenges. Reviews on Recent Clinical Trials 4 34–41.

[6] CARTER, S. L., BRECHBÜHLER, C. M., GRIFFIN, M. ET AL. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20 2242–2250.

[7] CHIQUET, J., GRANDVALET, Y. and AMBROISE, C. (2011). Inferring multiple graphical structures. Statistics and Computing 21 537–553.

[8] DANAHER, P., WANG, P. and WITTEN, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B 76 373–397.

[9] DE LA FUENTE, A. (2010). From 'differential expression' to 'differential networking': identification of dysfunctional regulatory networks in diseases. Trends in Genetics 26 326–333.

[10] FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.

[11] FAZAYELI, F. and BANERJEE, A. (2016). Generalized direct change estimation in Ising model structure. arXiv preprint arXiv:1606.05302.

[12] FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.

[13] GOLUB, G. H. and VAN LOAN, C. F. (1996). Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA.

[14] GUO, J., LEVINA, E., MICHAILIDIS, G. and ZHU, J. (2011). Joint estimation of multiple graphical models. Biometrika asq060.

[15] HUDSON, N. J., REVERTER, A. and DALRYMPLE, B. P.
(2009). A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Computational Biology 5 e1000382.

[16] KANEHISA, M. and GOTO, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28 27–30.

[17] KANEHISA, M., GOTO, S., SATO, Y., FURUMICHI, M. and TANABE, M. (2011). KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research gkr988.

[18] LAURITZEN, S. L. (1996). Graphical Models. Clarendon Press.

[19] LI, P., PIAO, Y., SHON, H. S. and RYU, K. H. (2015). Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-seq data. BMC Bioinformatics 16 1.

[20] LIU, H., HAN, F. and ZHANG, C.-H. (2012). Transelliptical graphical models. In NIPS.

[21] LIU, H., LAFFERTY, J. and WASSERMAN, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research 10 2295–2328.

[22] LIU, S., SUZUKI, T. and SUGIYAMA, M. (2014). Support consistency of direct sparse-change learning in Markov networks. arXiv preprint arXiv:1407.0581.

[23] LOH, P.-L. and WAINWRIGHT, M. J. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In NIPS.

[24] MARCHINI, S. ET AL. (2013). Resistance to platinum-based chemotherapy is associated with epithelial to mesenchymal transition in epithelial ovarian cancer. European Journal of Cancer 49 520–530.

[25] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 1436–1462.

[26] OSHLACK, A., ROBINSON, M. D., YOUNG, M. D. ET AL. (2010). From RNA-seq reads to differential expression results. Genome Biology 11 220.

[27] RAVIKUMAR, P., WAINWRIGHT, M. J., RASKUTTI, G., YU, B. ET AL. (2011).
High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5 935–980.

[28] TIAN, D., GU, Q. and MA, J. (2016). Identifying gene regulatory network using latent differential graphical models. Nucleic Acids Research 44 e140.

[29] TOTHILL, R. W., TINKER, A. V., GEORGE, J. ET AL. (2008). Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research 14 5198–5208.

[30] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge, UK.

[31] VERSHYNIN, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[32] VUCIC, D. and FAIRBROTHER, W. J. (2007). The inhibitor of apoptosis proteins as therapeutic targets in cancer. Clinical Cancer Research 13 5995–6000.

[33] WANG, Z., LIU, H. and ZHANG, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics 42 2164.

[34] WEGKAMP, M. and ZHAO, Y. (2013). Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas. arXiv preprint arXiv:1305.6526.

[35] YUAN, H., XI, R. and DENG, M. (2015). Differential network analysis via the lasso penalized D-trace loss. arXiv preprint arXiv:1511.09188.

[36] ZHANG, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 894–942.

[37] ZHANG, T. and ZOU, H. (2014). Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika ast059.

[38] ZHAO, S. D., CAI, T. T. and LI, H. (2014). Direct estimation of differential networks. Biometrika 101 253–268.