{"title": "Bayesian Joint Estimation of Multiple Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 9802, "page_last": 9812, "abstract": "In this paper, we propose a novel Bayesian group regularization method based on the spike and slab Lasso priors for jointly estimating multiple graphical models. The proposed method can be used to estimate the common sparsity structure underlying the graphical models while capturing potential heterogeneity of the precision matrices corresponding to those models. Our theoretical results show that the proposed method enjoys the optimal rate of convergence in $\\ell_\\infty$ norm for estimation consistency and has a strong structure recovery guarantee even when the signal strengths over different graphs are heterogeneous. Through simulation studies and an application to the capital bike-sharing network data, we demonstrate the competitive performance of our method compared to existing alternatives.", "full_text": "Bayesian Joint Estimation of\nMultiple Graphical Models\n\nLingrui Gan, Xinming Yang, Naveen N. Narisetty, Feng Liang\n\nDepartment of Statistics\n\nUniversity of Illinois at Urbana-Champaign\n\n{lgan6, xyang104, naveen, liangf}@illinois.edu\n\nAbstract\n\nIn this paper, we propose a novel Bayesian group regularization method based on\nthe spike and slab Lasso priors for jointly estimating multiple graphical models. The\nproposed method can be used to estimate the common sparsity structure underlying\nthe graphical models while capturing potential heterogeneity of the precision\nmatrices corresponding to those models. Our theoretical results show that the\nproposed method enjoys the optimal rate of convergence in $\\ell_\\infty$ norm for estimation\nconsistency and has a strong structure recovery guarantee even when the signal\nstrengths over different graphs are heterogeneous. 
Through simulation studies\nand an application to the capital bike-sharing network data, we demonstrate the\ncompetitive performance of our method compared to existing alternatives.\n\n1 Introduction\n\nGaussian graphical models (GGMs) are widely studied from both the frequentist [30, 9, 3, 18, 28]\nand the Bayesian perspectives [4, 7, 26, 1, 19, 10, 14]. A GGM assumes that a collection of\nvariables jointly follows a multivariate Gaussian distribution with an unknown precision matrix. It is\nwell known that there is a one-to-one correspondence between the sparsity pattern of the precision\nmatrix of a Gaussian distribution and the graph that describes the conditional dependence structure\namong the variables: nonzero off-diagonal entries in the precision matrix correspond to edges in the graph [6].\nDue to this connection, given data from a GGM, we are interested in estimating not only the precision\nmatrix but also its support.\nIn many applications, observations are naturally grouped into different classes. For example, in\nbiological experiments, subjects are classified into categories based on their experimental conditions;\nin social network data, users are grouped by their characteristics; and in gene expression analysis,\nexpression data are classified into different tissues or disease states. In such situations, it is restrictive\nto assume that all observations follow the same graphical model, i.e., have the same precision matrix,\nand it would be more suitable to assume that different classes have different precision matrices.\nMeaningful insights can be extracted more effectively if we utilize the cross-class similarities of the\nprecision matrices and estimate graphs for the multiple classes jointly.\nSeveral approaches have been proposed for jointly estimating multiple GGMs. 
From the penalized\nlikelihood perspective, [12, 5, 13, 17] extended approaches for estimating a single graph to the\nmultiple-graph setting by introducing group-level penalty terms and studied the theoretical properties\nof these approaches. From the Bayesian perspective, Peterson et al. [21] proposed a Markov random\nfield prior on multiple graphs to encourage the selection of common edges in related graphs. Tan\net al. [25] proposed to use a Chung-Lu random graph model as the prior for hierarchical modeling of\nmultiple GGMs. However, theoretical guarantees for these Bayesian methods are not available.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we develop a Bayesian approach for jointly estimating multiple GGMs under the\nassumption that the multiple precision matrices share a common sparsity structure but can\nhave heterogeneous signal magnitudes. We provide theoretical results showing that the maximum\na posteriori (MAP) estimators have the optimal rate of convergence in $\\ell_\\infty$ norm, even under model\nmis-specification when precision matrices for different classes do not share a common sparsity\nstructure. When the multiple GGMs do share a common sparsity structure, the proposed approach is\nproved to be consistent in recovering such a structure, even when some of the within-group signals\nare weaker than the $\\sqrt{(\\log p)/n}$ rate, the minimal signal strength usually required for consistency\nresults on structure recovery.\nThe remaining part of the paper is organized as follows. The Bayesian formulation and parameter\nestimation procedure for our model are provided in Section 2. Theoretical guarantees of our approach\nare presented in Section 3 and empirical studies are provided in Section 4.\n\n2 Method\n\nLet $Y_1, \\ldots, Y_K$ denote the data from $K$ classes, where the $k$-th dataset $Y_k$ consists of $n_k$ observations\n$(Y_{k,1}, \\ldots, Y_{k,n_k})$, each of which is a $p$-dimensional vector. 
Throughout we assume the $p$ variables are\ncommon across the $K$ classes and data from each class follow a $p$-dimensional Gaussian distribution:\n\n$$Y_{k,1}, \\ldots, Y_{k,n_k} \\sim N_p(0, \\Theta_k^{-1}).$$\n\nOur target is to estimate the $K$ precision matrices $\\Theta_k = (\\theta_{k,ij})$, $k \\in \\{1, \\ldots, K\\}$, and recover their\ncommon sparsity structure.\n\n2.1 Bayesian Formulation\n\nFor regularized estimation and sparsity recovery, we shall place prior distributions on the $\\Theta_k$\u2019s by first\nintroducing binary latent variables $\\gamma_{ij}$ which indicate whether nodes $i$ and $j$ have an edge ($\\gamma_{ij} = 1$)\nor not ($\\gamma_{ij} = 0$). A Bernoulli prior on $\\gamma_{ij}$ and a spike and slab prior on the off-diagonal elements of\n$\\Theta_k$ are placed as follows:\n\n$$\\gamma_{ij} \\sim \\mathrm{Bern}(p_1), \\qquad p(\\theta_{k,ij} \\mid \\gamma_{ij}) = \\begin{cases} \\mathrm{LP}(\\theta_{k,ij}; v_1) & \\text{when } \\gamma_{ij} = 1, \\\\ \\mathrm{LP}(\\theta_{k,ij}; v_0) & \\text{when } \\gamma_{ij} = 0, \\end{cases} \\tag{2.1}$$\n\nwhere $v_1 > v_0 > 0$ and $\\mathrm{LP}(\\cdot; v)$ is the Laplace distribution with parameter $v$. For $\\gamma_{ij} = 1$, the $\\theta_{k,ij}$\u2019s\nrepresent signals modeled by a slab distribution with high variance, and for $\\gamma_{ij} = 0$, the $\\theta_{k,ij}$\u2019s represent\nnoise modeled by a spike distribution with mass tightly centered around zero. Integrating over $\\gamma_{ij}$, we\nhave the following multivariate spike and slab prior on the group of $K$ entries $\\theta_{ij} = (\\theta_{1,ij}, \\ldots, \\theta_{K,ij})$:\n\n$$p(\\theta_{ij}) = p_1 \\prod_{k=1}^{K} \\mathrm{LP}(\\theta_{k,ij}; v_1) + (1 - p_1) \\prod_{k=1}^{K} \\mathrm{LP}(\\theta_{k,ij}; v_0). \\tag{2.2}$$\n\nWhen $K = 1$, the distribution above is the one-dimensional spike and slab Lasso prior utilized for\nlinear regression by [23, 24] and for analyzing a single GGM by [10].\nWe impose the aforementioned spike and slab Lasso prior (2.2) on the upper triangular part of the precision\nmatrices, i.e., on $\\theta_{ij}$ for $i < j$, and enforce $\\theta_{ji} = \\theta_{ij}$ to keep $\\Theta_k$ symmetric. 
In addition, we place\nindependent exponential priors on the positive diagonal entries to introduce a small amount of shrinkage. In\nsummary, the Bayesian prior formulation of our model is as follows:\n\n$$\\theta_{k,ii} \\sim \\mathrm{Exp}(\\tau), \\qquad \\theta_{ij} \\sim p_1 \\prod_{k=1}^{K} \\mathrm{LP}(\\theta_{k,ij}; v_1) + (1 - p_1) \\prod_{k=1}^{K} \\mathrm{LP}(\\theta_{k,ij}; v_0), \\tag{2.3}$$\n\nwhere $i < j$ and $k = 1, \\ldots, K$.\n\n2.2 Parameter Estimation\nWe first focus on the point estimation of $\\Theta = (\\Theta_1, \\ldots, \\Theta_K)$ and then discuss the posterior inference\nof the sparsity structure conditional on the point estimator (Section 2.3). Motivated\nby [16, 10], we estimate $\\Theta$ by solving the following optimization problem under the constraint\n$\\Omega = \\{\\Theta : \\Theta_k \\succ 0, \\|\\Theta_k\\|_2 \\le B, k = 1, \\ldots, K\\}$, where $\\Theta_k \\succ 0$ indicates that $\\Theta_k$ is positive definite\nand $\\|\\Theta_k\\|_2$ denotes the spectral norm of $\\Theta_k$. Our constrained MAP estimator is given by\n\n$$\\hat{\\Theta} = \\arg\\min_{\\Theta \\in \\Omega} \\left( -\\log p(Y \\mid \\Theta) + \\alpha\\, \\mathrm{Pen}(\\Theta) \\right), \\quad \\alpha \\ge 1, \\tag{2.4}$$\n\nwhere $\\log p(Y \\mid \\Theta)$ is the log likelihood and $\\mathrm{Pen}(\\Theta)$ is the negative log of the prior on $\\Theta$:\n\n$$\\log p(Y \\mid \\Theta) = \\sum_{k=1}^{K} \\frac{n_k}{2} \\left( \\log\\det(\\Theta_k) - \\mathrm{tr}(S_k \\Theta_k) \\right), \\qquad S_k = \\frac{1}{n_k} \\sum_{i=1}^{n_k} Y_{k,i} Y_{k,i}^T,$$\n\n$$\\mathrm{Pen}(\\Theta) = \\sum_{k=1}^{K} \\sum_{i=1}^{p} \\tau \\theta_{k,ii} + \\sum_{i<j} \\left\\{ -\\log \\left( \\frac{p_1}{(2v_1)^K} e^{-\\|\\theta_{ij}\\|_1 / v_1} + \\frac{1 - p_1}{(2v_0)^K} e^{-\\|\\theta_{ij}\\|_1 / v_0} \\right) \\right\\},$$\n\nwith $S_k$ being the sample covariance matrix of the $k$-th class and $\\|\\theta_{ij}\\|_1$ being the $\\ell_1$ norm of the\nvector $\\theta_{ij}$. 
From the Bayesian viewpoint, our estimator (2.4) is equivalent to the MAP estimator of\nthe posterior $p(\\Theta \\mid Y) \\propto p(Y \\mid \\Theta)\\, p(\\Theta)^{\\alpha}$, where the prior is raised to the power of $\\alpha$ so that its\ninfluence on inference can be appropriately magnified. From the penalized likelihood viewpoint, we\nare essentially multiplying the penalty function $\\mathrm{Pen}(\\Theta)$ by a multiplier of $\\alpha$, which is equivalent\nto scaling the log likelihood by $1/\\alpha$. Such an adjustment is commonly adopted to develop optimal\ntheoretical results, for example, by [8, 31, 16].\nFor scalability, we propose to compute the MAP estimator instead of sampling from the full posterior.\nFull posterior sampling for high-dimensional GGMs is computationally expensive; for example, in\n[21, 25], the dimension $p$ in all empirical studies is at most 22 due to computational limitations.\nAlthough we only propose a point estimator (2.4) for the precision matrices, our model is still\nformulated from a Bayesian perspective with a continuous spike and slab prior distribution. While\nthis prior does not directly place mass on sparse solutions as in [27, 29], the latent binary indicators\n$\\gamma_{ij}$ can distinguish between \u201csignal\u201d and \u201cnoise\u201d. This is a common technique used in\nthe Bayesian literature [11, 20, 23, 10] to avoid the computational bottleneck of degenerate priors.\nFurthermore, with the Bayesian machinery, we are able to extract the posterior inclusion probabilities\nfor structure recovery in Section 2.3 and provide strong guarantees for graph selection in Section 3.3.\nThe term $\\mathrm{Pen}(\\Theta)$, which is induced from our prior specification, acts as a non-convex penalty function.\nThe non-convexity of the penalty brings desired shrinkage effects, as shown in our theoretical results\nin Section 3 and prior results in the literature [31, 16, 15]. 
However, it may cause the whole objective\nfunction to be non-convex, and consequently, we may need to deal with multiple local solutions to the\nminimization problem (2.4). In the following theorem, we show that when constrained to the\nparameter space $\\Omega$, the minimization problem (2.4) is in fact strictly convex. Thus, the solution $\\hat{\\Theta}$ of\nthe minimization problem (2.4) is unique. A proof of this result is given in the Supplementary Material.\nTheorem 1 If $B < \\sqrt{2 n v_0^2 / (\\alpha K)}$, then the constrained minimization problem (2.4) is strictly convex.\nRemark. The upper bound on $B$ can increase with the sample size $n$. When we establish the selection\nconsistency of $\\hat{\\Theta}$ in Theorem 2, we will require the order of $v_0^2$ to be $O\\left( \\alpha^2 / (n \\log p) \\right)$. Therefore,\nthe upper bound on $B$ is $O\\left( \\sqrt{\\alpha / (K \\log p)} \\right)$, which can go to $+\\infty$ as long as the order of $\\alpha$ is greater\nthan $K \\log p$.\n\n2.3 Common Structure Recovery\n\nUtilizing the hierarchical structure (2.1) of the spike and slab prior, we make inference on the common\nsparsity structure based on the following posterior inclusion probability (PIP):\n\n$$P(\\gamma_{ij} = 1 \\mid \\hat{\\theta}_{ij}) = \\frac{1}{1 + \\frac{1 - p_1}{p_1} \\left( \\frac{v_1}{v_0} \\right)^K \\exp\\left\\{ -\\left( \\frac{1}{v_0} - \\frac{1}{v_1} \\right) \\|\\hat{\\theta}_{ij}\\|_1 \\right\\}} =: p(\\hat{\\theta}_{ij}). \\tag{2.5}$$\n\nWe can estimate the common sparsity structure by thresholding the PIP, e.g., with $t = 1/2$:\n\n$$\\hat{S} = \\left\\{ (i, j) : p(\\hat{\\theta}_{ij}) > t \\right\\}, \\quad \\text{for } t \\in (0, 1). \\tag{2.6}$$\n\nNote that the PIP is a function of $\\|\\hat{\\theta}_{ij}\\|_1$, so that the signal strength in the whole group is utilized\ntogether to estimate the common sparsity pattern. 
Even when some individual entries have small\nsignal strength, the information shared within their group can help us identify the shared structure.\nOur theoretical results provided in Section 3 confirm that this strategy is indeed beneficial for\nrecovering such signals.\n\n3 Theoretical Guarantees\n\nIn this section, we develop the theoretical properties of the proposed estimator $\\hat{\\Theta}$, including estimation\naccuracy and structure recovery consistency. For simplicity, we assume the sample sizes of the $K$\nclasses are the same with $n_1 = \\cdots = n_K = n$ in the theoretical analysis.\nNotations: For a square matrix $A_{p \\times p} = (a_{ij})$, we denote its element-wise $\\ell_\\infty$ norm by $\\|A\\|_\\infty = \\max_{1 \\le i,j \\le p} |a_{ij}|$;\nits Frobenius norm by $\\|A\\|_F$; and its spectral norm by $\\|A\\|_2$. We denote the largest\neigenvalue and smallest eigenvalue of $A$ by $\\lambda_{\\max}(A)$ and $\\lambda_{\\min}(A)$, respectively. When $A$ is a\nsymmetric positive semidefinite matrix, we note $\\|A\\|_2 = \\lambda_{\\max}(A)$. For a collection of $K$ square matrices of the same\ndimension $A = (A_1, \\ldots, A_K)$, write $\\|A\\|_\\infty = \\sup_{1 \\le k \\le K} \\|A_k\\|_\\infty$. Let $\\Theta^0 = (\\Theta^0_1, \\ldots, \\Theta^0_K)$ denote\nthe collection of true precision matrices and $S^0_k = \\{(i, j) : \\theta^0_{k,ij} \\ne 0\\}$ denote the index set of nonzero\nentries in the true precision matrix $\\Theta^0_k$. Define the column sparsity of $\\Theta^0_k$ as $d_k = \\max_i \\mathrm{card}(\\{j : \\theta^0_{k,ij} \\ne 0\\})$,\nwhere $\\mathrm{card}(\\cdot)$ denotes the cardinality of a set, and let $d = \\max_k d_k$.\n\n3.1 Conditions\n\nIn our theoretical analysis, we do not restrict the observed data to follow a Gaussian distribution.\nThus, our Bayesian hierarchical model (2.3) could be treated as a working model. 
The observed data\nare allowed to be from any distribution with exponential tails (e.g., sub-Gaussian distributions) or\npolynomial tails (e.g., t distributions), which is the same setup considered in [3, 10] when the number of\nclasses is $K = 1$. Specifically, for all the $p$-dimensional random vectors $Y_{k,i} = (Y^{(1)}_{k,i}, \\ldots, Y^{(p)}_{k,i})$, $i = 1, \\ldots, n_k$\nand $k = 1, \\ldots, K$, we have the following assumptions:\n(A.1) Exponential tail condition: there exist some constants $0 < \\eta < 1/4$ and $U > 0$ such that\n\n$$(\\log p)/n < \\eta \\quad \\text{and} \\quad E\\left( e^{t Y^{(j)}_{k,i}} \\right) \\le U \\text{ for any } |t| \\le \\eta \\text{ and } j = 1, \\ldots, p; \\tag{3.1}$$\n\n(A.2) Polynomial tail condition: there exist some constants $\\kappa_1, \\kappa_2, \\kappa_3, U > 0$ such that\n\n$$p \\le \\kappa_1 n^{\\kappa_2} \\quad \\text{and} \\quad E\\left| Y^{(j)}_{k,i} \\right|^{4 + 4\\kappa_2 + \\kappa_3} \\le U \\text{ for } j = 1, \\ldots, p. \\tag{3.2}$$\n\nWe shall establish estimation and selection consistency of our method when the true data distribution\nsatisfies (A.1) or (A.2) and is not necessarily a multivariate Gaussian distribution. Note that when the\ndata indeed follow a multivariate Gaussian distribution, (A.1) is satisfied.\nIn addition, we assume that the eigenvalues of the true precision matrices are bounded:\n(A.3) Eigenvalue condition: $1/\\xi_0 \\le \\lambda_{\\min}(\\Theta^0_k) \\le \\lambda_{\\max}(\\Theta^0_k) \\le 1/\\xi_1$ for $k = 1, \\ldots, K$.\n\n3.2 Estimation Accuracy\n\nThe following theorem establishes the rate of convergence of the proposed estimator under the $\\ell_\\infty$ norm.\nFor this result, we do not require the different precision matrices to have the same sparsity structure.\nTheorem 2 Suppose one of the tail conditions, (A.1) or (A.2), holds and the true precision matrices\nsatisfy (A.3). 
Let $C_1 = \\eta^{-1}(2 + \\kappa_0 + \\eta^{-1} U^2)$ when the exponential tail condition (A.1) holds and\n$C_1 = \\sqrt{(\\|\\Theta^0\\|_\\infty + 1)(4 + \\kappa_0)}$ when the polynomial tail condition (A.2) holds for some $\\kappa_0 > 0$. In\naddition, assume that\n(i) the hyperparameters $(v_1, v_0, p_1, \\tau)$ satisfy:\nn log p\n, 1\n\u03b12\nv0\np1\u00b4p1q\n\u010f vK`2\nvK`2\n\nb\n\u0105 C4\n\u010f 2p\u00010{\u03b1;\n\nmaxp 3\n\n, 2\u03c4q \u0103 C3\nv1\n1 p1\u00b4p1q\nvK\n0 p1\n\n\u00012 \u0103 vK\n\nn log p\n\n\u03b12\n\np1\n\n,\n\n1\n\n0\n\n(ii) the sample size $n$ satisfies: $\\sqrt{n} \\ge M_0 \\max(d, \\sqrt{K}) \\sqrt{\\log p}$;\n(iii) the bound $B$ on the spectral norms of the estimated precision matrices satisfies:\n\n$$1/\\xi_1 + d C_5 \\sqrt{\\frac{\\log p}{n}} < B < \\left( \\frac{2 n v_0^2}{\\alpha K} \\right)^{1/2};$$\n\n(iv) the parameter $\\alpha$ satisfies: $\\alpha p^{\\epsilon_0/\\alpha} > K C_3^2 \\log p / (2 \\xi_1^2)$.\nThen, the minimizer $\\hat{\\Theta}$ is unique and satisfies\n\n$$\\|\\hat{\\Theta} - \\Theta^0\\|_\\infty < C_5 \\sqrt{\\frac{\\log p}{n}},$$\n\nwith probability greater than $1 - K\\delta$, where $\\delta = 2p^{-\\kappa_0}$ when condition (A.1) holds, and $\\delta = O(n^{-\\kappa_3/8} + p^{-\\kappa_0/2})$\nwhen condition (A.2) holds. Moreover, $(\\hat{\\Theta}_k)_{ij} = 0$ for $(i, j) \\in (S^0_k)^c$. Here,\n$C_3, \\epsilon_2$ are sufficiently small positive constants, and $M_0, C_4, C_5, \\epsilon_0$ are positive constants that depend only on\nthe ground truth $\\Theta^0$.\n\nOur proof is motivated by the constructive proof technique used in [22] and [10]. Details of the\ndefinitions of $M_0, C_4, C_5, \\epsilon_0$ and the proof of Theorem 2 are provided in the Supplementary Material.\nCondition (i), which is related to the rates of the hyperparameters, controls the level of shrinkage\nof the penalty function $\\mathrm{Pen}(\\Theta)$. 
With a proper choice of the hyperparameters, our penalty function\ninduces an appropriate adaptive shrinkage effect: the shrinkage is strong enough when the magnitude\nof $\\theta$ is small to kill the noise and produce exact zeros, and is insignificant when the magnitude of $\\theta$ is\nlarge so that the bias is controlled. Condition (ii) is on the relationship between the sample size $n$ and\nthe number of variables $p$, and $p$ could grow nearly exponentially with $n$. Condition (iii) deals with\nthe parameter space of the constrained optimization problem, which ensures both the feasibility and\nconvexity of the problem. Under these conditions, our Theorem 2 states that, as long as the parameter\n$\\alpha$ satisfies condition (iv), the error rate of every entry of the estimated precision matrices is at\nmost $O(\\sqrt{(\\log p)/n})$.\n\n3.3 Sparsity Structure Recovery Consistency\n\nBesides the estimation accuracy, another important task is to identify the sparsity structure of the\nprecision matrices, as it tells the conditional dependence relationships between the $p$ variables of\ninterest. If the minimal signal strength satisfies $\\min_k \\min_{i \\ne j, (i,j) \\in S^0_k} (|\\theta^0_{k,ij}|) > L_0 \\sqrt{(\\log p)/n}$ for\nsome sufficiently large constant $L_0$, Theorem 2 directly gives rise to the result that our estimator $\\hat{\\Theta}_k$\nhas the same sparsity structure as the truth $S^0_k$ with probability converging to 1, even when different\nclasses do not have the same sparsity structure. If all precision matrices share a common sparsity\nstructure, i.e., $S^0_1 = \\cdots = S^0_K = S^0$, then our proposed method achieves selection consistency with\na weaker condition on the minimal signal strength as stated in the following theorem.\n\nTheorem 3 Suppose conditions in Theorem 2 all hold. In addition, assume that:\n(v) the minimal signal strength satisfies\n\n$$\\min_{(i,j) \\in S^0} \\left( \\max_k |\\theta^0_{k,ij}| \\right) \\ge L_0 \\sqrt{(\\log p)/n},$$\n\nwhere $L_0 > C_5$ is a sufficiently large constant;\n(vi) rates of the hyperparameters $v_1$, $v_0$, and $p_1$ satisfy\n\n$$\\frac{1 - p_1}{p_1} \\left( \\frac{v_1}{v_0} \\right)^K \\ge \\frac{1 - t}{t} > \\frac{\\frac{1 - p_1}{p_1} \\left( \\frac{v_1}{v_0} \\right)^K}{p^{(C_4 - C_3)(L_0 - C_5)/\\alpha}},$$\n\nwhere $t$ is an arbitrary thresholding value between 0 and 1. Then we have\n\n$$P(\\hat{S} = S^0) \\to 1.$$\n\nCondition (v) is the condition on the minimal signal strength. Compared to similar conditions required\nby approaches that estimate each GGM individually, this condition is weaker since it only places\nrequirements on the largest signal within each group. Therefore, the whole group would benefit from\none large signal. Under the weaker minimal signal strength condition and with appropriate choice of\nthe hyperparameters satisfying condition (vi), we can differentiate between the \u201csignal\u201d and \u201cnoise\u201d\ngroups with probability going to 1. A proof of Theorem 3 is provided in the Supplementary Material.\n\n3.4 Comparisons with Existing Works\n\nIn this section, we compare our theoretical results in estimation accuracy and selection consistency\nwith other alternatives [12, 13]. In the following discussion, we use $\\tilde{\\Theta}$ as a generic notation to denote\nestimators proposed by others.\nGuo et al. [12] established the estimation accuracy of their estimator $\\tilde{\\Theta}$ in Frobenius norm for a fixed\n$K$ value: $\\sum_{k=1}^{K} \\|\\tilde{\\Theta}_k - \\Theta^0_k\\|_F = O_p\\left( \\sqrt{\\frac{(p + q_1) \\log p}{n}} \\right)$, where $q_1 = \\mathrm{card}(\\cup_k \\{S^0_k\\}) - p$. We note that\nour Theorem 2 gives rise to the same rate as theirs under Frobenius norm. For recovering the graph\nstructure, Guo et al. 
[12] obtained sparsistency, i.e., the zero entries in the true precision matrices\nare estimated as zeroes with probability tending to one. However, there is no guarantee that the\nnonzero entries could be detected. This is weaker than our Theorem 3 as we recover the entire graph\nstructure. Moreover, to achieve sparsistency, Guo et al. [12] require the minimum signal strength\n$\\min_k \\min_{i \\ne j, (i,j) \\in S^0_k} (|\\theta^0_{k,ij}|)$ to be lower bounded by some constant while we allow it to go to zero.\nLee and Liu [13] established the estimation accuracy of their estimator $\\tilde{\\Theta}$ in the averaged version\nof the $\\ell_\\infty$-$\\ell_1$ norm: $\\max_{i,j} \\left( \\frac{1}{K} \\sum_{k=1}^{K} |\\tilde{\\theta}_{k,ij} - \\theta^0_{k,ij}| \\right) = O_p\\left( \\sqrt{\\frac{\\log p}{n}} \\right)$. Our estimation error rate\nfrom Theorem 2 is on the maximum over all entries of all precision matrices without averaging,\nand therefore is stronger. In particular, their result is a direct consequence of ours. For selection\nconsistency, the major difference between theirs and ours is the condition on the signal strength.\nLee and Liu [13] implicitly require $\\min_k \\min_{i \\ne j, (i,j) \\in S^0_k} (|\\theta^0_{k,ij}|)$ to be lower bounded at the rate of\n$K (\\log p / n)^{1/2}$, where $K$ is the number of classes, while we only require a smaller signal strength $(\\log p / n)^{1/2}$.\nIn addition, our requirement is on the lower bound of $\\min_{(i,j) \\in S^0} \\max_k (|\\theta^0_{k,ij}|)$, as in condition (v), which is weaker\nthan a requirement on the lower bound of $\\min_k \\min_{i \\ne j, (i,j) \\in S^0_k} (|\\theta^0_{k,ij}|)$.\n\n4 Numerical Studies\n\n4.1 Computation: an EM Algorithm\nFor computation, we propose an EM algorithm by treating $\\Gamma = (\\gamma_{ij})$ as latent variables and estimating\n$\\Theta$ by applying the following two steps iteratively:\n\u2022 E-step: Calculate the posterior probability $P(\\gamma_{ij} = 1 \\mid \\Theta^{(t)}) := p_{ij}(\\theta^{(t)}_{ij})$, which follows the\nformula in 
(2.5), and compute the so-called Q function, the expectation of the full log-likelihood\nwith respect to $P(\\gamma_{ij} = 1 \\mid \\Theta^{(t)})$:\n\n$$Q(\\Theta) = \\sum_{k=1}^{K} \\left\\{ \\frac{n_k}{2\\alpha} \\left( \\log\\det(\\Theta_k) - \\mathrm{tr}(S_k \\Theta_k) \\right) - \\sum_{i=1}^{p} \\tau \\theta_{k,ii} - \\sum_{i<j} \\left[ \\frac{p_{ij}(\\theta^{(t)}_{ij})}{v_1} + \\frac{1 - p_{ij}(\\theta^{(t)}_{ij})}{v_0} \\right] |\\theta_{k,ij}| \\right\\}.$$\n\n\u2022 M-step: The Q function is a sum of $K$ terms, each of which is a weighted graphical Lasso\n[9] problem. Therefore, in the M-step, we maximize the Q function within the parameter space\n$\\Omega$, utilizing algorithms for graphical Lasso. As a result, the computational complexity of our EM\nalgorithm is $O(p^3)$, which is as efficient as the state-of-the-art algorithms for graphical Lasso\nproblems [9, 10].\n\nDerivations and implementation details of the algorithm are provided in the Supplementary Material.\n\n4.2 Simulation Results\n\nFollowing the simulation setups in [12, 5, 21, 13], we assess the performance of our proposed method\nunder six different settings: three nearest-neighbor networks and three scale-free networks. The\ndetails of the settings are described as follows.\n\n1. Nearest-neighbor network: we randomly generate $p$ points on a unit square and find the $o$ nearest\nneighbors of each point in terms of the Euclidean distance. The baseline nearest-neighbor network\nis constructed by linking any two points which are the $o$-nearest neighbors of each other. Larger $o$\ninduces a denser network and here, we use $o = 3$. 
After that, we generate $K$ individual networks\nby adding $\\rho M$ individual edges to the baseline graph, where $M$ is the number of edges in the baseline\ngraph and $\\rho = 0, 0.25, 0.5$.\nGiven a network structure, we generate the corresponding precision matrix $\\Theta_k$ by assigning\nones on diagonal entries, zeros on entries not corresponding to network edges, and values\nfrom a uniform distribution with support on $[-1, -0.5] \\cup [0.5, 1]$ on entries corresponding\nto edges. To ensure positive definiteness, we then divide each off-diagonal element $\\theta_{k,ij}$ by\n$1.01 \\sqrt{\\sum_{i: i \\ne j} |\\theta_{k,ij}|} \\sqrt{\\sum_{j: j \\ne i} |\\theta_{k,ij}|}$.\n\n2. Scale-free network: many real-world large networks, such as the world wide web, social networks,\nand collaboration networks, are thought to be scale-free. We construct the baseline scale-free\nnetwork using the Barab\u00e1si-Albert model [2]. Next, individual networks and corresponding\nprecision matrices are generated in the same way as in the first design.\n\nIn each setting, we set $K = 3$ and $p = n_k = 100$, and, for each $k \\in \\{1, \\ldots, K\\}$, we generate $n_k$\nindependently and identically distributed observations from a multivariate Gaussian distribution\nwith mean 0 and precision matrix $\\Theta_k$. We compare our method with $\\alpha = 1$ and $\\alpha = n$ against three\nother methods: fitting each class individually by BAGUS (denoted as BAGUS) [10]; ignoring\nthe class information and fitting a single model by BAGUS (denoted as Pooled); and the group graphical\nLasso (denoted as GGL) [5]. Bayesian approaches based on full posterior sampling [21, 25] are\nnot considered for comparison as their Markov chain Monte Carlo (MCMC) samplers do not\nscale to large $p$. For all methods, we use a grid search to select the set of hyperparameters that\nminimizes BIC. 
For BAGUS and Pooled methods, we follow the same tuning procedure as in [10] and\ntune the spike and slab prior parameters $(v_0, v_1)$ with $v_0 = (0.25, 0.5, 0.75, 1) \\times \\sqrt{1/(n \\log p)}$ and\n$v_1 = (2.5, 5, 7.5, 10) \\times \\sqrt{1/(n \\log p)}$. For GGL, we tune the two penalty parameters $(\\lambda_1, \\lambda_2)$ as in\n[5] with $\\lambda_1 = (0.1, 0.2, \\ldots, 1)$ and $\\lambda_2 = (0.1, 0.3, 0.5)$.\nTo compare the performance of the methods, we calculate specificity (Spec), sensitivity (Sens),\nMatthews correlation coefficient (MCC), area under the ROC curve (AUC), Frobenius norm (F-norm),\nand element-wise $\\ell_\\infty$ norm ($\\ell_\\infty$ norm) for each class. In Tables 1 and 2, we report the maximum of the $\\ell_\\infty$\nnorm and the average of the other measures over the $K$ classes, and the results are aggregated based\non 100 replications. From the results, we observe that our method performs the best in all the designs\nin terms of both selection accuracy (MCC and AUC) and estimation accuracy (F-norm and $\\ell_\\infty$ norm).\nEven when $\\rho \\ne 0$, that is, the sparsity patterns over classes are different, which deviates from our\nassumption, our method still has the best performance.\nThe average computational times of all the methods on a MacBook Pro with a 2.9 GHz Intel Core\ni5 processor and 8.00 GB of memory are reported in Table 3. The computational time of our method\nis comparable to that of the competitors except the Pooled method, which restrictively assumes the same\nprecision matrix for all classes and has much worse performance compared to our method. Therefore,\nour method is competitive even after considering the runtimes.\n\n4.3 Application to Capital Bikeshare Data\n\nWe use Capital Bikeshare trip data (footnote 1) to evaluate the performance of the proposed method. The data\ncontain records of bike rentals in a bicycle sharing system with more than 500 stations. We consider\n$p = 237$ stations located in Washington, D.C. 
and record the number of rentals started at these\nstations for every day in 2016, 2017 and 2018. Following the same processing procedure in [32],\nwe remove the seasonal trend and marginally transform each station\u2019s data to a normal distribution.\n\n1Data available at https://www.capitalbikeshare.com/system-data\n\n7\n\n\fTable 1: Result of nearest-neighbor network\n\nSpec\n\nSens\n\nMCC\n\nAUC\nn \u201c 100, p \u201c 100, \u03c1 \u201c 0\n\n1.000(0.000)\n1.000(0.000)\n0.994(0.002)\n0.989(0.003)\n0.948(0.008)\n\n0.994(0.001)\n0.992(0.003)\n0.988(0.003)\n0.976(0.004)\n0.966(0.010)\n\n0.920(0.038)\n0.993(0.008)\n0.816(0.039)\n0.664(0.056)\n0.707(0.074)\n\n0.955(0.022)\n0.991(0.009)\n0.794(0.033)\n0.616(0.048)\n0.401(0.044)\n\n0.974(0.017)\n0.996(0.004)\n0.903(0.022)\n0.840(0.029)\n0.845(0.038)\n\nn \u201c 100, p \u201c 100, \u03c1 \u201c 0.25\n\n0.823(0.034)\n0.889(0.021)\n0.813(0.030)\n0.571(0.045)\n0.769(0.043)\n\n0.803(0.015)\n0.823(0.025)\n0.732(0.025)\n0.472(0.029)\n0.552(0.054)\n\n0.954(0.014)\n0.943(0.010)\n0.917(0.017)\n0.783(0.024)\n0.879(0.022)\n\nn \u201c 100, p \u201c 100, \u03c1 \u201c 0.5\n\nF-norm\n\n(cid:96)8 norm\n\n2.576(0.221)\n2.094(0.150)\n3.184(0.190)\n7.115(0.380)\n6.338(0.382)\n\n0.503(0.083)\n0.449(0.099)\n0.551(0.093)\n0.983(0.035)\n0.604(0.037)\n\n2.862(0.145)\n2.867(0.154)\n3.372(0.148)\n6.179(0.256)\n5.274(0.122)\n\n0.443(0.066)\n0.443(0.074)\n0.591(0.102)\n0.871(0.104)\n0.529(0.029)\n\n0.992(0.002)\n0.986(0.008)\n0.986(0.002)\n0.976(0.003)\n0.980(0.007)\n\n0.664(0.043)\n0.770(0.043)\n0.710(0.030)\n0.469(0.031)\n0.684(0.077)\n\n0.699(0.023)\n0.713(0.035)\n0.667(0.023)\n0.421(0.027)\n0.608(0.028)\n\n0.920(0.030)\n0.882(0.020)\n0.878(0.014)\n0.777(0.033)\n0.838(0.038)\n\n3.170(0.170)\n3.256(0.112)\n3.707(0.146)\n5.538(0.208)\n4.940(0.256)\n\n0.426(0.050)\n0.427(0.043)\n0.587(0.089)\n0.735(0.111)\n0.502(0.026)\n\nTable 2: Result of scale-free network\n\nSpec\n\nSens\n\nMCC\n\nAUC\nn \u201c 100, p \u201c 100, \u03c1 \u201c 
0\n\n1.000(0.000)\n0.996(0.002)\n0.997(0.001)\n0.958(0.003)\n0.938(0.007)\n\n0.993(0.001)\n0.991(0.002)\n0.990(0.001)\n0.959(0.004)\n0.959(0.006)\n\n1.000(0.002)\n0.976(0.014)\n0.995(0.004)\n0.746(0.043)\n1.000(0.001)\n\n0.993(0.006)\n0.906(0.047)\n0.936(0.019)\n0.429(0.027)\n0.483(0.022)\n\n1.000(0.000)\n0.988(0.007)\n0.998(0.002)\n0.903(0.018)\n1.000(0.001)\n\nn \u201c 100, p \u201c 100, \u03c1 \u201c 0.25\n\n0.921(0.024)\n0.914(0.020)\n0.919(0.021)\n0.654(0.040)\n0.964(0.013)\n\n0.833(0.011)\n0.808(0.021)\n0.801(0.019)\n0.415(0.027)\n0.591(0.029)\n\n0.992(0.003)\n0.955(0.010)\n0.967(0.009)\n0.833(0.021)\n0.980(0.007)\n\nn \u201c 100, p \u201c 100, \u03c1 \u201c 0.5\n\nF-norm\n\n(cid:96)8 norm\n\n1.664(0.088)\n1.942(0.133)\n1.747(0.096)\n7.148(0.300)\n5.043(0.282)\n\n0.514(0.117)\n0.432(0.092)\n0.492(0.107)\n0.869(0.024)\n0.545(0.019)\n\n2.032(0.079)\n2.365(0.083)\n2.407(0.100)\n6.331(0.229)\n4.705(0.137)\n\n0.454(0.087)\n0.435(0.050)\n0.518(0.088)\n0.799(0.040)\n0.540(0.024)\n\n0.988(0.001)\n0.983(0.007)\n0.986(0.003)\n0.972(0.003)\n0.978(0.007)\n\n0.787(0.031)\n0.816(0.042)\n0.822(0.023)\n0.508(0.032)\n0.847(0.041)\n\n0.719(0.018)\n0.696(0.036)\n0.716(0.028)\n0.405(0.025)\n0.672(0.035)\n\n0.958(0.009)\n0.904(0.020)\n0.938(0.011)\n0.761(0.030)\n0.921(0.021)\n\n2.548(0.088)\n2.813(0.099)\n3.106(0.131)\n5.816(0.178)\n4.808(0.275)\n\n0.437(0.058)\n0.443(0.046)\n0.595(0.116)\n0.741(0.052)\n0.539(0.030)\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nOur method p\u03b1 \u201c 1q\nOur method p\u03b1 \u201c nq\nBAGUS\nPooled\nGGL\n\nTable 3: Average computational time (in seconds) based on 10 
replications.\n\nMethod | Nearest-neighbor network: $\\rho = 0$ | $\\rho = 0.25$ | $\\rho = 0.5$ | Scale-free network: $\\rho = 0$ | $\\rho = 0.25$ | $\\rho = 0.5$\nOur method ($\\alpha = 1$) | 3.667(0.040) | 3.645(0.087) | 3.552(0.026) | 3.556(0.037) | 3.545(0.030) | 3.537(0.033)\nOur method ($\\alpha = n$) | 7.792(0.456) | 4.596(0.643) | 3.597(0.049) | 5.285(2.623) | 3.600(0.025) | 3.578(0.023)\nBAGUS | 3.635(0.023) | 3.572(0.027) | 3.547(0.021) | 3.553(0.012) | 3.546(0.022) | 3.534(0.018)\nPooled | 1.211(0.010) | 1.178(0.013) | 1.169(0.008) | 1.184(0.015) | 1.173(0.008) | 1.168(0.010)\nGGL | 8.715(0.314) | 8.034(0.689) | 5.482(1.528) | 8.086(0.262) | 6.139(0.678) | 3.074(0.270)\n\nWe divide the observations into $K = 3$ classes by year, since it is natural to expect the precision matrix\nto change from year to year due to annual policy decisions, economic conditions, and other aspects of the\nbusiness. Then, we take the first 80% of observations in each class as training data and the other 20%\nas test data.\nWe apply our method with $\\alpha = 365$, as well as the other methods compared in the simulation studies,\ni.e., BAGUS, Pooled, and GGL, to the training data to estimate the $\\mu_k$\u2019s and $\\Theta_k$\u2019s. For year $k$ and\nday $i$, we divide the data $Y_{k,i} = (y_{k,i}^{(1)}, \\ldots, y_{k,i}^{(237)})$ into two parts, $Y_{k,i1} = (y_{k,i}^{(1)}, \\ldots, y_{k,i}^{(118)})$ and\n$Y_{k,i2} = (y_{k,i}^{(119)}, \\ldots, y_{k,i}^{(237)})$. Assuming the first half $Y_{k,i1}$ is observed, we predict the second half\n$Y_{k,i2}$ by the following best linear predictor derived from the multivariate Gaussian distribution:\n\n$$\\hat{Y}_{k,i2} = E(Y_{k,i2} \\mid Y_{k,i1}) = \\hat{\\mu}_{k2} + \\hat{\\Theta}_{k21} \\hat{\\Theta}_{k11}^{-1} (Y_{k,i1} - \\hat{\\mu}_{k1}), \\quad \\text{for } k = 1, 2, 3, \\text{ and } i \\in T_k,$$\n\nwhere $T_k$ is the index set of the $k$-th class of test data, $\\mu_k = (\\mu_{k1}, \\mu_{k2})$ and $\\Theta_k = \\begin{pmatrix} \\Theta_{k11} & \\Theta_{k12} \\\\ \\Theta_{k21} & \\Theta_{k22} \\end{pmatrix}$.\nWe use the average absolute forecast error (AAFE) of each class for performance comparison:\n\n$$\\mathrm{AAFE}_k = \\frac{1}{119} \\frac{1}{\\mathrm{card}(T_k)} \\sum_{i \\in T_k} \\sum_{j=119}^{237} |\\hat{y}_{k,i}^{(j)} - y_{k,i}^{(j)}|, \\quad k = 1, 2, 3.$$\n\nFigure 1: Averaged AAFE versus the total number of nonzero off-diagonal entries in the estimated precision matrices.\n\nFigure 2: Degree distributions of the estimated common station networks over three years by our method and BAGUS. (a) Our method. (b) BAGUS.\n\nIn Figure 1, we plot the averaged AAFE versus the number of nonzero off-diagonal entries in the\nestimated precision matrices. For our method, BAGUS, and Pooled, we plot the curves by\nfixing $v_1$ and varying $v_0$. For GGL, we fix the ratio between its two tuning parameters and vary\nthem together. Different ratios produce similar curves, so only one of them is plotted. We\nobserve that our method not only achieves the lowest averaged AAFE, but also outputs the sparsest\nestimated precision matrices when the lowest averaged AAFE is attained.\nTo get estimates for the station networks, we select the hyperparameters of our method and BAGUS\nby BIC and plot the degree distributions of the estimated common station networks over three\nyears in Figure 2. From the common structure learned by our method, two stations are found to have\nhigher connectivity and are identified as hubs.\n
It turns out that one is close to Union Station (a major\ntransportation hub) and the other is close to Dupont Circle (a popular residential neighborhood).\nTherefore, it is not surprising that the two stations play an important role in the dependence graph.\n\nAcknowledgment\n\nThis work is supported in part by grants NSF DMS-1916472 and NSF DMS-1811768.\n\nReferences\n\n[1] Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis, 136:147\u2013162.\n\n[2] Barab\u00e1si, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509\u2013512.\n\n[3] Cai, T., Liu, W., and Luo, X. (2011). A constrained $\\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594\u2013607.\n\n[4] Carvalho, C. M. and Scott, J. G. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96(3):497\u2013512.\n\n[5] Danaher, P., Wang, P., and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B, 76(2):373\u2013397.\n\n[6] Dempster, A. P. (1972). Covariance selection. Biometrics, 28(1):157\u2013175.\n\n[7] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418\u20131433.\n\n[8] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348\u20131360.\n\n[9] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441.\n\n[10] Gan, L., Narisetty, N. N., and Liang, F. (2019). Bayesian regularization for graphical models with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218\u20131231.\n\n[11] George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881\u2013889.\n\n[12] Guo, J., Levina, E., Michailidis, G., and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika, 98(1):1\u201315.\n\n[13] Lee, W. and Liu, Y. (2015). Joint estimation of multiple precision matrices with common structures. The Journal of Machine Learning Research, 16(1):1035\u20131062.\n\n[14] Li, Z., McCormick, T., and Clark, S. (2019). Bayesian joint spike-and-slab graphical lasso. In International Conference on Machine Learning, pages 3877\u20133885.\n\n[15] Loh, P.-L. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559\u2013616.\n\n[16] Loh, P.-L. and Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455\u20132482.\n\n[17] Ma, J. and Michailidis, G. (2016). Joint structural estimation of multiple graphical models. The Journal of Machine Learning Research, 17(1):5777\u20135824.\n\n[18] Mazumder, R. and Hastie, T. (2012). The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125\u20132149.\n\n[19] Mohammadi, A. and Wit, E. C. (2015). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10:109\u2013138.\n\n[20] Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789\u2013817.\n\n[21] Peterson, C., Stingo, F. C., and Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110(509):159\u2013174.\n\n[22] Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935\u2013980.\n\n[23] Ro\u010dkov\u00e1, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association, 109(506):828\u2013846.\n\n[24] Ro\u010dkov\u00e1, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431\u2013444.\n\n[25] Tan, L. S. L., Jasra, A., De Iorio, M., and Ebbels, T. M. D. (2017). Bayesian inference for multiple Gaussian graphical models with application to metabolic association networks. The Annals of Applied Statistics, 11(4):2222\u20132251.\n\n[26] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6:168\u2013198.\n\n[27] Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909\u2013936.\n\n[28] Yang, C., Gan, L., Wang, Z., Shen, J., Xiao, J., and Han, J. (2019). Query-specific knowledge summarization with entity evolutionary networks. In The 28th ACM International Conference on Information and Knowledge Management (CIKM).\n\n[29] Yang, X. and Narisetty, N. N. (in press). Consistent group selection with Bayesian high dimensional modeling. Bayesian Analysis.\n\n[30] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19\u201335.\n\n[31] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894\u2013942.\n\n[32] Zhu, Y. and Barber, R. F. (2015). The log-shift penalty for adaptive estimation of multiple Gaussian graphical models. In International Conference on Artificial Intelligence and Statistics, pages 1153\u20131161.\n", "award": [], "sourceid": 5203, "authors": [{"given_name": "Lingrui", "family_name": "Gan", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Xinming", "family_name": "Yang", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Naveen", "family_name": "Narisetty", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Feng", "family_name": "Liang", "institution": "Univ. of Illinois Urbana-Champaign"}]}