{"title": "Efficient Recovery of Jointly Sparse Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 1812, "page_last": 1820, "abstract": "We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the $(2,1)$-norm minimization, which is an extension of the well-known $1$-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.", "full_text": "Efficient Recovery of Jointly Sparse Vectors\n\nLiang Sun, Jun Liu, Jianhui Chen, Jieping Ye\n\nSchool of Computing, Informatics, and Decision Systems Engineering\n\nArizona State University\n\nTempe, AZ 85287\n\n{sun.liang,j.liu,jianhui.chen,jieping.ye}@asu.edu\n\nAbstract\n\nWe consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). 
Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.\n\n1 Introduction\n\nCompressive sensing (CS), also known as compressive sampling, has recently received increasing attention in many areas of science and engineering [3]. In CS, an unknown sparse signal is reconstructed from a single measurement vector. Recent theoretical studies show that one can recover certain sparse signals from far fewer samples or measurements than traditional methods require [4, 8]. In this paper, we consider the problem of reconstructing sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing.\nThe MMV model was motivated by the need to solve the neuromagnetic inverse problem that arises in magnetoencephalography (MEG), which is a modality for imaging the brain [7]. 
It arises in a variety of applications, such as DNA microarrays [11], equalization of sparse communication channels [6], echo cancellation [9], magnetoencephalography [12], computing sparse solutions to linear inverse problems [7], and source localization in sensor networks [17]. Unlike SMV, the signal in the MMV model is represented as a set of jointly sparse vectors whose nonzeros occur in a common set of locations [5, 7]. It has been shown that the additional block-sparse structure can lead to improved performance in signal recovery [5, 10, 16, 21].\nSeveral recovery algorithms have been proposed for the MMV model in the past [5, 7, 18, 24, 25]. Since the sparse representation problem is a combinatorial optimization problem and is in general NP-hard [5], the algorithms in [18, 25] employ the greedy strategy to recover the signal using an iterative scheme. One alternative is to relax it into a convex optimization problem, from which the global optimal solution can be obtained. The most widely studied approach is the one based on the (2, 1)-norm minimization [5, 7, 10]. A similar relaxation technique (via the 1-norm minimization) is employed in the SMV model. Recent studies have shown that most of the theoretical results on the convex relaxation of the SMV model can be extended to the MMV model [5], although further theoretical investigation is needed [26]. Unlike the SMV model where the 1-norm minimization can be solved efficiently, the resulting convex optimization problem in MMV is much more difficult to solve. Existing algorithms formulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) [16] problem, which can be solved using standard software packages such as SeDuMi [23]. 
However, for problems of moderate size, solving either SOCP or SDP is computationally expensive, which limits their use in practice.\nIn this paper, we derive a dual reformulation of the (2, 1)-norm minimization problem in MMV. More specifically, we show that the (2, 1)-norm minimization problem can be reformulated as a min-max problem, which can be solved efficiently via the prox-method with a nearly dimension-independent convergence rate [19]. Compared with existing algorithms, our algorithm can scale to larger problems while achieving high accuracy. Interestingly, our theoretical analysis reveals the close relationship between the resulting min-max problem and multiple kernel learning [14]. We have performed simulation studies and our results demonstrate the scalability of the proposed algorithm in comparison with existing algorithms.\nNotations: All matrices are boldface uppercase. Vectors are boldface lowercase. Sets and spaces are denoted with calligraphic letters. The $p$-norm of the vector $\\mathbf{v} = (v_1, \\cdots, v_d)^T \\in \\mathbb{R}^d$ is defined as $\\|\\mathbf{v}\\|_p := \\left(\\sum_{i=1}^d |v_i|^p\\right)^{1/p}$. The inner product on $\\mathbb{R}^{m \\times d}$ is defined as $\\langle \\mathbf{X}, \\mathbf{Y} \\rangle = \\mathrm{tr}(\\mathbf{X}^T \\mathbf{Y})$. For matrix $\\mathbf{A} \\in \\mathbb{R}^{m \\times d}$, we denote by $\\mathbf{a}^i$ and $\\mathbf{a}_i$ the $i$th row and the $i$th column of $\\mathbf{A}$, respectively. The $(r, p)$-norm of $\\mathbf{A}$ is defined as:\n\n$$\\|\\mathbf{A}\\|_{r,p} := \\left(\\sum_{i=1}^m \\|\\mathbf{a}^i\\|_r^p\\right)^{1/p}. \\qquad (1)$$\n\n2 The Multiple Measurement Vector Model\n\nIn the SMV model, one aims to recover the sparse signal $\\mathbf{w}$ from a measurement vector $\\mathbf{b} = \\mathbf{A}\\mathbf{w}$ for a given matrix $\\mathbf{A}$ [3]. The SMV model can be extended to the multiple measurement vector (MMV) model, in which the signal is represented as a set of jointly sparse vectors sharing a common set of nonzeros [5, 7]. The MMV model aims to recover the sparse representations for SMVs simultaneously. 
It has been shown that the MMV model provably improves the standard CS recovery by exploiting the block-sparse structure [10, 21].\nSpecifically, in the MMV model we consider the reconstruction of the signal represented by a matrix $\\mathbf{W} \\in \\mathbb{R}^{d \\times n}$, which is given by a dictionary (or measurement matrix) $\\mathbf{A} \\in \\mathbb{R}^{m \\times d}$ and multiple measurement vector $\\mathbf{B} \\in \\mathbb{R}^{m \\times n}$ such that\n\n$$\\mathbf{B} = \\mathbf{A}\\mathbf{W}. \\qquad (2)$$\n\nEach column of $\\mathbf{A}$ is associated with an atom, and a set of atoms is called a dictionary. A sparse representation means that the matrix $\\mathbf{W}$ has a small number of rows containing nonzero entries. Usually, we have $m \\ll d$ and $d > n$.\nSimilar to SMV, we can use $\\|\\mathbf{W}\\|_{p,0}$ to measure the number of rows in $\\mathbf{W}$ that contain nonzero entries. Thus, the problem of finding the sparsest representation of the signal $\\mathbf{W}$ in MMV is equivalent to solving the following problem, a.k.a. the sparse representation problem:\n\n$$(P0): \\quad \\min_{\\mathbf{W}} \\|\\mathbf{W}\\|_{p,0}, \\quad \\text{s.t. } \\mathbf{A}\\mathbf{W} = \\mathbf{B}. \\qquad (3)$$\n\nSome typical choices of $p$ include $p = \\infty$ and $p = 2$ [25]. However, solving (P0) requires enumerating all subsets of the set $\\{1, 2, \\cdots, d\\}$, which is essentially a combinatorial optimization problem and is in general NP-hard [5]. Similar to the use of the 1-norm minimization in the SMV model, one natural alternative is to use $\\|\\mathbf{W}\\|_{p,1}$ instead of $\\|\\mathbf{W}\\|_{p,0}$, resulting in the following convex optimization problem (P1):\n\n$$(P1): \\quad \\min_{\\mathbf{W}} \\|\\mathbf{W}\\|_{p,1}, \\quad \\text{s.t. } \\mathbf{A}\\mathbf{W} = \\mathbf{B}. \\qquad (4)$$\n\nThe relationship between (P0) and (P1) for the MMV model has been studied in [5].\nFor $p = 2$, the optimal $\\mathbf{W}$ is given by solving the following convex optimization problem:\n\n$$\\min_{\\mathbf{W}} \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 \\quad \\text{s.t. } \\mathbf{A}\\mathbf{W} = \\mathbf{B}. \\qquad (5)$$\n\nExisting algorithms formulate Eq. (5) as a second-order cone programming (SOCP) problem or a semidefinite programming (SDP) problem [16]. Recall that the optimization problem in Eq. 
(5) is equivalent to the following problem by removing the square in the objective:\n\n$$\\min_{\\mathbf{W}} \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1} \\quad \\text{s.t. } \\mathbf{A}\\mathbf{W} = \\mathbf{B}.$$\n\nBy introducing auxiliary variables $t_i$ ($i = 1, \\cdots, d$), this problem can be reformulated in the standard second-order cone programming (SOCP) formulation:\n\n$$\\min_{\\mathbf{W}, t_1, \\cdots, t_d} \\frac{1}{2}\\sum_{i=1}^d t_i \\quad \\text{s.t. } \\|\\mathbf{W}^i\\|_2 \\le t_i, \\; t_i \\ge 0, \\; i = 1, \\cdots, d, \\; \\mathbf{A}\\mathbf{W} = \\mathbf{B}. \\qquad (6)$$\n\nBased on this SOCP formulation, it can also be transformed into the standard semidefinite programming (SDP) formulation:\n\n$$\\min_{\\mathbf{W}, t_1, \\cdots, t_d} \\frac{1}{2}\\sum_{i=1}^d t_i \\quad \\text{s.t. } \\begin{bmatrix} t_i \\mathbf{I} & (\\mathbf{W}^i)^T \\\\ \\mathbf{W}^i & t_i \\end{bmatrix} \\succeq 0, \\; t_i \\ge 0, \\; i = 1, \\cdots, d, \\; \\mathbf{A}\\mathbf{W} = \\mathbf{B}. \\qquad (7)$$\n\nThe interior point method [20] and the bundle method [13] can be applied to solve SOCP and SDP. However, they do not scale to problems of moderate size, which limits their use in practice.\n\n3 The Proposed Dual Formulation\n\nIn this section we present a dual reformulation of the optimization problem in Eq. (5). First, some preliminary results are summarized in Lemmas 1 and 2:\nLemma 1. Let $\\mathbf{A}$ and $\\mathbf{X}$ be $m$-by-$d$ matrices. Then the following holds:\n\n$$\\langle \\mathbf{A}, \\mathbf{X} \\rangle \\le \\frac{1}{2}\\left( \\|\\mathbf{X}\\|_{2,1}^2 + \\|\\mathbf{A}\\|_{2,\\infty}^2 \\right). \\qquad (8)$$\n\nWhen the equality holds, we have $\\|\\mathbf{X}\\|_{2,1} = \\|\\mathbf{A}\\|_{2,\\infty}$.\nProof. It follows from the definition of the $(r, p)$-norm in Eq. (1) that $\\|\\mathbf{X}\\|_{2,1} = \\sum_{i=1}^m \\|\\mathbf{x}^i\\|_2$, and $\\|\\mathbf{A}\\|_{2,\\infty} = \\max_{1 \\le i \\le m} \\|\\mathbf{a}^i\\|_2$. Without loss of generality, we assume that $\\|\\mathbf{a}^k\\|_2 = \\max_{1 \\le i \\le m} \\|\\mathbf{a}^i\\|_2$ for some $1 \\le k \\le m$. Thus, $\\|\\mathbf{A}\\|_{2,\\infty} = \\|\\mathbf{a}^k\\|_2$, and we have\n\n$$\\langle \\mathbf{A}, \\mathbf{X} \\rangle = \\sum_{i=1}^m \\mathbf{a}^i (\\mathbf{x}^i)^T \\le \\sum_{i=1}^m \\|\\mathbf{a}^i\\|_2 \\|\\mathbf{x}^i\\|_2 \\le \\sum_{i=1}^m \\|\\mathbf{a}^k\\|_2 \\|\\mathbf{x}^i\\|_2 = \\|\\mathbf{a}^k\\|_2 \\sum_{i=1}^m \\|\\mathbf{x}^i\\|_2$$\n\n$$\\le \\frac{1}{2}\\left( \\|\\mathbf{a}^k\\|_2^2 + \\left( \\sum_{i=1}^m \\|\\mathbf{x}^i\\|_2 \\right)^2 \\right) = \\frac{1}{2}\\left( \\|\\mathbf{A}\\|_{2,\\infty}^2 + \\|\\mathbf{X}\\|_{2,1}^2 \\right).$$\n\nClearly, the last inequality becomes equality when $\\|\\mathbf{X}\\|_{2,1} = \\|\\mathbf{A}\\|_{2,\\infty}$.\nLemma 2. Let $\\mathbf{A}$ and $\\mathbf{X}$ be defined as in Lemma 1. Then the following holds:\n\n$$\\max_{\\mathbf{X}} \\left\\{ \\langle \\mathbf{A}, \\mathbf{X} \\rangle - \\frac{1}{2}\\|\\mathbf{X}\\|_{2,1}^2 \\right\\} = \\frac{1}{2}\\|\\mathbf{A}\\|_{2,\\infty}^2.$$\n\nProof. Denote the set $\\mathcal{Q} = \\{k : 1 \\le k \\le m, \\|\\mathbf{a}^k\\|_2 = \\max_{1 \\le i \\le m} \\|\\mathbf{a}^i\\|_2\\}$. Let $\\{\\alpha_k\\}_{k=1}^m$ be such that $\\alpha_k = 0$ for $k \\notin \\mathcal{Q}$, $\\alpha_k \\ge 0$ for $k \\in \\mathcal{Q}$, and $\\sum_{k=1}^m \\alpha_k = 1$. Clearly, all inequalities in the proof of Lemma 1 become equalities if and only if we construct the matrix $\\mathbf{X}$ as follows:\n\n$$\\mathbf{x}^k = \\begin{cases} \\alpha_k \\mathbf{a}^k, & \\text{if } k \\in \\mathcal{Q} \\\\ \\mathbf{0}, & \\text{otherwise.} \\end{cases} \\qquad (9)$$\n\nThus, the maximum of $\\langle \\mathbf{A}, \\mathbf{X} \\rangle - \\frac{1}{2}\\|\\mathbf{X}\\|_{2,1}^2$ is $\\frac{1}{2}\\|\\mathbf{A}\\|_{2,\\infty}^2$, which is achieved when $\\mathbf{X}$ is constructed as in Eq. (9).\n\nBased on the results established in Lemmas 1 and 2, we can derive the dual formulation of the optimization problem in Eq. (5) as follows. First we construct the Lagrangian $\\mathcal{L}$:\n\n$$\\mathcal{L}(\\mathbf{W}, \\mathbf{U}) = \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 - \\langle \\mathbf{U}, \\mathbf{A}\\mathbf{W} - \\mathbf{B} \\rangle = \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 - \\langle \\mathbf{U}, \\mathbf{A}\\mathbf{W} \\rangle + \\langle \\mathbf{U}, \\mathbf{B} \\rangle.$$\n\nThe dual problem can be formulated as follows:\n\n$$\\max_{\\mathbf{U}} \\min_{\\mathbf{W}} \\; \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 - \\langle \\mathbf{U}, \\mathbf{A}\\mathbf{W} \\rangle + \\langle \\mathbf{U}, \\mathbf{B} \\rangle. \\qquad (10)$$\n\nIt follows from Lemma 2 that\n\n$$\\min_{\\mathbf{W}} \\left\\{ \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 - \\langle \\mathbf{U}, \\mathbf{A}\\mathbf{W} \\rangle \\right\\} = \\min_{\\mathbf{W}} \\left\\{ \\frac{1}{2}\\|\\mathbf{W}\\|_{2,1}^2 - \\langle \\mathbf{A}^T\\mathbf{U}, \\mathbf{W} \\rangle \\right\\} = -\\frac{1}{2}\\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}^2.$$\n\nNote that from Lemma 2, the equality holds if and only if the optimal $\\mathbf{W}^*$ can be represented as\n\n$$\\mathbf{W}^* = \\mathrm{diag}(\\boldsymbol{\\alpha})\\mathbf{A}^T\\mathbf{U}, \\qquad (11)$$\n\nwhere $\\boldsymbol{\\alpha} = [\\alpha_1, \\cdots, \\alpha_d]^T \\in \\mathbb{R}^d$, $\\alpha_i \\ge 0$ if $\\|(\\mathbf{A}^T\\mathbf{U})^i\\|_2 = \\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}$, $\\alpha_i = 0$ if $\\|(\\mathbf{A}^T\\mathbf{U})^i\\|_2 < \\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}$, and $\\sum_{i=1}^d \\alpha_i = 1$. Thus, the dual problem can be simplified into the following form:\n\n$$\\max_{\\mathbf{U}} \\; -\\frac{1}{2}\\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}^2 + \\langle \\mathbf{U}, \\mathbf{B} \\rangle. \\qquad (12)$$\n\nFollowing the definition of the $(2, \\infty)$-norm, we can reformulate the dual problem in Eq. (12) as a min-max problem, as summarized in the following theorem:\nTheorem 1. The optimization problem in Eq. (5) can be formulated equivalently as:\n\n$$\\min_{\\sum_{i=1}^d \\theta_i = 1,\\, \\theta_i \\ge 0} \\; \\max_{\\mathbf{u}_1, \\cdots, \\mathbf{u}_n} \\left\\{ \\sum_{j=1}^n \\left( \\mathbf{u}_j^T \\mathbf{b}_j - \\frac{1}{2}\\sum_{i=1}^d \\theta_i \\mathbf{u}_j^T \\mathbf{G}_i \\mathbf{u}_j \\right) \\right\\}, \\qquad (13)$$\n\nwhere the matrix $\\mathbf{G}_i$ is defined as $\\mathbf{G}_i = \\mathbf{a}_i \\mathbf{a}_i^T$ ($1 \\le i \\le d$), and $\\mathbf{a}_i$ is the $i$th column of $\\mathbf{A}$.\nProof. Note that $\\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}^2$ can be reformulated as follows:\n\n$$\\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}^2 = \\max_{1 \\le i \\le d} \\left\\{ \\|\\mathbf{a}_i^T \\mathbf{U}\\|_2^2 \\right\\} = \\max_{1 \\le i \\le d} \\{\\mathrm{tr}(\\mathbf{U}^T \\mathbf{a}_i \\mathbf{a}_i^T \\mathbf{U})\\} = \\max_{1 \\le i \\le d} \\{\\mathrm{tr}(\\mathbf{U}^T \\mathbf{G}_i \\mathbf{U})\\} = \\max_{\\theta_i \\ge 0,\\, \\sum_{i=1}^d \\theta_i = 1} \\; \\sum_{i=1}^d \\theta_i \\mathrm{tr}(\\mathbf{U}^T \\mathbf{G}_i \\mathbf{U}). \\qquad (14)$$\n\nSubstituting Eq. (14) into Eq. (12), we obtain the following problem:\n\n$$\\max_{\\mathbf{U}} \\; -\\frac{1}{2}\\|\\mathbf{A}^T\\mathbf{U}\\|_{2,\\infty}^2 + \\langle \\mathbf{U}, \\mathbf{B} \\rangle \\;\\Leftrightarrow\\; \\max_{\\mathbf{U}} \\; \\min_{\\sum_{i=1}^d \\theta_i = 1,\\, \\theta_i \\ge 0} \\; \\langle \\mathbf{U}, \\mathbf{B} \\rangle - \\frac{1}{2}\\sum_{i=1}^d \\theta_i \\mathrm{tr}(\\mathbf{U}^T \\mathbf{G}_i \\mathbf{U}). \\qquad (15)$$\n\nSince Slater\u2019s condition [2] is satisfied, the minimization and maximization in Eq. (15) can be exchanged, resulting in the min-max problem in Eq. (13).\nCorollary 1. Let $(\\boldsymbol{\\theta}^*, \\mathbf{U}^*)$ be the optimal solution to Eq. (13) where $\\boldsymbol{\\theta}^* = (\\theta_1^*, \\cdots, \\theta_d^*)^T$. If $\\theta_i^* > 0$, then $\\|(\\mathbf{A}^T\\mathbf{U}^*)^i\\|_2 = \\|\\mathbf{A}^T\\mathbf{U}^*\\|_{2,\\infty}$.\n\nBased on the solution to the dual problem in Eq. (13), we can construct the optimal solution to the primal problem in Eq. (5) as follows. Let $\\mathbf{W}^*$ be the optimal solution of Eq. (5). It follows from Lemma 2 that we can construct $\\mathbf{W}^*$ based on $\\mathbf{A}^T\\mathbf{U}^*$ as in Eq. (11). Recall that $\\mathbf{W}^*$ must satisfy the equality constraint $\\mathbf{A}\\mathbf{W}^* = \\mathbf{B}$. The main result is summarized in the following theorem:\nTheorem 2. Let $\\mathbf{W}^* = \\mathrm{diag}(\\boldsymbol{\\alpha})\\mathbf{A}^T\\mathbf{U}^*$, where $\\boldsymbol{\\alpha} = [\\alpha_1, \\cdots, \\alpha_d] \\in \\mathbb{R}^d$, $\\alpha_i \\ge 0$, $\\alpha_i > 0$ only if $\\|(\\mathbf{A}^T\\mathbf{U}^*)^i\\|_2 = \\|\\mathbf{A}^T\\mathbf{U}^*\\|_{2,\\infty}$, and $\\sum_{i=1}^d \\alpha_i = 1$. Then, $\\mathbf{A}\\mathbf{W}^* = \\mathbf{B}$ if and only if $(\\boldsymbol{\\alpha}, \\mathbf{U}^*)$ is the optimal solution to the problem in Eq. (13).\nProof. First we assume that $(\\boldsymbol{\\alpha}, \\mathbf{U}^*)$ is the optimal solution to the problem in Eq. (13). It follows that the partial derivative of the objective function in Eq. (13) with respect to $\\mathbf{U}$ is zero at $\\mathbf{U}^*$, that is,\n\n$$\\mathbf{B} - \\mathbf{A}\\,\\mathrm{diag}(\\boldsymbol{\\alpha})\\mathbf{A}^T\\mathbf{U}^* = 0 \\;\\Leftrightarrow\\; \\mathbf{A}\\mathbf{W}^* = \\mathbf{B}.$$\n\nNext we prove the reverse direction by assuming $\\mathbf{A}\\mathbf{W}^* = \\mathbf{B}$. Since $\\mathbf{W}^* = \\mathrm{diag}(\\boldsymbol{\\alpha})\\mathbf{A}^T\\mathbf{U}^*$, we have\n\n$$0 = \\mathbf{B} - \\mathbf{A}\\mathbf{W}^* = \\mathbf{B} - \\mathbf{A}\\,\\mathrm{diag}(\\alpha_1, \\cdots, \\alpha_d)\\mathbf{A}^T\\mathbf{U}^*. \\qquad (16)$$\n\nDefine the function $\\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U})$ as\n\n$$\\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U}) = \\langle \\mathbf{U}, \\mathbf{B} \\rangle - \\frac{1}{2}\\sum_{i=1}^d \\theta_i \\mathrm{tr}(\\mathbf{U}^T \\mathbf{G}_i \\mathbf{U}) = \\sum_{j=1}^n \\left( \\mathbf{u}_j^T \\mathbf{b}_j - \\frac{1}{2}\\sum_{i=1}^d \\theta_i \\mathbf{u}_j^T \\mathbf{G}_i \\mathbf{u}_j \\right).$$\n\nWe consider the function $\\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U})$ with fixed $\\theta_i = \\alpha_i$ ($1 \\le i \\le d$). Note that this function is concave with respect to $\\mathbf{U}$, thus its maximum is achieved when its partial derivative with respect to $\\mathbf{U}$ is zero. It follows from Eq. (16) that $\\frac{\\partial \\phi}{\\partial \\mathbf{U}}$ is zero when $\\mathbf{U} = \\mathbf{U}^*$. Thus, we have\n\n$$\\forall \\mathbf{U}, \\quad \\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}) \\le \\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}^*).$$\n\nWith a fixed $\\mathbf{U} = \\mathbf{U}^*$, $\\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U}^*)$ is a linear combination of $\\theta_i$ ($1 \\le i \\le d$) as:\n\n$$\\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U}^*) = \\langle \\mathbf{U}^*, \\mathbf{B} \\rangle - \\frac{1}{2}\\sum_{i=1}^d \\theta_i \\|(\\mathbf{A}^T\\mathbf{U}^*)^i\\|_2^2.$$\n\nBy the assumption, we have $\\|(\\mathbf{A}^T\\mathbf{U}^*)^i\\|_2 = \\|\\mathbf{A}^T\\mathbf{U}^*\\|_{2,\\infty}$, if $\\alpha_i > 0$. 
Thus, we have\n\n$$\\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}^*) \\le \\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U}^*), \\quad \\forall \\theta_1, \\cdots, \\theta_d \\text{ satisfying } \\sum_{i=1}^d \\theta_i = 1, \\; \\theta_i \\ge 0.$$\n\nTherefore, for any $\\mathbf{U}, \\theta_1, \\cdots, \\theta_d$ such that $\\sum_{i=1}^d \\theta_i = 1$, $\\theta_i \\ge 0$, we have\n\n$$\\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}) \\le \\phi(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}^*) \\le \\phi(\\theta_1, \\cdots, \\theta_d, \\mathbf{U}^*), \\qquad (17)$$\n\nwhich implies that $(\\alpha_1, \\cdots, \\alpha_d, \\mathbf{U}^*)$ is a saddle point of the min-max problem in Eq. (13). Thus, $(\\boldsymbol{\\alpha}, \\mathbf{U}^*)$ is the optimal solution to the problem in Eq. (13).\n\nTheorem 2 shows that we can reconstruct the solution to the primal problem based on the solution to the dual problem in Eq. (13). It paves the way for the efficient implementation based on the min-max formulation in Eq. (13). In this paper, the prox-method [19], which is discussed in detail in the next section, is employed to solve the dual problem in Eq. (13).\nAn interesting observation is that the resulting min-max problem in Eq. (13) is closely related to the optimization problem in multiple kernel learning (MKL) [14]. The min-max problem in Eq. (13) can be reformulated as\n\n$$\\min_{\\sum_{i=1}^d \\theta_i = 1,\\, \\theta_i \\ge 0} \\; \\max_{\\mathbf{u}_1, \\cdots, \\mathbf{u}_n} \\left\\{ \\sum_{j=1}^n \\left( \\mathbf{u}_j^T \\mathbf{b}_j - \\frac{1}{2}\\mathbf{u}_j^T \\mathbf{G} \\mathbf{u}_j \\right) \\right\\}, \\qquad (18)$$\n\nwhere the positive semidefinite (kernel) matrix $\\mathbf{G}$ is constrained as a linear combination of a set of base kernels $\\left\\{ \\mathbf{G}_i = \\mathbf{a}_i \\mathbf{a}_i^T \\right\\}_{i=1}^d$ as $\\mathbf{G} = \\sum_{i=1}^d \\theta_i \\mathbf{G}_i$.\n\nThe formulation in Eq. (18) connects the MMV problem to MKL. Many efficient algorithms [14, 22, 27] have been developed in the past for MKL, which can be applied to solve (13). 
In [27], an extended level set method was proposed to solve MKL, which was shown to outperform the one based on the semi-infinite linear programming formulation [22]. However, the extended level set method involves a linear program in each iteration and its theoretical convergence rate of $O(1/\\sqrt{N})$ ($N$ denotes the number of iterations) is slower than that of the proposed algorithm presented in the next section.\n\n4 The Main Algorithm\n\nWe propose to employ the prox-method [19] to solve the min-max formulation in Eq. (13), which has a differentiable and convex-concave objective function. The algorithm is called \u201cMMVprox\u201d. The prox-method is a first-order method [1, 19] which is specialized for solving the saddle point problem and has a nearly dimension-independent convergence rate of $O(1/N)$ ($N$ denotes the number of iterations). We show that each iteration of MMVprox has a low computational cost, thus it scales to large-size problems.\nThe key idea is to convert the min-max problem to the associated variational inequality (v.i.) problem, which is then iteratively solved by a series of v.i. problems. Let $\\mathbf{z} = (\\boldsymbol{\\theta}, \\mathbf{U})$. The problem in Eq. (13) is equivalent to the following associated v.i. problem [19]:\n\n$$\\text{Find } \\mathbf{z}^* = (\\boldsymbol{\\theta}^*, \\mathbf{U}^*) \\in \\mathcal{S} : \\langle F(\\mathbf{z}^*), \\mathbf{z} - \\mathbf{z}^* \\rangle \\ge 0, \\; \\forall \\mathbf{z} \\in \\mathcal{S}, \\quad \\mathcal{S} = \\mathcal{X} \\times \\mathcal{Y}, \\qquad (19)$$\n\nwhere\n\n$$F(\\mathbf{z}) = \\left( \\frac{\\partial}{\\partial \\boldsymbol{\\theta}} \\phi(\\boldsymbol{\\theta}, \\mathbf{U}), \\; -\\frac{\\partial}{\\partial \\mathbf{U}} \\phi(\\boldsymbol{\\theta}, \\mathbf{U}) \\right) \\qquad (20)$$\n\nis an operator constituted by the gradient of $\\phi(\\cdot, \\cdot)$, $\\mathcal{X} = \\{\\boldsymbol{\\theta} \\in \\mathbb{R}^d : \\|\\boldsymbol{\\theta}\\|_1 = 1, \\theta_i \\ge 0\\}$, and $\\mathcal{Y} = \\mathbb{R}^{m \\times n}$.\nIn solving the v.i. problem in Eq. (19), one key building block is the following projection problem:\n\n$$P_{\\mathbf{z}}(\\bar{\\mathbf{z}}) = \\arg\\min_{\\tilde{\\mathbf{z}} \\in \\mathcal{S}} \\left[ \\frac{1}{2}\\|\\tilde{\\mathbf{z}}\\|_2^2 + \\langle \\tilde{\\mathbf{z}}, \\bar{\\mathbf{z}} - \\mathbf{z} \\rangle \\right], \\qquad (21)$$\n\nwhere $\\bar{\\mathbf{z}} = (\\bar{\\boldsymbol{\\theta}}, \\bar{\\mathbf{U}})$ and $\\tilde{\\mathbf{z}} = (\\tilde{\\boldsymbol{\\theta}}, \\tilde{\\mathbf{U}})$. Denote $(\\boldsymbol{\\theta}^*, \\mathbf{U}^*) = P_{\\mathbf{z}}(\\bar{\\mathbf{z}})$. It is easy to verify that\n\n$$\\boldsymbol{\\theta}^* = \\arg\\min_{\\tilde{\\boldsymbol{\\theta}} \\in \\mathcal{X}} \\frac{1}{2}\\|\\tilde{\\boldsymbol{\\theta}} - (\\boldsymbol{\\theta} - \\bar{\\boldsymbol{\\theta}})\\|_2^2, \\qquad (22)$$\n\nand\n\n$$\\mathbf{U}^* = \\mathbf{U} - \\bar{\\mathbf{U}}. \\qquad (23)$$\n\nFollowing [19], we present the pseudocode of the proposed MMVprox algorithm in Algorithm 1. In each iteration, we compute the projection (21) so that $\\mathbf{w}_{t,s}$ is sufficiently close to $\\mathbf{w}_{t,s-1}$ (controlled by the parameter $\\delta$). It has been shown in [19] that, when $\\gamma \\le \\frac{1}{\\sqrt{2}L}$ [$L$ denotes the Lipschitz continuous constant of the operator $F(\\cdot)$], the inner iteration converges within two iterations, i.e., $\\mathbf{w}_{t,2} = \\mathbf{w}_{t,1}$ always holds. Moreover, Algorithm 1 has a global dimension-independent convergence rate of $O(1/N)$.\n\nAlgorithm 1 The MMVprox Algorithm\nInput: $\\mathbf{A}$, $\\mathbf{B}$, $\\gamma$, $\\mathbf{z}_0 = (\\boldsymbol{\\theta}_0, \\mathbf{U}_0)$, and $\\delta$\nOutput: $\\boldsymbol{\\theta}$, $\\mathbf{U}$ and $\\mathbf{W}$.\nStep $t$ ($t \\ge 1$): Set $\\mathbf{w}_{t,0} = \\mathbf{z}_{t-1}$ and find the smallest $s = 1, 2, \\ldots$ such that\n    $\\mathbf{w}_{t,s} = P_{\\mathbf{z}_{t-1}}(\\gamma F(\\mathbf{w}_{t,s-1}))$, $\\|\\mathbf{w}_{t,s} - \\mathbf{w}_{t,s-1}\\|_2 \\le \\delta$.\nSet $\\mathbf{z}_t = \\mathbf{w}_{t,s}$\nFinal Step: Set $\\boldsymbol{\\theta} = \\frac{1}{t}\\sum_{i=1}^t \\boldsymbol{\\theta}_i$, $\\mathbf{U} = \\frac{1}{t}\\sum_{i=1}^t \\mathbf{U}_i$, $\\mathbf{W} = \\mathrm{diag}(\\boldsymbol{\\theta})\\mathbf{A}^T\\mathbf{U}$.\n\nTime Complexity It costs $O(dmn)$ to evaluate the operator $F(\\cdot)$ at a given point. $\\boldsymbol{\\theta}^*$ in Eq. (22) involves the Euclidean projection onto the simplex [1], which can be solved in linear time, i.e., in $O(d)$; and $\\mathbf{U}^*$ in Eq. (23) can be analytically computed in $O(mn)$ time. Recall that at each iteration $t$, the inner iteration is at most 2. 
Thus, the time complexity for any given outer iteration is $O(dmn)$. Our analysis shows that MMVprox scales to large-size problems.\nIn comparison, second-order methods such as SOCP solvers have a much higher complexity per iteration. According to [15], the SOCP in Eq. (6) costs $O(d^3(n + 1)^3)$ per iteration. In MMV, $d$ is typically larger than $m$. In this case, the proposed MMVprox algorithm has a much smaller cost per iteration than SOCP. This explains why MMVprox scales better than SOCP, as shown in our experiments in the next section.\n\nTable 1: The averaged recovery results over 10 experiments ($d = 100$, $m = 50$, and $n = 80$).\n\nData set    $\\sqrt{\\|\\mathbf{W} - \\mathbf{W}_p\\|_F^2/(dn)}$    $\\sqrt{\\|\\mathbf{A}\\mathbf{W}_p - \\mathbf{B}\\|_F^2/(mn)}$\n1           3.2723e-6    1.4467e-5\n2           3.4576e-6    1.8234e-5\n3           2.6971e-6    1.4464e-5\n4           2.4099e-6    1.4460e-5\n5           2.9611e-6    1.4463e-5\n6           2.5701e-6    1.4459e-5\n7           2.0884e-6    1.4469e-5\n8           2.3454e-6    1.4475e-5\n9           2.6807e-6    1.4461e-5\n10          2.7172e-6    1.4481e-5\nMean        2.7200e-6    1.4843e-5\nStd         4.1728e-7    1.1914e-6\n\n5 Experiments\n\nIn this section, we conduct simulations to evaluate the proposed MMVprox algorithm in terms of the recovery quality and scalability.\nExperiment Setup We generated a set of synthetic data sets (by varying the values of $m$, $n$, and $d$) for our experiments: the entries in $\\mathbf{A} \\in \\mathbb{R}^{m \\times d}$ were independently generated from the standard normal distribution $\\mathcal{N}(0, 1)$; $\\mathbf{W} \\in \\mathbb{R}^{d \\times n}$ (the ground truth of the recovery problems) was generated in two steps: (1) randomly select $k$ rows with nonzero entries; (2) randomly generate the entries of those $k$ rows from $\\mathcal{N}(0, 1)$. We denote by $\\mathbf{W}_p$ the solution obtained from the proposed MMVprox algorithm. Ideally, $\\mathbf{W}_p$ should be close to $\\mathbf{W}$. Our experiments were performed on a PC with an Intel Core 2 Duo T9500 2.6 GHz CPU and 4 GB RAM. We employed the optimization package SeDuMi [23] for solving the SOCP formulation. All codes were implemented in Matlab. 
In all experiments, we terminate MMVprox when the change of the consecutive approximate solutions is less than 1e-6.\nRecovery Quality In this experiment, we evaluate the recovery quality of the proposed MMVprox algorithm. We applied MMVprox on the data sets of size $d = 100$, $m = 50$, $n = 80$, and reported the averaged experimental results over 10 random repetitions. We measured the recovery quality in terms of the root mean squared error: $\\sqrt{\\|\\mathbf{W} - \\mathbf{W}_p\\|_F^2/(dn)}$. We also reported $\\sqrt{\\|\\mathbf{A}\\mathbf{W}_p - \\mathbf{B}\\|_F^2/(mn)}$, which measures the violation of the constraint in Eq. (5). The experimental results are presented in Table 1. We can observe from the table that MMVprox recovers the sparse signal successfully in all cases.\nNext, we study how the recovery error changes as the sparsity of $\\mathbf{W}$ varies. Specifically, we applied MMVprox on the data sets of size $d = 100$, $m = 400$, and $n = 10$ with $k$ (the number of nonzero rows of $\\mathbf{W}$) varying from $0.05d$ to $0.7d$, and used $\\sqrt{\\|\\mathbf{W} - \\mathbf{W}_p\\|_F^2/(dn)}$ as the recovery quality measure. The averaged experimental results over 20 random repetitions are presented in Figure 1. We can observe from the figure that MMVprox works well in all cases, and a larger $k$ (less sparse $\\mathbf{W}$) tends to result in a larger recovery error.\n\nFigure 1: The increase of the recovery error as the sparsity level decreases (x-axis: $k/d$; y-axis: $\\sqrt{\\|\\mathbf{W} - \\mathbf{W}_p\\|_F^2/(dn)}$).\n\nScalability In this experiment, we study the scalability of the proposed MMVprox algorithm. We generated a collection of data sets by varying $m$ from 10 to 200 with a step size of 10, and setting $n = 2m$ and $d = 4m$ accordingly. We applied SOCP and MMVprox on the data sets and recorded their computation time. 
The experimental results are presented in Figure 2 (a), where the x-axis corresponds to the value of $m$, and the y-axis corresponds to $\\log(t)$, where $t$ denotes the computation time (in seconds). We can observe from the figure that the computation time of both algorithms increases as $m$ increases and SOCP is faster than MMVprox on small problems ($m \\le 40$); when $m > 40$, MMVprox outperforms SOCP; when the value of $m$ is large ($m > 80$), the SOCP formulation cannot be solved by SeDuMi, while MMVprox can still be applied. This experimental result demonstrates the good scalability of the proposed MMVprox algorithm in comparison with the SOCP formulation.\n\nFigure 2: Scalability comparison of MMVprox and SOCP: (a) the computation time for both algorithms as the problem size varies; and (b) the average computation time of each iteration for both algorithms as the problem size varies. The x-axis denotes the value of $m$, and the y-axis denotes the computation time in seconds (in log scale).\n\nTo further examine the scalability of both algorithms, we compare the execution time of each iteration for both SOCP and the proposed algorithm. We use the same setting as in the last experiment, i.e., $n = 2m$, $d = 4m$, and $m$ ranges from 10 to 200 with a step size of 10. The time comparison of SOCP and MMVprox is presented in Figure 2 (b). We observe that MMVprox has a significantly lower cost than SOCP in each iteration (note that SOCP is not applicable for $m > 80$). This is consistent with our complexity analysis in Section 4.\nWe can observe from Figure 2 that when $m$ is small, the computation time of SOCP and MMVprox is comparable, although MMVprox is much faster in each iteration. This is because MMVprox is a first-order method, which has a slower convergence rate than the second-order method SOCP. Thus, there is a tradeoff between scalability and convergence rate. 
Our experiments show the advantage of MMVprox for large-size problems.\n\n6 Conclusions\n\nIn this paper, we consider the (2, 1)-norm minimization for the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal consists of a set of jointly sparse vectors. Existing algorithms formulate it as second-order cone programming or semidefinite programming, which is computationally expensive to solve for problems of moderate size. We propose an equivalent dual formulation for the (2, 1)-norm minimization in the MMV model, and develop the MMVprox algorithm for solving the dual formulation based on the prox-method. In addition, our theoretical analysis reveals the close connection between the proposed dual formulation and multiple kernel learning. Our simulation studies demonstrate the effectiveness of the proposed algorithm in terms of recovery quality and scalability. In the future, we plan to compare existing solvers for multiple kernel learning [14, 22, 27] with the proposed MMVprox algorithm. In addition, we plan to examine the efficiency of the prox-method for solving various MKL formulations.\n\nAcknowledgements\n\nThis work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, NIH R01-HG002516, NGA HM1582-08-1-0016, and NSFC 60905035.\n\nReferences\n\n[1] A. Ben-Tal and A. Nemirovski. Non-Euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407\u2013456, 2005.\n[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.\n[3] E. Cand\u00e8s. Compressive sampling. In International Congress of Mathematics, number 3, pages 1433\u20131452, Madrid, Spain, 2006.\n[4] E. Cand\u00e8s, J. Romberg, and T. Tao. 
Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489\u2013509, 2006.\n[5] J. Chen and X. Huo. Theoretical results on sparse representations of multiple-measurement vectors. IEEE Transactions on Signal Processing, 54(12):4634\u20134643, 2006.\n[6] S.F. Cotter and B.D. Rao. Sparse channel estimation via matching pursuit with application to equalization. IEEE Transactions on Communications, 50(3):374\u2013377, 2002.\n[7] S.F. Cotter, B.D. Rao, K. Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, 53(7):2477\u20132488, 2005.\n[8] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289\u20131306, 2006.\n[9] D.L. Duttweiler. Proportionate normalized least-mean-squares adaptation in echo cancelers. IEEE Transactions on Speech and Audio Processing, 8(5):508\u2013518, 2000.\n[10] Y.C. Eldar and M. Mishali. Robust recovery of signals from a structured union of subspaces. To appear in IEEE Transactions on Information Theory, 2009.\n[11] S. Erickson and C. Sabatti. Empirical Bayes estimation of a sparse vector of gene expression changes. Statistical Applications in Genetics and Molecular Biology, 4(1):22, 2008.\n[12] I.F. Gorodnitsky, J.S. George, and B.D. Rao. Neuromagnetic source imaging with FOCUSS: a recursive weighted minimum norm algorithm. Electroencephalography and Clinical Neurophysiology, 95(4):231\u2013251, 1995.\n[13] H. Jean-Baptiste and C. Lemarechal. Convex Analysis and Minimization Algorithms I: Fundamentals (Grundlehren Der Mathematischen Wissenschaften). Springer, Berlin, 1993.\n[14] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. 
Journal of Machine Learning Research, 5:27\u201372, 2004.\n[15] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284(1-3):193\u2013228, 1998.\n[16] M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. CoRR, 2008.\n[17] D. Malioutov, M. Cetin, and A. Willsky. Source localization by enforcing sparsity through a Laplacian. In IEEE Workshop on Statistical Signal Processing, pages 553\u2013556, 2003.\n[18] M. Mishali and Y.C. Eldar. Reduce and boost: Recovering arbitrary sets of jointly sparse vectors. IEEE Transactions on Signal Processing, 56(10):4692\u20134702, 2008.\n[19] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229\u2013251, 2005.\n[20] Y.E. Nesterov and A.S. Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. SIAM Publications, Philadelphia, PA, 1994.\n[21] R.G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. Submitted to IEEE Transactions on Information Theory, 2008.\n[22] S. Sonnenburg, G. R\u00e4tsch, C. Sch\u00e4fer, and B. Sch\u00f6lkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531\u20131565, 2006.\n[23] J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(12):625\u2013653, 1999.\n[24] J.A. Tropp. Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing, 86(3):589\u2013602, 2006.\n[25] J.A. Tropp, A.C. Gilbert, and M.J. Strauss. Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86(3):572\u2013588, 2006.\n[26] E. 
van den Berg and M.P. Friedlander. Joint-sparse recovery from multiple measurements. Technical Report, Department of Computer Science, University of British Columbia, 2009.\n[27] Z. Xu, R. Jin, I. King, and M.R. Lyu. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems, pages 1825\u20131832, 2008.\n", "award": [], "sourceid": 1056, "authors": [{"given_name": "Liang", "family_name": "Sun", "institution": null}, {"given_name": "Jun", "family_name": "Liu", "institution": null}, {"given_name": "Jianhui", "family_name": "Chen", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}]}