{"title": "Sparse Estimation with Structured Dictionaries", "book": "Advances in Neural Information Processing Systems", "page_first": 2016, "page_last": 2024, "abstract": "In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit $\\ell_2$ norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionary-dependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the $\\ell_1$ norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions.", "full_text": "Sparse Estimation with Structured Dictionaries\n\nDavid P. Wipf \u2217\n\nVisual Computing Group\nMicrosoft Research Asia\n\ndavidwipf@gmail.com\n\nAbstract\n\nIn the vast majority of recent work on sparse estimation algorithms, performance\nhas been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian\nor Fourier) characterized by unit \u21132 norm, incoherent columns or features. 
But in\nreality, these types of dictionaries represent only a subset of the dictionaries that\nare actually used in practice (largely restricted to idealized compressive sensing\napplications). In contrast, herein sparse estimation is considered in the context\nof structured dictionaries possibly exhibiting high coherence between arbitrary\ngroups of columns and/or rows. Sparse penalized regression models are analyzed\nwith the purpose of \ufb01nding, to the extent possible, regimes of dictionary invari-\nant performance. In particular, a Type II Bayesian estimator with a dictionary-\ndependent sparsity penalty is shown to have a number of desirable invariance\nproperties leading to provable advantages over more conventional penalties such\nas the \u21131 norm, especially in areas where existing theoretical recovery guarantees\nno longer hold. This can translate into improved performance in applications such\nas model selection with correlated features, source localization, and compressive\nsensing with constrained measurement directions.\n\n1\n\nIntroduction\n\nWe begin with the generative model\n\nY = \u03a6X0 + E,\n\n(1)\nwhere \u03a6 \u2208 Rn\u00d7m is a dictionary of basis vectors or features, X0 \u2208 Rm\u00d7t is a matrix of unknown\ncoef\ufb01cients we would like to estimate, Y \u2208 Rn\u00d7t is an observed signal matrix, and E is a noise\nmatrix with iid elements distributed as N (0, \u03bb). The objective is to estimate the unknown genera-\ntive X0 under the assumption that it is row-sparse, meaning that many rows of X0 have zero norm.\nThe problem is compounded considerably by the additional assumption that m > n, meaning the\ndictionary \u03a6 is overcomplete. When t = 1, this then reduces to the canonical sparse estimation of\na coef\ufb01cient vector with mostly zero-valued entries or minimal \u21130 norm [7]. 
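As a concrete instance of the generative model (1), the short sketch below draws a row-sparse X0 and forms Y; the sizes and noise level are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, k = 20, 40, 3, 4          # m > n: overcomplete dictionary
lam = 0.01                         # noise variance lambda

Phi = rng.standard_normal((n, m))                  # dictionary of features
X0 = np.zeros((m, t))
support = rng.choice(m, size=k, replace=False)
X0[support] = rng.standard_normal((k, t))          # k nonzero rows
E = np.sqrt(lam) * rng.standard_normal((n, t))     # iid N(0, lam) noise
Y = Phi @ X0 + E                                   # observations

# d(X) from (2): the number of rows of X with nonzero norm.
d = int((np.linalg.norm(X0, axis=1) > 0).sum())
print(Y.shape, d)
```

Recovering X0 (or at least its row-sparsity support) from Y and Phi is the estimation problem studied throughout the paper.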
In contrast, estimation of X0 with t > 1 represents the more general simultaneous sparse approximation problem [6, 15] relevant to numerous applications such as compressive sensing and multi-task learning [9, 16], manifold learning [13], array processing [10], and functional brain imaging [1]. We will consider both scenarios herein but will primarily adopt the more general notation of the t > 1 case.

One possibility for estimating X0 involves solving

min_X ||Y − ΦX||_F² + λ d(X),  λ > 0,  d(X) ≜ Σ_{i=1}^m I[||x_i·|| > 0],   (2)

where the indicator function I[||x|| > 0] equals one if ||x|| > 0 and equals zero otherwise (||x|| is an arbitrary vector norm). d(X) penalizes the number of rows in X that are not equal to zero; for nonzero rows there is no additional penalty for large magnitudes. Moreover, it reduces to the ℓ0 norm when t = 1, i.e., d(x) = ||x||_0, or a count of the nonzero elements in the vector x. Note that to facilitate later analysis, we define x·i as the i-th column of matrix X while x_i· represents the i-th row. For theoretical inquiries or low-noise environments, it is often convenient to consider the limit as λ → 0, in which case (2) reduces to

min_X d(X)  s.t. ΦX0 = ΦX.   (3)

*Draft version for NIPS 2011 pre-proceedings.

Unfortunately, solving either (2) or (3) involves a combinatorial search and is therefore not tractable in practice. Instead, a family of more convenient sparse penalized regression cost functions is reviewed in Section 2. In particular, we discuss conventional Type I sparsity penalties, such as the ℓ1 norm and the ℓ1,2 mixed norm, and a Type II empirical Bayesian alternative characterized by dictionary dependency. 
When the dictionary Φ is incoherent, meaning the columns are roughly orthogonal to one another, then certain Type I selections are well-known to produce good approximations of X0 via efficient implementations. However, as discussed in Section 3, more structured dictionary types can pose difficulties. In Section 4 we analyze the underlying cost functions of Type I and Type II, and demonstrate that the latter maintains several properties that suggest it will be robust to highly structured dictionaries. Brief empirical comparisons are presented in Section 5.

2 Estimation via Sparse Penalized Regression

Directly solving either (2) or (3) is intractable, so a variety of approximate methods have been proposed. Many of these can be viewed simply as regression with a sparsity penalty convenient for optimization purposes. The general regression problem we consider here involves solving

min_X ||Y − ΦX||_F² + λ g(X),   (4)

where g is some penalty function of the row norms. Type I methods use a separable penalty of the form

g(I)(X) = Σ_i h(||x_i·||_2),   (5)

where h is a non-decreasing, typically concave function.¹ Common examples include h(z) = z^p, p ∈ (0, 1] [11] and h(z) = log(z + α), α ≥ 0 [4]. The parameters p and α are often heuristically selected on an application-specific basis. In contrast, Type II methods, with origins as empirical Bayesian estimators, implicitly utilize a more complicated penalty function that can only be expressed in a variational form [18]. Herein we will consider the selection

g(II)(X) ≜ min_{Γ⪰0} Tr[XᵀΓ⁻¹X] + t log|αI + ΦΓΦᵀ|,  α ≥ 0,   (6)

where Γ is a diagonal matrix of non-negative variational parameters [14, 18]. 
While less transparent than Type I, it has been shown that (6) is a concave non-decreasing function of each row norm of X, hence it promotes row sparsity as well. Moreover, the dictionary-dependency of this penalty appears to be the source of some desirable invariance properties as discussed in Section 4. Analogous to (3), for analytical purposes all of these methods can be reduced as λ → 0 to solving

min_X g(X)  s.t. ΦX0 = ΦX.   (7)

3 Structured Dictionaries

It is now well-established that when the dictionary Φ is constructed with appropriate randomness, e.g., iid Gaussian entries, then for certain choices of g, in particular the convex selection g(X) = Σ_i ||x_i·||_2 (which represents a generalization of the ℓ1 vector norm to row-sparse matrices), we can expect to recover X0 exactly in the noiseless case or to close approximation otherwise. This assumes that d(X0) is sufficiently small relative to some function of the dictionary coherence or a related measure. However, with highly structured dictionaries these types of performance guarantees completely break down.

¹Other row norms, such as the ℓ∞, have been considered as well but are less prevalent.

At the most basic level, one attempt to standardize structured dictionaries is by utilizing some form of column normalization as a pre-processing step. Most commonly, each column is scaled such that it has unit ℓ2 norm. This helps ensure that no one column is implicitly favored over another during the estimation process. However, suppose our observation matrix is generated via Y = ΦX0, where Φ = Φ̃D + σabᵀ; here Φ̃ is some well-behaved, incoherent dictionary, D is a diagonal matrix, and σabᵀ represents a rank one adjustment. If we apply column normalization to remove the effect of D, the resulting scale factors will be dominated by the rank one term when σ is large. 
But if we do not column normalize, then D (recall Φ = Φ̃D + σabᵀ, with Φ̃ a well-behaved, incoherent dictionary and D diagonal) can completely bias the estimation results.

In general, if our given dictionary is effectively WΦ̃D, with W an arbitrary invertible matrix that scales and correlates rows, and D diagonal, the combined effect can be severely disruptive. As an example from neuroimaging, the MEG/EEG source localization problem involves estimating sparse neural current sources within the brain using sensors placed near the surface of the scalp. The effective dictionary or forward model is characterized by highly correlated rows (because the sensors are physically constrained to be near one another) and columns with drastically different scales (since deep brain sources produce much weaker signals at the surface than superficial ones).

More problematic is the situation where Φ = Φ̃S, since an unrestricted matrix S can introduce arbitrary coherence structure between individual or groups of columns in Φ, meaning the structure of Φ is now arbitrary regardless of how well-behaved the original Φ̃.

4 Analysis

We will now analyze the properties of both Type I and Type II cost functions when coherent or highly structured dictionaries are present. Ideally, we would like to arrive at algorithms that are invariant, to the extent possible, to dictionary transformations that would otherwise disrupt the estimation efficacy. For simplicity, we will primarily consider the noiseless case, although we surmise that much of the underlying intuition carries over into the noisy domain. This strategy mirrors the progression in the literature of previous sparse estimation theory related to the ℓ1 norm [3, 7, 8]. 
All proofs have been deferred to the Appendix, with some details omitted for brevity.

4.1 Invariance to W and D

We will first consider the case where the observation matrix is produced via Y = ΦX0 = WΦ̃DX0. Later in Sections 4.2 and 4.3 we will then address the more challenging situation where Φ = Φ̃S.

Lemma 1. Let W denote an arbitrary full-rank n × n matrix and D an arbitrary full-rank m × m diagonal matrix. Then with α → 0, the Type II optimization problem

min_X g(II)(X)  s.t. WΦ̃DX0 = WΦ̃DX   (8)

is invariant to W and D in the sense that if X* is a global (or local) minimum to (8), then D⁻¹X* is a global (or local) minimum when we optimize g(II)(X) subject to the constraint Φ̃X0 = Φ̃X.

Therefore, while switching between Φ = WΦ̃D and Φ = Φ̃ may influence the initialization and possibly the update rules of a particular Type II algorithm, it does not fundamentally alter the underlying cost function. In contrast, Type I methods do not satisfy this invariance. Invariance is preserved with a W factor in isolation. Likewise, inclusion of a D factor alone with column normalization leads to invariance. However, inclusion of both W and D together can be highly disruptive.

Note that for improving Type I performance, it is not sufficient to apply some row decorrelating and normalizing Ŵ⁻¹ to Φ and then column normalize with some D̂⁻¹. This is because the application of D̂⁻¹ will disrupt the effects of Ŵ⁻¹. But one possibility to compensate for dictionary structure is to jointly learn a Ŵ⁻¹ and D̂⁻¹ that produce a Φ satisfying: (i) ΦΦᵀ = CI (meaning rows have a constant ℓ2 norm of C and are uncorrelated), and (ii) ||φ·i||_2 = 1 for all i. Up to irrelevant scale factors, a unique such transformation will always exist. 
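The paper does not spell out the algorithm used to learn this joint transformation; the following is a simple alternating sketch (exact row-whitening followed by exact column normalization, repeated until both properties approximately hold), offered only as an illustration of the idea:

```python
import numpy as np

def standardize(Phi, iters=500):
    """Alternate row-whitening and column normalization so the transformed
    dictionary has (approximately) uncorrelated, constant-norm rows and
    exactly unit l2-norm columns."""
    for _ in range(iters):
        C = Phi @ Phi.T
        evals, evecs = np.linalg.eigh(C)
        W_inv = evecs @ np.diag(evals ** -0.5) @ evecs.T   # C^{-1/2}
        Phi = W_inv @ Phi                                  # now Phi Phi^T = I
        Phi = Phi / np.linalg.norm(Phi, axis=0)            # unit columns
    return Phi

rng = np.random.default_rng(0)
n, m = 20, 40
# A structured dictionary W Phi_tilde D: correlated rows, scaled columns.
W = rng.standard_normal((n, n))
D = np.diag(np.logspace(0, -2, m))
Phi = W @ rng.standard_normal((n, m)) @ D

Psi = standardize(Phi)
off = Psi @ Psi.T - (m / n) * np.eye(n)   # deviation from Psi Psi^T = (m/n) I
print(np.abs(off).max())
```

Since the columns are unit norm at the fixed point, trace(ΨΨᵀ) = m, so the constant C in property (i) is m/n here.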
In Section 5 we empirically demonstrate that this can be a highly effective strategy for improving the performance of Type I methods. However, as a final point, we should mention that the invariance Type II exhibits towards W and D (or any corrected form of Type I) will no longer strictly hold once noise is added.

4.2 Invariance to S: The t > 1 Case (Simultaneous Sparse Approximation)

We now turn to the potentially more problematic scenario with Φ = Φ̃S. We will assume that S is arbitrary with the only restriction being that the resulting Φ satisfies spark[Φ] = n + 1, where matrix spark quantifies the smallest number of linearly dependent columns [7]. Consequently, the spark condition is equivalent to saying that each n × n sub-matrix of Φ is full rank. This relatively weak assumption is adopted for simplicity; in many cases it can be relaxed.

Lemma 2. Let Φ be an arbitrary dictionary with spark[Φ] = n + 1 and X0 a coefficient matrix with d(X0) < n. Then there exists a constant ρ > 0 such that the optimization problem (7), with g(X) = g(II)(X) and α → 0, has no local minima and a unique, global solution at X0 if (x0)_i·ᵀ(x0)_j· ≤ ρ for all i ≠ j (i.e., the nonzero rows of X0 are below some correlation threshold). Also, if we enforce exactly zero row-wise correlations, meaning ρ = 0, then a minimizing solution X* will satisfy ||x*_i·||_2 = ||(x0)_i·||_2 for all i (i.e., a matching row-sparsity support), even for d(X0) ≥ n. This solution will be unique whenever ΦX0X0ᵀΦᵀ = ΦΓΦᵀ has a unique solution for some non-negative, diagonal Γ.²

Corollary 1. 
There will always exist dictionaries Φ and coefficients X0, consistent with the conditions from Lemma 2, such that the optimization problem (7) with any possible g(X) of the form g(I)(X) = Σ_i h(||x_i·||_2) will have minimizing solutions not equal to X0 (with or without column normalization).

In general, Lemma 2 suggests that for estimation purposes uncorrelated rows in X0 can potentially compensate for troublesome dictionary structure, and together with Corollary 1 it also describes a potential advantage of Type II over Type I. Of course this result only stipulates sufficient conditions for recovery that are certainly not necessary, i.e., effective sparse recovery is possible even with correlated rows (more on this below). We also emphasize that the final property of Lemma 2 implies that the row norms of X0 (and therefore the row-sparsity support) can still be recovered even up to the extreme case of d(X0) = m > n. While this may seem surprising at first, especially since even brute force minimization of (3) cannot achieve a similar feat, it is important to keep in mind that (3) is blind to the correlation structure of X0. Although Type II does not explicitly require any such structure, it is able to outperform (3) by implicitly leveraging this structure when the situation happens to be favorable. While space prevents a full treatment, in the context of MEG/EEG source estimation, we have successfully localized 500 nonzero sources (rows) using a 100 × 1000 dictionary.

However, what about the situation where strong correlations do exist between the nonzero rows of X0? A couple of things are worth mentioning in this regard. First, Lemma 2 can be strengthened considerably via the expanded optimization problem: min_{X,B} g(II)(X) s.t. ΦX0 = ΦXB, which achieves a result similar to Lemma 2 but with a weaker correlation condition (although the row-norm recovery property is lost). 
Secondly, in the case of perfect correlation between rows (the hardest case), the problem reduces to an equivalent one with t = 1, i.e., it exactly reduces to the canonical sparse recovery problem. We address this situation next.

4.3 Invariance to S: The t = 1 Case (Standard Sparse Approximation)

This section considers the t = 1 case, meaning Y = y and X0 = x0 are now vectors. For convenience, we define X(S,P) as the set of all coefficient vectors in ℝᵐ with support (or nonzero coefficient locations) specified by the index set S ⊂ {1, . . . , m} and sign pattern given by P ∈ {−1, +1}^|S| (here the |·| operator denotes the cardinality of a set).

Lemma 3. Let Φ be an arbitrary dictionary with spark[Φ] = n + 1. Then for any X(S,P) with |S| < n, there exists a non-empty subset X̄ ⊂ X(S,P) (with nonzero Lebesgue measure), such that if x0 ∈ X̄, the Type II minimization problem

min_x g(II)(x)  s.t. Φx0 = Φx,  α → 0   (9)

will have a unique minimum and it will be located at x0.

²See Appendix for more details about this condition. In most situations, it will hold if m < n(n + 1)/2, and likely for many instances with m even greater than this.

This Lemma can be obtained with a slight modification of results in [18]. In other words, no matter how poorly structured a particular dictionary is with regard to a given sparsity profile, there will always be sparse coefficients we are guaranteed to recover (provided we utilize a convergent algorithm). In contrast, an equivalent claim cannot be made for Type I:

Lemma 4. Given an arbitrary Type I penalty g(I)(x) = Σ_i h(|x_i|), with h a fixed, non-decreasing function, there will always exist a dictionary Φ (with or without normalized columns) and set X(S,P) such that for any x0 ∈ X(S,P), the problem

min_x g(I)(x)  s.t. 
Φx0 = Φx   (10)

will not have a unique minimum located at x0.

This can happen because the global minimum does not equal x0 and/or because of the presence of local minima. Of course this does not necessarily imply that a particular Type I algorithm will fail. For example, even with multiple minima, an appropriate optimization strategy could conceivably still locate an optimum that coincides with x0. While it is difficult to analyze all possible algorithms, we can address one influential variety based on iterative reweighted ℓ1 minimization [4, 18]. Here the idea is that if h is concave and differentiable, then a convergent means of minimizing (10) is to utilize a first-order Taylor series approximation of g(I)(x) at some point x̂. This leads to an iterative procedure where at each step we must first compute h′_i ≜ dh(z)/dz|_{z=|x̂_i|} and then minimize Σ_i h′_i |x_i| subject to Φx0 = Φx to update x̂. This method produces a sparse estimate at each iteration and is guaranteed to converge to a local minimum (or stationary point) of (10). However, this solution may be suboptimal in the following sense:

Corollary 2. Given an arbitrary g(I)(x) as in Lemma 4, there will always exist a Φ and X(S,P), such that for any x0 ∈ X(S,P), iterative reweighted ℓ1 minimization will not converge to x0 when initialized at the minimum ℓ1 norm solution.

Note that this failure does not result from a convergence pathology. Rather, the presence of minima different from x0 explicitly disrupts the algorithm.

In general, with highly structured dictionaries deviating from the ideal, the global minimum of convex penalties often does not correspond with x0 as theoretical equivalence results break down. This in turn suggests the use of concave penalty functions to seek possible improvement. 
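The reweighted ℓ1 procedure described above is easy to sketch concretely. The version below uses the weights h′(|x̂_i|) = 1/(|x̂_i| + ε) corresponding to the h(z) = log(z + ε) selection; each weighted ℓ1 subproblem is solved as a linear program (scipy is an assumed dependency, and the problem sizes are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1(Phi, y, w):
    """Solve min_x sum_i w_i |x_i|  s.t.  Phi x = y as a linear program,
    splitting x = u - v with u, v >= 0."""
    n, m = Phi.shape
    res = linprog(np.concatenate([w, w]), A_eq=np.hstack([Phi, -Phi]),
                  b_eq=y, bounds=(0, None), method="highs")
    assert res.success
    return res.x[:m] - res.x[m:]

def reweighted_l1(Phi, y, iters=5, eps=1e-3):
    """Iterative reweighted l1 with weights 1/(|x_i| + eps); the first pass
    (uniform weights) is the minimum l1-norm solution."""
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        x = weighted_l1(Phi, y, 1.0 / (np.abs(x) + eps))
    return x

rng = np.random.default_rng(0)
n, m, k = 25, 50, 3
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)       # unit l2-norm columns
x0 = np.zeros(m)
x0[rng.choice(m, k, replace=False)] = rng.standard_normal(k)
x_hat = reweighted_l1(Phi, Phi @ x0)
print(np.linalg.norm(x_hat - x0))
```

On an incoherent Gaussian dictionary like this one, the scheme recovers x0; Corollary 2 concerns structured dictionaries where the same iteration becomes trapped away from x0.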
However, as illustrated by the following result, even on the simplest of sparse recovery problems, namely estimating some x0 with only one nonzero element using a dictionary with a 1D null-space, Type I can be characterized by problematic local minima when (strictly) concave penalties are used. For this purpose we define φ* as an arbitrary column of Φ and Φ̄* as all columns of Φ excluding φ*.

Lemma 5. Let h denote a concave, non-decreasing function with h′_max ≜ lim_{z→0} dh(z)/dz and h′_min ≜ lim_{z→∞} dh(z)/dz. Also, let Φ be a dictionary with unit ℓ2 norm columns and spark[Φ] = m = n + 1 (i.e., a 1D null-space), and let x0 satisfy ||x0||_0 = 1 with associated φ*. Then the Type I problem (10) can have multiple local minima if

h′_max / h′_min > ||Φ̄*⁻¹ φ*||_1.   (11)

This result has a very clear interpretation related to how dictionary coherence can potentially disrupt even the most rudimentary of estimation tasks. The righthand side of (11) is bounded from below by 1, which is approached whenever one or more columns in some Φ̄* are similar to φ* (i.e., coherent). Thus, even the slightest amount of curvature (or strict concavity) in h can lead to the inequality being satisfied when highly coherent columns are present. While obviously with h(z) = z this will not be an issue (consistent with the well-known convexity of the ℓ1 problem), for many popular non-convex penalties, this gradient ratio may be large relative to the righthand side, indicating that local minima are always possible. For example, with the h(z) = log(z + α) selection from [4], h′_min → 0 for all α while h′_max = 1/α. We note that Type II has provably no local minima in this regime (this follows as a special case of Lemma 3). 
Of course the point here is not that Type I algorithms are incapable of solving simple problems with ||x0||_0 = 1 (and any iterative reweighted ℓ1 scheme will succeed on the first step anyway). Rather, Lemma 5 merely demonstrates how highly structured dictionaries can begin to have negative effects on Type I, potentially more so than with Type II, even on trivial tasks. The next section will empirically explore this conjecture.

5 Empirical Results

We now present two simulation examples illustrating the potential benefits of Type II with highly structured dictionaries. In the first experiment, the dictionary represents an MEG leadfield, which at a high level can be viewed as a mapping from the electromagnetic (EM) activity within m brain voxels to n sensors placed near the scalp surface. Computed using Maxwell's equations and a spherical shell head model [12], the resulting Φ is characterized by highly correlated rows, because the small scalp surface requires that sensors be placed close together, and vastly different column norms, since the EM field strength drops off rapidly for deep brain sources. These effects are well represented by a dictionary such as Φ = WΦ̃D as discussed previously. Figure 1 (Left) displays trial-averaged results comparing Type I algorithms with Type II using such an MEG leadfield dictionary. Data generation proceeded as follows: We produce Φ by choosing 50 random sensor locations and 100 random voxels within the brain volume. We then create a coefficient matrix X0 with t = 5 columns and d(X0) an experiment-dependent parameter. Nonzero rows of X0 are drawn iid from a unit Gaussian distribution. The observation matrix is then computed as Y = ΦX0. We run each algorithm and attempt to estimate X0, calculating the probability of success averaged over 200 trials as d(X0) is varied from 10 to 50. 
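An actual MEG leadfield requires a forward-model solver, but a structured surrogate with the same qualitative properties (correlated rows, widely varying column scales) is easy to simulate; the construction below is purely illustrative, with only the matrix dimensions taken from the experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t, k = 50, 100, 5, 10

Phi_tilde = rng.standard_normal((n, m))            # incoherent base dictionary
C = 0.95 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
W = np.linalg.cholesky(C)                          # W W^T = C: correlates rows
D = np.diag(np.logspace(0, -3, m))                 # column scales span 3 decades
Phi = W @ Phi_tilde @ D                            # structured dictionary

X0 = np.zeros((m, t))
X0[rng.choice(m, k, replace=False)] = rng.standard_normal((k, t))
Y = Phi @ X0                                       # noiseless observations

norms = np.linalg.norm(Phi, axis=0)
print(np.corrcoef(Phi[0], Phi[1])[0, 1], norms.max() / norms.min())
```

Adjacent rows of this Phi are strongly correlated (mimicking nearby sensors) while column norms differ by orders of magnitude (mimicking deep versus superficial sources).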
We compared Type II, implemented via a simple iterative reweighted ℓ2 approach, with two different Type I schemes. The first is a homotopy continuation method using the Type I penalty g(I)(X) = Σ_i log(||x_i·||₂² + α), where α is gradually reduced to zero during the estimation process [5]. We have often found this to be the near optimal Type I approach on a variety of empirical tests. Secondly, we used the standard mixed-norm penalty g(I)(X) = ||X||_{1,2} = Σ_i ||x_i·||_2, which leads to a convex minimization problem that generalizes basis pursuit (or the lasso) to the t > 1 domain [6, 10].

While Type II displays invariance to W- and D-like transformations, Type I methods do not. Consequently, we examined two dictionary-standardization methods for Type I. First, we utilized basic ℓ2 column normalization, without which Type I will have difficulty with the vastly different column scalings of Φ. Secondly, we developed an algorithm to learn a transformed dictionary ÛΦΠ̂, with Û arbitrary, Π̂ diagonal, such that the combined dictionary has uncorrelated, unit ℓ2 norm rows, and unit ℓ2 norm columns (as discussed in Section 4.1). Figure 1 (Left) contains results from all of these variants, where it is clear that some compensation for the dictionary structure is essential for good recovery performance. We also note that Type II still outperforms Type I in all cases, suggesting that even after transformation of the latter, there is still residual structure in the MEG leadfield being exploited by Type II. This is a very reasonable assumption given that Φ will typically have strong column-wise correlations as well, which are more effectively modeled by right multiplication by some S. 
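The iterative reweighted ℓ2 implementation mentioned above is not spelled out in the text; the following is a generic FOCUSS-style reweighted ℓ2 sketch of the idea (a closed-form weighted minimum-norm solve, then a row-weight refresh), not the exact Type II update rule from [18]:

```python
import numpy as np

def reweighted_l2(Phi, Y, iters=50, floor=1e-8):
    """Row-sparse estimation via iterative reweighted l2.

    Each pass solves  min_X sum_i ||x_i.||^2 / w_i  s.t.  Phi X = Y,
    whose closed form is X = diag(w) Phi^T (Phi diag(w) Phi^T)^{-1} Y;
    the weights are then refreshed from the current row norms, so rows
    that stay small are driven toward zero.
    """
    m = Phi.shape[1]
    w = np.ones(m)
    for _ in range(iters):
        WPhiT = w[:, None] * Phi.T               # = diag(w) Phi^T
        X = WPhiT @ np.linalg.solve(Phi @ WPhiT, Y)
        w = np.linalg.norm(X, axis=1) + floor    # reweight by row norms
    return X

rng = np.random.default_rng(2)
n, m, t, k = 30, 60, 3, 5
Phi = rng.standard_normal((n, m))
X0 = np.zeros((m, t))
X0[rng.choice(m, k, replace=False)] = rng.standard_normal((k, t))
X_hat = reweighted_l2(Phi, Phi @ X0)
print(np.linalg.norm(X_hat - X0) / np.linalg.norm(X0))
```

The first pass (uniform weights) is the minimum Frobenius-norm solution; subsequent passes concentrate the estimate onto a sparse set of rows.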
As a final point, the Type II success probability does not go to zero even when d(X0) = 50, implying that in some cases it is able to find a number of nonzeros equal to the number of rows in Φ. This is possible because even with only t = 5 columns, the nonzero rows of X0 display somewhat limited sample correlation, and so exact support recovery is still possible. With t > 5 these sample correlations can be reduced further, allowing consistent support recovery when d(X0) > n (not shown).

To further test the ability of Type II to handle structure imposed by some Φ̃S, we performed a second experiment with explicitly controlled correlations among groups of columns. For each trial we generated a 50 × 100 Gaussian iid dictionary Φ̃. Correlations were then introduced using a block-diagonal S with 4 × 4 blocks created with iid entries drawn from a uniform distribution (between 0 and 1). The resulting Φ = Φ̃S was then scaled to have unit ℓ2 norm columns. We then generated a random x0 vector (t = 1 case) using iid Gaussian nonzero entries with d(x0) varied from 10 to 25 (with t = 1, we cannot expect to recover as many nonzeros as when t = 5). Signal vectors are computed as y = Φx0 or, for purposes of direct comparison with a canonical iid dictionary, y = Φ̃x0. 
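The coherent dictionary for this second experiment can be reproduced directly from the description above (block-diagonal S with 4 × 4 uniform blocks, then column normalization); the coherence check at the end is just an illustrative sanity test, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, b = 50, 100, 4

Phi_tilde = rng.standard_normal((n, m))      # canonical iid dictionary
S = np.zeros((m, m))                         # block-diagonal, 4 x 4 blocks
for j in range(0, m, b):
    S[j:j + b, j:j + b] = rng.uniform(0, 1, (b, b))
Phi = Phi_tilde @ S
Phi /= np.linalg.norm(Phi, axis=0)           # unit l2-norm columns

# Columns in the same block are positive mixtures of the same four base
# columns, so within-block coherence far exceeds cross-block coherence.
G = np.abs(Phi.T @ Phi)
blocks = np.repeat(np.arange(m // b), b)
same = (blocks[:, None] == blocks[None, :]) & ~np.eye(m, dtype=bool)
cross = blocks[:, None] != blocks[None, :]
print(G[same].mean(), G[cross].mean())
```

This gives groups of tightly clustered columns while leaving the cross-group geometry essentially as incoherent as the original Gaussian dictionary.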
We evaluated Type II and the Type I iterative reweighted ℓ1 minimization method from [4], which is guaranteed to do as well or better than standard ℓ1 norm minimization. Trial-averaged results using both Φ and Φ̃ are shown in Figure 1 (Right), where it is clear that while Type II performance is essentially unchanged, Type I performance degrades substantially.

[Figure 1: two panels plotting probability of success versus row-sparsity. Left legend: Type II; Type I, homotopy, norm; Type I, homotopy, invariant; Type I, basis pursuit, norm; Type I, basis pursuit, invariant. Right legend: Type II, iid Φ; Type II, coherent Φ; Type I, iid Φ; Type I, coherent Φ.]

Figure 1: Left: Probability of success recovering coefficient matrices with varying degrees of row-sparsity using an MEG leadfield as the dictionary. Two Type I methods were compared, a homotopy continuation method from [5] and a version of basis pursuit extended to the simultaneous sparse approximation problem by minimizing the ℓ1,2 mixed norm [6, 10]. Type I methods were compared using standard ℓ2 column normalization and a learned invariance transformation. Right: Probability of success recovering sparse vectors using a Gaussian iid dictionary Φ̃ and a coherent dictionary Φ with clustered columns. The Type I method was the iterative reweighted ℓ1 algorithm from [4].

6 Conclusion

When we are free to choose the basis vectors of an overcomplete signal dictionary, the sparse estimation problem is supported by strong analytical and empirical foundations. 
However, there are many applications where physical restrictions or other factors impose rigid constraints on the dictionary structure such that the assumptions of theoretical recovery guarantees are violated. Examples include model selection problems with correlated features, source localization, and compressive sensing with constrained measurement directions. This can have significant consequences depending on how the estimated coefficients will ultimately be utilized. For example, in the source localization problem, correlated dictionary columns may correspond with drastically different regions (e.g., brain areas), so recovering the exact sparsity profile can be important. Ideally we would like our recovery algorithms to display invariance, to the extent possible, to the actual structure of the dictionary. With typical Type I sparsity penalties this can be a difficult undertaking; however, with the natural dictionary dependence of the Type II penalty, to some extent it appears this structure can be accounted for, leading to more consistent performance across dictionary types.

Appendix

Here we provide brief proofs of several results from the paper. Some details have been omitted for space considerations.

Proof of Lemma 1: First we address invariance with respect to W. Obviously the equality constraint is unaltered by a full rank W, so it only remains to check that the dictionary-dependent penalty g(II) is invariant. However, since by standard determinant relationships log|WΦ̃DΓDΦ̃ᵀWᵀ| = log|W| |Φ̃DΓDΦ̃ᵀ| |Wᵀ| = log|Φ̃DΓDΦ̃ᵀ| + C, where C is an irrelevant constant for optimization purposes, this point is established. With respect to D, we re-parameterize the problem by defining X̃ ≜ DX and Γ̃ ≜ DΓD. 
It is then readily apparent that the penalty (6) satisfies

g(II)(X) ≡ min_{Γ⪰0} Tr[XᵀΓ⁻¹X] + log|Φ̃DΓDΦ̃ᵀ| = min_{Γ̃⪰0} Tr[X̃ᵀΓ̃⁻¹X̃] + log|Φ̃Γ̃Φ̃ᵀ|.   (12)

So we are effectively solving: min_X̃ g(II)(X̃) s.t. Φ̃DX0 = Φ̃X̃. □

Φ = [[ε, ε, 1, 1], [1, −1, 0, 0], [0, 0, ε, −ε]],  X(1) = [[1, 1], [1, −1], [0, 0], [0, 0]],  X(2) = [[0, 1], [0, −1], [ε, 0], [ε, 0]],   (14)

Proof of Lemma 2 and Corollary 1: Minimizing the Type II cost function can be accomplished equivalently by minimizing

L(Γ) ≜ Tr[Φt⁻¹X0X0ᵀΦᵀ(ΦΓΦᵀ)⁻¹] + log|ΦΓΦᵀ|,   (13)

over the non-negative diagonal matrix Γ (this follows from a duality principle in Type II models [18]). L(Γ) includes an observed covariance Φt⁻¹X0X0ᵀΦᵀ and a parameterized model covariance ΦΓΦᵀ, and is globally minimized with Γ* = t⁻¹diag[X0X0ᵀ] [17]. Moreover, if ΦΓ*Φᵀ is sufficiently close to t⁻¹ΦX0X0ᵀΦᵀ, meaning the off-diagonal elements of X0X0ᵀ are not too large, then it can be shown by differentiating along the direction between any arbitrary point Γ′ and Γ* that no local minima exist, leading to the first part of Lemma 2.

Regarding the second part, we now allow d(X0) to be arbitrary but require that X0X0ᵀ be diagonal (zero correlations). Using similar arguments as above, it is easily shown that any minimizing solution Γ* must satisfy ΦΓ*Φᵀ = Φt⁻¹X0X0ᵀΦᵀ. 
This equality can be viewed as n(n + 1)/2 linear\nequations (equal to the number of unique elements in an n\u00d7 n covariance matrix) and m unknowns,\nnamely, the diagonal elements of \u0393\u2217. Therefore, if n(n + 1)/2 > m this system of equations will\ntypically be overdetermined (e.g., if suitable randomness is present to avoid adversarial conditions)\nwith a unique solution. Moreover, because of the requirement that \u0393 be non-negative, it is likely that\na unique solution will exist in many cases where m is even greater than n(n + 1)/2 [2].\nFinally, we address Corollary 1. First, consider the case where t = 1, so X0 = x0. To satisfy the\nnow degenerate correlation condition, we must have d(x0) = 1. Even in this simple regime it can be\ndemonstrated that a unique minimum at x0 is possible iff h(z) = z based on Lemma 5 (below) and a\ncomplementary result in [17]. So the only Type I possibility is h(z) = z. A simple counterexample\nwith t = 2 serves to rule this selection out. Consider a dictionary \u03a6 and two coef\ufb01cient matrices\ngiven by\n\nIt is easily veri\ufb01ed that \u03a6X(1) = \u03a6X(2) and that X(1) = X0, the maximally row-sparse\nsolution. Computing the Type I cost for each with h(z) = z gives g(I)(X(1)) = 2\u221a2 and\ng(I)(X(2)) = 2(1 + \u01eb). Thus, if we allow \u01eb to be small, g(I)(X(2)) < g(I)(X(1)), so X(1) = X0\ncannot be the minimizing solution. Note that \u21132 column normalization will not change this\nconclusion since all columns of \u03a6 have equal norm already.\n(cid:4)\n\nProof of Lemma 4 and Corollary 2: For brevity, we will assume that h is concave and differentiable,\nas is typical of most sparsity penalties used in practice (the more general case follows with some\nadditional effort). This of course includes h(z) = z, which is both concave and convex, and leads\nto the \u21131 norm penalty. These results will now be demonstrated using a simple counterexample\nsimilar to the one above. 
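Since the counterexample in (14) rests on simple arithmetic, it can be checked numerically. The sketch below (using NumPy, with the hypothetical value $\epsilon = 0.1$, not prescribed by the paper) verifies that both coefficient matrices explain identical observations, while the Type I cost with $h(z) = z$, i.e., the sum of row $\ell_2$ norms, prefers the denser $X^{(2)}$:

```python
# Numerical check of the counterexample in (14): Phi @ X1 == Phi @ X2,
# yet the Type I cost with h(z) = z favors the non-row-sparse X2.
# eps = 0.1 is a hypothetical choice; any eps < sqrt(2) - 1 works.
import numpy as np

eps = 0.1
Phi = np.array([[eps, eps, 1.0, 1.0],
                [1.0, -1.0, 0.0, 0.0],
                [0.0, 0.0, eps, -eps]])
X1 = np.array([[1.0, 1.0],
               [1.0, -1.0],
               [0.0, 0.0],
               [0.0, 0.0]])   # X(1) = X0, the maximally row-sparse solution
X2 = np.array([[0.0, 1.0],
               [0.0, -1.0],
               [eps, 0.0],
               [eps, 0.0]])   # X(2), a feasible but denser alternative

# Both coefficient matrices yield the same observation matrix.
assert np.allclose(Phi @ X1, Phi @ X2)

def type1_cost(X):
    """Type I cost with h(z) = z: sum of l2 norms of the rows of X."""
    return np.linalg.norm(X, axis=1).sum()

c1, c2 = type1_cost(X1), type1_cost(X2)
assert np.isclose(c1, 2.0 * np.sqrt(2.0))   # g(I)(X(1)) = 2*sqrt(2)
assert np.isclose(c2, 2.0 * (1.0 + eps))    # g(I)(X(2)) = 2*(1 + eps)
assert c2 < c1  # the row-sparse generator is not the Type I minimizer
```

The inequality $2(1+\epsilon) < 2\sqrt{2}$ holds for any $\epsilon < \sqrt{2} - 1$, so the conclusion is not sensitive to the particular small $\epsilon$ chosen here.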
Assume we have the dictionary $\Phi$ from (14), and that $S = \{1, 2\}$ and $P = \{+1, +1\}$, which implies that any $x_0 \in \mathcal{X}(S, P)$ can be expressed as $x_0 = [\alpha_1, \alpha_2, 0, 0]^T$, for some $\alpha_1, \alpha_2 > 0$. We will now show that with any member from this set, there will not be a unique minimum to the Type I cost at $x_0$ for any possible concave, differentiable $h$.

First assume $\alpha_1 \geq \alpha_2$. Consider the alternative feasible solution $x^{(2)} = [(\alpha_1 - \alpha_2), 0, \epsilon \alpha_2, \epsilon \alpha_2]^T$. To check if this is a local minimum, we can evaluate the gradient of the penalty function $g^{(I)}(x)$ along the feasible region near $x^{(2)}$. Given $v = [1, 1, -\epsilon, -\epsilon]^T \in \mathrm{Null}(\Phi)$, this can be accomplished by computing $\partial g^{(I)}(x^{(2)} + \beta v)/\partial \beta = h'(|\alpha_1 - \alpha_2 + \beta|) + h'(|\beta|) + 2\epsilon h'(|\epsilon \alpha_2 - \epsilon \beta|)$. In the limit as $\beta \to 0$ (from the right or left), this expression will always be positive for $\epsilon < 0.5$ based on the concavity of $h$. Therefore, $x^{(2)}$ must be a minimum. By symmetry an equivalent argument can be made when $\alpha_2 \geq \alpha_1$. (In the special case where $\alpha_1 = \alpha_2$, there will actually exist two maximally sparse solutions, the generating $x_0$ and $x^{(2)}$.) It is also straightforward to verify analytically that iterative reweighted $\ell_1$ minimization will fail on this example when initialized at the minimum $\ell_1$ norm solution. It will always become trapped at $x^{(2)}$ after the first iteration, assuming $\alpha_1 \geq \alpha_2$, or at a symmetric local minimum otherwise. ∎

Proof of Lemma 5: This result can be shown by examining properties of various gradients along the feasible region, not unlike some of the analysis above, and then bounding the resultant quantity. We defer these details to a later publication. ∎

References

[1] S. Baillet, J.C. Mosher, and R.M.
Leahy, “Electromagnetic brain mapping,” IEEE Signal Processing Magazine, pp. 14–30, Nov. 2001.

[2] A.M. Bruckstein, M. Elad, and M. Zibulevsky, “A non-negative and sparse enough solution of an underdetermined linear system of equations is unique,” IEEE Trans. Information Theory, vol. 54, no. 11, pp. 4813–4820, Nov. 2008.

[3] E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.

[4] E. Candès, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2008.

[5] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2008.

[6] S.F. Cotter, B.D. Rao, K. Engan, and K. Kreutz-Delgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Trans. Signal Processing, vol. 53, no. 7, pp. 2477–2488, July 2005.

[7] D.L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proc. National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, March 2003.

[8] J.J. Fuchs, “On sparse representations in arbitrary redundant bases,” IEEE Trans. Information Theory, vol. 50, no. 6, pp. 1341–1344, June 2004.

[9] S. Ji, D. Dunson, and L. Carin, “Multi-task compressive sensing,” IEEE Trans. Signal Processing, vol. 57, no. 1, pp. 92–106, Jan. 2009.

[10] D.M. Malioutov, M. Çetin, and A.S. Willsky, “Sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Trans. Signal Processing, vol. 53, no. 8, pp.
3010–3022, August 2005.

[11] B.D. Rao, K. Engan, S.F. Cotter, J. Palmer, and K. Kreutz-Delgado, “Subset selection in noise based on diversity measure minimization,” IEEE Trans. Signal Processing, vol. 51, no. 3, pp. 760–770, March 2003.

[12] J. Sarvas, “Basic mathematical and electromagnetic concepts of the biomagnetic inverse problem,” Phys. Med. Biol., vol. 32, pp. 11–22, 1987.

[13] J.G. Silva, J.S. Marques, and J.M. Lemos, “Selecting landmark points for sparse manifold learning,” Advances in Neural Information Processing Systems 18, pp. 1241–1248, 2006.

[14] M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.

[15] J.A. Tropp, “Algorithms for simultaneous sparse approximation. Part II: Convex relaxation,” Signal Processing, vol. 86, pp. 589–602, April 2006.

[16] M.B. Wakin, M.F. Duarte, S. Sarvotham, D. Baron, and R.G. Baraniuk, “Recovery of jointly sparse signals from a few random projections,” Advances in Neural Information Processing Systems 18, pp. 1433–1440, 2006.

[17] D.P. Wipf, Bayesian Methods for Finding Sparse Representations, PhD Thesis, University of California, San Diego, 2006.

[18] D.P. Wipf and S. Nagarajan, “Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions,” J. Selected Topics in Signal Processing (Special Issue on Compressive Sensing), vol. 4, no. 2, April 2010.