{"title": "\u2113\u2080-norm Minimization for Basis Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 1513, "page_last": 1520, "abstract": "", "full_text": "\u21130-norm Minimization for Basis Selection\n\nDavid Wipf and Bhaskar Rao \u2217\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego, CA 92092\ndwipf@ucsd.edu, brao@ece.ucsd.edu\n\nAbstract\n\nFinding the sparsest, or minimum \u21130-norm, representation of a signal\ngiven an overcomplete dictionary of basis vectors is an important prob-\nlem in many application domains. Unfortunately, the required optimiza-\ntion problem is often intractable because there is a combinatorial increase\nin the number of local minima as the number of candidate basis vectors\nincreases. This de\ufb01ciency has prompted most researchers to instead min-\nimize surrogate measures, such as the \u21131-norm, that lead to more tractable\ncomputational methods. The downside of this procedure is that we have\nnow introduced a mismatch between our ultimate goal and our objective\nfunction. In this paper, we demonstrate a sparse Bayesian learning-based\nmethod of minimizing the \u21130-norm while reducing the number of trou-\nblesome local minima. Moreover, we derive necessary conditions for\nlocal minima to occur via this approach and empirically demonstrate that\nthere are typically many fewer for general problems of interest.\n\n1\n\nIntroduction\n\nSparse signal representations from overcomplete dictionaries \ufb01nd increasing relevance in\nmany application domains [1, 2]. The canonical form of this problem is given by,\n\nmin\n\nw\n\nkwk0,\n\ns.t. t = \u03a6w,\n\n(1)\n\nwhere \u03a6 \u2208 \u211cN \u00d7M is a matrix whose columns represent an overcomplete basis (i.e.,\nrank(\u03a6) = N and M > N), w is the vector of weights to be learned, and t is the sig-\nnal vector. The actual cost function being minimized represents the \u21130-norm of w (i.e., a\ncount of the nonzero elements in w). In this vein, we seek to \ufb01nd weight vectors whose\nentries are predominantly zero that nonetheless allow us to accurately represent t.\n\nWhile our objective function is not differentiable, several algorithms have nonetheless been\nderived that (i), converge almost surely to a solution that locally minimizes (1) and more\nimportantly (ii), when initialized suf\ufb01ciently close, converge to a maximally sparse solution\nthat also globally optimizes an alternate objective function. For convenience, we will refer\nthese approaches as local sparsity maximization (LSM) algorithms. For example, proce-\ndures that minimize \u2113p-norm-like diversity measures1 have been developed such that, if p is\nchosen suf\ufb01ciently small, we obtain a LSM algorithm [2, 3]. Likewise, a Gaussian entropy-\nbased LSM algorithm called FOCUSS has been developed and successfully employed to\n\n\u2217This work was supported by an ARCS Foundation scholarship, DiMI grant 22-8376 and Nissan.\n1Minimizing a diversity measure is often equivalent to maximizing sparsity.\n\n\fsolve Neuromagnetic imaging problems [4]. A similar algorithm was later discovered in\n[5] from the novel perspective of a Jeffrey\u2019s noninformative prior. While all of these meth-\nods are potentially very useful candidates for solving (1), they suffer from one signi\ufb01cant\ndrawback: as we have discussed in [6], every local minima of (1) is also a local minima to\nthe LSM algorithms.\nUnfortunately, there are many local minima to (1). 
Unfortunately, there are many local minima to (1). In fact, every basic feasible solution w∗ to t = Φw is such a local minimum.² To see this, note that the value of ‖w∗‖0 at such a solution is less than or equal to N. Any other feasible solution can be written as w∗ + αw′, where w′ ∈ Null(Φ). For simplicity, if we assume that every subset of N columns of Φ is linearly independent (the unique representation property, or URP), then w′ must necessarily have nonzero elements in locations that differ from those of w∗. Consequently, any other feasible solution in the neighborhood of w∗ satisfies ‖w∗‖0 < ‖w∗ + αw′‖0. This ensures that all such w∗ represent local minima to (1).

The number of basic feasible solutions is bounded between $\binom{M-1}{N} + 1$ and $\binom{M}{N}$; the exact number depends on t and Φ [4]. Regardless, when M ≫ N, we have a large number of local minima and, not surprisingly, we often converge to one of them using currently available LSM algorithms.

²A basic feasible solution is a solution with at most N nonzero entries.
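The combinatorics are easy to see directly. In the following sketch (sizes hypothetical), every support of size N yields a distinct basic feasible solution under the URP, and by the argument above each one is a local minimum of (1):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, M = 4, 8                                  # small illustrative sizes
Phi = rng.standard_normal((N, M))            # a generic Phi satisfies the URP
t = rng.standard_normal(N)

count = 0
for S in combinations(range(M), N):
    idx = list(S)
    w = np.zeros(M)
    w[idx] = np.linalg.solve(Phi[:, idx], t) # invertible N x N subsystem
    assert np.allclose(Phi @ w, t)           # feasible with at most N nonzeros
    count += 1
print(count)                                 # C(8, 4) = 70 basic feasible solutions
```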
One potential remedy is to employ a convex surrogate measure in place of the ℓ0-norm that leads to a more tractable optimization problem. The most common choice is the alternate norm ‖w‖1, which creates a unimodal optimization problem that can be solved via linear programming or interior point methods. The considerable price we must pay, however, is that the global minimum of this objective function need not coincide with the sparsest solution to (1).³ As such, we may fail to recover the maximally sparse solution regardless of the initialization we use (unlike with an LSM procedure).

In this paper, we demonstrate an alternative algorithm for solving (1) using a sparse Bayesian learning (SBL) framework. Our objective is twofold. First, we prove that, unlike minimum ℓ1-norm methods, the global minimum of the SBL cost function is achieved only at the minimum ℓ0-norm solution to t = Φw. Later, we show that this method is locally minimized only at a subset of basic feasible solutions and therefore has fewer local minima than current LSM algorithms.

³In very restrictive settings, it has been shown that the minimum ℓ1-norm solution can equal the minimum ℓ0-norm solution [7]. But in practical situations, this result often does not apply.

2 Sparse Bayesian Learning

Sparse Bayesian learning was initially developed as a means of performing robust regression using a hierarchical prior that, empirically, has been observed to encourage sparsity [8]. The most basic formulation proceeds as follows. We begin with an assumed likelihood model of our signal t given fixed weights w,

$$p(t \mid w) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\frac{1}{2\sigma^2}\|t - \Phi w\|^2\right). \qquad (2)$$

To provide a regularizing mechanism, we assume the parameterized weight prior

$$p(w; \gamma) = \prod_{i=1}^{M} (2\pi\gamma_i)^{-1/2} \exp\!\left(-\frac{w_i^2}{2\gamma_i}\right), \qquad (3)$$

where γ = [γ1, ..., γM]ᵀ is a vector of M hyperparameters controlling the prior variance of each weight. These hyperparameters (along with the error variance σ² if necessary) can be estimated from the data by marginalizing over the weights and then performing ML optimization. The marginalized pdf is given by

$$p(t; \gamma) = \int p(t \mid w)\, p(w; \gamma)\, dw = (2\pi)^{-N/2}\, |\Sigma_t|^{-1/2} \exp\!\left(-\frac{1}{2}\, t^T \Sigma_t^{-1} t\right), \qquad (4)$$

where Σt ≜ σ²I + ΦΓΦᵀ and we have introduced the notation Γ ≜ diag(γ).⁴ This procedure is referred to as evidence maximization or type-II maximum likelihood [8]. Equivalently, and more conveniently, we may instead minimize the cost function

$$L(\gamma; \sigma^2) = -\log p(t; \gamma) \propto \log|\Sigma_t| + t^T \Sigma_t^{-1} t \qquad (5)$$

using the EM algorithm-based update rules for the (k+1)-th iteration given by

$$\hat{w}_{(k)} = E\!\left[w \mid t; \gamma_{(k)}\right] = \left(\Phi^T \Phi + \sigma^2 \Gamma_{(k)}^{-1}\right)^{-1} \Phi^T t, \qquad (6)$$

$$\gamma_{(k+1)} = E\!\left[\mathrm{diag}(ww^T) \mid t; \gamma_{(k)}\right] = \mathrm{diag}\!\left[\hat{w}_{(k)} \hat{w}_{(k)}^T + \left(\sigma^{-2}\Phi^T\Phi + \Gamma_{(k)}^{-1}\right)^{-1}\right]. \qquad (7)$$

⁴We will sometimes use Γ and γ interchangeably when appropriate.

Upon convergence to some γML, we compute weight estimates as ŵ = E[w | t; γML], allowing us to generate t̂ = Φŵ ≈ t. We now quantify the relationship between this procedure and ℓ0-norm minimization.

3 ℓ0-norm minimization via SBL

Although SBL was initially developed in a regression context, it can nonetheless be easily adapted to handle (1) by fixing σ² to some ε and allowing ε → 0. To accomplish this we must re-express the SBL iterations to handle the low-noise limit. Applying standard matrix identities and the general result

$$\lim_{\varepsilon \to 0} U^T \left(\varepsilon I + U U^T\right)^{-1} = U^{\dagger}, \qquad (8)$$

we arrive at the modified update rules

$$\hat{w}_{(k)} = \Gamma_{(k)}^{1/2} \left(\Phi\, \Gamma_{(k)}^{1/2}\right)^{\dagger} t, \qquad (9)$$

$$\gamma_{(k+1)} = \mathrm{diag}\!\left(\hat{w}_{(k)} \hat{w}_{(k)}^T + \left[I - \Gamma_{(k)}^{1/2} \left(\Phi\, \Gamma_{(k)}^{1/2}\right)^{\dagger} \Phi\right] \Gamma_{(k)}\right), \qquad (10)$$

where (·)† denotes the Moore-Penrose pseudo-inverse. We observe that all ŵ(k) are feasible, i.e., t = Φŵ(k) for all γ(k).⁵ Also, upon convergence we can easily show that if γML is sparse, ŵ will also be sparse while maintaining feasibility. Thus, we have potentially found an alternative way of solving (1) that is readily computable via the modified iterations above. Perhaps surprisingly, these update rules are equivalent to the Gaussian entropy-based LSM iterations derived in [2, 5], with the exception of the $[I - \Gamma_{(k)}^{1/2}(\Phi\,\Gamma_{(k)}^{1/2})^{\dagger}\Phi]\Gamma_{(k)}$ term.

⁵This assumes that t is in the span of the columns of Φ associated with nonzero elements in γ, which will always be the case if t is in the span of Φ and all γ are initialized to nonzero values.
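Updates (9) and (10) translate nearly line for line into code. The sketch below is our own minimal rendering: the all-ones initialization, fixed iteration count, and clipping are illustrative choices (see footnote 5 regarding nonzero initialization).

```python
import numpy as np

def sbl_noiseless(Phi, t, iters=500):
    """Sketch of the noise-free SBL updates (9)-(10)."""
    N, M = Phi.shape
    gamma = np.ones(M)                       # nonzero init keeps t in the span
    for _ in range(iters):
        Ghalf = np.diag(np.sqrt(gamma))      # Gamma^{1/2}
        Pinv = np.linalg.pinv(Phi @ Ghalf)   # (Phi Gamma^{1/2})^dagger
        w = Ghalf @ Pinv @ t                 # update (9); t = Phi w holds
        R = (np.eye(M) - Ghalf @ Pinv @ Phi) @ np.diag(gamma)
        gamma = np.maximum(w**2 + np.diag(R), 0.0)  # update (10), clipped at 0
    return w, gamma
```

In exact arithmetic the bracketed term in (10) has a nonnegative diagonal, so the clipping merely guards against floating-point noise as hyperparameters collapse toward zero.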
A firm connection with ℓ0-norm minimization is realized when we consider the global minimum of L(γ; σ² = ε) in the limit as ε approaches zero. We now quantify this relationship via the following theorem, which extends results from [6].

Theorem 1. Let W0 denote the set of weight vectors that globally minimize (1). Furthermore, let W(ε) be defined as the set of weight vectors

$$\mathcal{W}(\varepsilon) = \left\{ w_{**} : w_{**} = \left(\Phi^T\Phi + \varepsilon \Gamma_{**}^{-1}\right)^{-1} \Phi^T t, \;\; \gamma_{**} = \arg\min_{\gamma} L(\gamma; \sigma^2 = \varepsilon) \right\}. \qquad (11)$$

Then in the limit as ε → 0, if w ∈ W(ε), then w ∈ W0.

A full proof of this result is available in [9]; however, we provide a brief sketch here. First, we know from [6] that every local minimum of L(γ; σ² = ε) is achieved at a basic feasible solution γ∗ (i.e., a solution with N or fewer nonzero entries), regardless of ε. Therefore, in our search for the global minimum, we need only examine the space of basic feasible solutions. As we allow ε to become sufficiently small, we can show that

$$L(\gamma_*; \sigma^2 = \varepsilon) = (N - \|\gamma_*\|_0)\log(\varepsilon) + O(1) \qquad (12)$$

at any such solution. This expression is minimized when ‖γ∗‖0 is as small as possible. A maximally sparse basic feasible solution, which we denote γ∗∗, can only occur with nonzero elements aligned with the nonzero elements of some w ∈ W0. In the limit as ε → 0, w∗∗ becomes feasible while maintaining the same sparsity profile as γ∗∗, leading to the stated result.

This result demonstrates that the SBL framework can provide an effective proxy to direct ℓ0-norm minimization. More importantly, we will now show that the limiting SBL cost function, which we henceforth denote

$$L(\gamma) \triangleq \lim_{\varepsilon \to 0} L(\gamma; \sigma^2 = \varepsilon) = \log\left|\Phi\Gamma\Phi^T\right| + t^T \left(\Phi\Gamma\Phi^T\right)^{-1} t, \qquad (13)$$

need not have the same problematic local-minima profile as other methods.

4 Analysis of Local Minima

Thus far, we have demonstrated that there is a close affiliation between the limiting SBL framework and the minimization problem posed by (1). We have not, however, provided any concrete reason why SBL should be preferred over current LSM methods of finding sparse solutions. In fact, this preference is not established until we carefully consider the problem of convergence to local minima.

As already mentioned, the problem with current methods of minimizing ‖w‖0 is that every basic feasible solution unavoidably becomes a local minimum. However, what if we could somehow eliminate all or most of these extrema? For example, consider the alternate objective function f(w) ≜ min(‖w‖0, N), leading to the optimization problem

$$\min_w \; f(w) \quad \text{s.t.} \quad t = \Phi w. \qquad (14)$$

While the global minimum remains unchanged, we observe that all local minima occurring at non-degenerate basic feasible solutions have been effectively removed.⁶ In other words, at any solution w∗ with N nonzero entries, we can always add a small component αw′ ∈ Null(Φ) without increasing f(w), since f(w) can never be greater than N. Therefore, we are free to move from basic feasible solution to basic feasible solution without increasing f(w). Also, the rare degenerate basic solutions that do remain, even if suboptimal, are sparser by definition. Therefore, locally minimizing our new problem (14) is clearly superior to locally minimizing (1). But how can we implement such a minimization procedure, even approximately, in practice?

⁶A degenerate basic feasible solution has strictly fewer than N nonzero entries; however, the vast majority of local minima are non-degenerate, containing exactly N nonzero entries.
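Before turning to that question, the barrier-free movement just described is easy to verify numerically. A minimal sketch (sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 8
Phi = rng.standard_normal((N, M))
w_star = np.zeros(M)
w_star[:N] = rng.standard_normal(N)       # N nonzeros: a non-degenerate BFS
t = Phi @ w_star

def f(w, tol=1e-12):
    return min(np.count_nonzero(np.abs(w) > tol), N)   # f(w) = min(||w||_0, N)

_, _, Vt = np.linalg.svd(Phi)
w_prime = Vt[N]                           # rows N..M-1 of Vt span Null(Phi)
w_moved = w_star + 0.1 * w_prime
print(np.allclose(Phi @ w_moved, t))      # True: perturbed point still feasible
print(f(w_star), f(w_moved))              # N N: the move costs nothing under f
```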
Although we cannot remove all non-degenerate local minima and still retain computational feasibility, it is possible to remove many of them, providing some measure of approximation to (14). This is effectively what is accomplished using SBL, as will be demonstrated below. Specifically, we will derive necessary conditions required for a non-degenerate basic feasible solution to represent a local minimum of L(γ). We will then show that these conditions are frequently not satisfied, implying that there are potentially many fewer local minima. Thus, locally minimizing L(γ) comes closer to (locally) minimizing (14) than current LSM methods, which in turn is closer to globally minimizing ‖w‖0.

4.1 Necessary Conditions for Local Minima

As previously stated, all local minima of L(γ) must occur at basic feasible solutions γ∗. Now suppose that we have found a (non-degenerate) γ∗ with associated w∗ computed via (9), and we would like to assess whether or not it is a local minimum of our SBL cost function. For convenience, let w̃ denote the N nonzero elements of w∗ and Φ̃ the associated columns of Φ (therefore, t = Φ̃w̃ and w̃ = Φ̃⁻¹t). Intuitively, it would seem likely that if we are not at a true local minimum, then there must exist at least one additional column of Φ not in Φ̃, e.g., some x, that is somehow aligned with, or in some respect similar to, t. Moreover, the significance of this potential alignment must be assessed relative to Φ̃. But how do we quantify this relationship for the purposes of analyzing local minima?

As it turns out, a useful metric for comparison is realized when we decompose x with respect to Φ̃, which forms a basis in ℝᴺ under the URP assumption. For example, we may form the decomposition x = Φ̃ṽ, where ṽ is a vector of weights analogous to w̃. As will be shown below, the similarity required between x and t (needed for establishing the existence of a local minimum) may then be realized by comparing the respective weights ṽ and w̃. In more familiar terms, this is analogous to suggesting that similar signals have similar Fourier expansions. Loosely, we may expect that if ṽ is 'close enough' to w̃, then x is sufficiently close to t (relative to all other columns in Φ̃) that we are not at a local minimum.
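This comparison is cheap to carry out: expand each excluded column in the basis Φ̃ and inspect the coefficients. A small sketch (sizes and the chosen support are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 8
Phi = rng.standard_normal((N, M))
Phi_tilde = Phi[:, :N]                     # columns of the candidate solution
w_tilde = rng.standard_normal(N)
t = Phi_tilde @ w_tilde                    # hence w_tilde = Phi_tilde^{-1} t

for j in range(N, M):                      # each column x not in Phi_tilde
    v_tilde = np.linalg.solve(Phi_tilde, Phi[:, j])  # x = Phi_tilde v_tilde
    print(j, np.sign(v_tilde) == np.sign(w_tilde))   # sign agreement pattern
```

The theorem below turns this loose notion of similarity between ṽ and w̃ into an exact test.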
We formalize this idea via the following theorem:

Theorem 2. Let Φ satisfy the URP and let γ∗ represent a vector of hyperparameters with N and only N nonzero entries and associated basic feasible solution w̃ = Φ̃⁻¹t. Let X denote the set of M − N columns of Φ not included in Φ̃, and let V be the set of weights given by V = {ṽ : ṽ = Φ̃⁻¹x, x ∈ X}. Then γ∗ is a local minimum of L(γ) only if

$$\sum_{i \neq j} \frac{\tilde{v}_i \tilde{v}_j}{\tilde{w}_i \tilde{w}_j} < 0 \quad \forall\, \tilde{v} \in \mathcal{V}. \qquad (15)$$

Proof: If γ∗ truly represents a local minimum of our cost function, then the following condition must hold for all x ∈ X:

$$\frac{\partial L(\gamma_*)}{\partial \gamma_x} \geq 0, \qquad (16)$$

where γx denotes the hyperparameter corresponding to the basis vector x. In words, we cannot reduce L(γ∗) along a positive gradient because this would push γx below zero. Using the matrix inversion lemma, the determinant identity, and some algebraic manipulations, we arrive at the expression

$$\frac{\partial L(\gamma_*)}{\partial \gamma_x} = \frac{x^T B x}{1 + \gamma_x x^T B x} - \left(\frac{t^T B x}{1 + \gamma_x x^T B x}\right)^2, \qquad (17)$$

where B ≜ (Φ̃Γ̃Φ̃ᵀ)⁻¹. Since we have assumed that we are at a local minimum, it is straightforward to show that Γ̃ = diag(w̃)², leading to the expression

$$B = \tilde{\Phi}^{-T}\, \mathrm{diag}(\tilde{w})^{-2}\, \tilde{\Phi}^{-1}. \qquad (18)$$

Substituting this expression into (17) and evaluating at the point γx = 0, the gradient reduces to

$$\frac{\partial L(\gamma_*)}{\partial \gamma_x} = \tilde{v}^T \left(\mathrm{diag}\!\left(\tilde{w}^{-1}\tilde{w}^{-T}\right) - \tilde{w}^{-1}\tilde{w}^{-T}\right) \tilde{v}, \qquad (19)$$

where w̃⁻¹ ≜ [w̃₁⁻¹, ..., w̃ₙ⁻¹]ᵀ. This leads directly to the stated theorem. □

This theorem provides a useful picture of what is required for local minima to exist and, more importantly, why many basic feasible solutions are not local minima. Moreover, there are several convenient ways in which we can interpret this result to accommodate a more intuitive perspective.

4.2 A Simple Geometric Interpretation

In general terms, if the signs of each of the elements in a given ṽ match up with w̃, then the specified condition will be violated and we cannot be at a local minimum. We can illustrate this geometrically as follows.

To begin, we note that our cost function L(γ) is invariant with respect to reflections of any basis vectors about the origin, i.e., we can multiply any column of Φ by −1 and the cost function does not change. Returning to a candidate local minimum with associated Φ̃, we may therefore assume, without loss of generality, that Φ̃ ≡ Φ̃ diag(sgn(w̃)), giving us the decomposition t = Φ̃w̃, w̃ > 0. Under this assumption, we see that t is located in the convex cone formed by the columns of Φ̃. We can infer that if any x ∈ X (i.e., any column of Φ not in Φ̃) lies in this convex cone, then the associated coefficients ṽ must all be positive by definition (likewise, by a similar argument, any x in the convex cone of −Φ̃ leads to the same result). Consequently, Theorem 2 ensures that we are not at a local minimum. The simple 2D example shown in Figure 1 helps to illustrate this point.

[Figure 1 here: two 2D panels showing the vectors t, φ1, φ2, and x.]

Figure 1: 2D example with a 2 × 3 dictionary Φ (i.e., N = 2 and M = 3) and a basic feasible solution using the columns Φ̃ = [φ1 φ2]. Left: In this case, x = φ3 does not penetrate the convex cone containing t, and we do not satisfy the conditions of Theorem 2. This configuration does represent a minimizing basic feasible solution. Right: Now x is in the cone and therefore we know that we are not at a local minimum; but this configuration does represent a local minimum to current LSM methods.
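The figure's two configurations can also be checked against (15) directly. In this sketch the vectors are our own stand-ins for those drawn in Figure 1; theorem2_sum evaluates the left-hand side of (15) for a single candidate column x:

```python
import numpy as np

def theorem2_sum(Phi_tilde, t, x):
    """Evaluate sum_{i != j} v_i v_j / (w_i w_j) from Theorem 2 for one x."""
    w = np.linalg.solve(Phi_tilde, t)       # \tilde{w}
    v = np.linalg.solve(Phi_tilde, x)       # \tilde{v}
    r = v / w
    return np.sum(r)**2 - np.sum(r**2)      # identity for the off-diagonal sum

phi1, phi2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Phi_tilde = np.column_stack([phi1, phi2])
t = np.array([0.6, 0.8])                    # t inside the cone of phi1, phi2

x_left = np.array([-0.8, 0.6])              # left panel: x outside the cone
x_right = np.array([0.9, 0.45])             # right panel: x inside the cone

print(theorem2_sum(Phi_tilde, t, x_left))   # -2.0  (condition (15) holds)
print(theorem2_sum(Phi_tilde, t, x_right))  #  1.69 (violated: not a local min)
```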
Alternatively, we can cast this geometric perspective in terms of relative cone sizes. For example, let C_Φ̃ represent the convex cone (and its reflection) formed by Φ̃. Then we are not at a local minimum of L(γ) if there exists a second convex cone C, formed from a subset of columns of Φ, such that t ∈ C ⊂ C_Φ̃, i.e., C is a tighter cone containing t. In Figure 1 (right), we obtain a tighter cone by swapping x for φ2.

While certainly useful, we must emphasize that in higher dimensions these geometric conditions are much weaker than (15): even if no x lies in the convex cone of Φ̃, we still may not be at a local minimum. In fact, to guarantee a local minimum, all x must be reasonably far from this cone, as quantified by (15). Of course, the ultimate reduction in local minima from the $\binom{M-1}{N} + 1$ to $\binom{M}{N}$ bounds depends on the distribution of basis vectors in t-space. In general, it is difficult to quantify this reduction except in a few special cases.⁷ However, we will now proceed to empirically demonstrate that the overall reduction in local minima is substantial when the basis vectors are randomly distributed.

⁷For example, in the special case where t is proportional to a single column of Φ, we can show that the number of local minima reduces from $\binom{M-1}{N} + 1$ to 1, i.e., we are left with a single minimum.

5 Empirical Comparisons

To show that the potential reduction in local minima derived above translates into concrete results, we conducted a simulation study using randomized basis vectors distributed isometrically in t-space. Randomized dictionaries are of interest in signal processing and other disciplines [2, 7] and represent a viable benchmark for testing basis selection methods. Moreover, we have performed analogous experiments with other dictionary types (such as pairs of orthobases), leading to similar results (see [9] for some examples).

Our goal was to demonstrate that current LSM algorithms often converge to local minima that do not exist in the SBL cost function. To accomplish this, we repeated the following procedure for dictionaries of various sizes. First, we generate a random N × M Φ whose columns are each drawn uniformly from a unit sphere. Sparse weight vectors w0 are randomly generated with ‖w0‖0 = 7 (and uniformly distributed amplitudes on the nonzero components). The vector of target values is then computed as t = Φw0. The LSM algorithm is then presented with t and Φ and attempts to learn the minimum ℓ0-norm solution. The experiment is repeated a sufficient number of times that we collect 1000 examples where the LSM algorithm converges to a (suboptimal) local minimum. In each of these cases, we check whether the condition stipulated by Theorem 2 applies, allowing us to determine whether the given solution is also a local minimum of the SBL cost function. The results are contained in Table 1 for the FOCUSS LSM algorithm.

M/N                  | 1.3  | 1.6  | 2.0  | 2.4  | 3.0
SBL Local Minimum %  | 4.9% | 4.0% | 3.2% | 2.3% | 1.6%

Table 1: Given 1000 trials where FOCUSS has converged to a suboptimal local minimum, we tabulate the percentage of times the local minimum is also a local minimum to SBL. M/N refers to the overcompleteness ratio of the dictionary used, with N fixed at 20. Results using other algorithms are similar.

We note that the larger the overcompleteness ratio M/N, the larger the total number of LSM local minima (via the bounds presented earlier); however, there also appears to be a greater probability that SBL can avoid any given one.
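A single trial of this pipeline can be sketched as follows, reusing the focuss and theorem2_sum sketches given earlier; the nonzero-detection threshold and amplitude range are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, k0 = 20, 40, 7                        # M/N = 2.0 and ||w0||_0 = 7

Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)          # columns uniform on the unit sphere
w0 = np.zeros(M)
support = rng.choice(M, size=k0, replace=False)
w0[support] = rng.uniform(0.5, 1.5, size=k0)    # uniform nonzero amplitudes
t = Phi @ w0

w_hat = focuss(Phi, t)                      # LSM attempt at the sparsest solution
if np.count_nonzero(np.abs(w_hat) > 1e-6) > k0:  # a suboptimal local minimum
    sel = np.argsort(-np.abs(w_hat))[:N]    # its N active columns
    sbl_min = all(theorem2_sum(Phi[:, sel], t, Phi[:, j]) < 0
                  for j in range(M) if j not in sel)
    print("necessary condition for an SBL local minimum:", sbl_min)
```

Since (15) is only a necessary condition, a True result here is consistent with, but does not prove, an SBL local minimum; a False result certifies that SBL is not trapped at that point.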
In many cases where we found that SBL was not locally minimized, we initialized the SBL algorithm at this location and observed whether or not it converged to the optimal solution. In roughly 50% of these cases, it escaped to find the maximally sparse solution. The remaining times, it did escape in accordance with theory; however, it converged to another local minimum. In contrast, when we initialized other LSM algorithms at an SBL local minimum, we always remained trapped, as expected.

6 Discussion

In practice, we have consistently observed that SBL outperforms current LSM algorithms in finding maximally sparse solutions (e.g., see [9]). The results of this paper provide a very plausible explanation for this improved performance: conventional LSM procedures are very likely to converge to local minima that do not exist in the SBL landscape. However, it may still be unclear exactly why this happens. In conclusion, we give a brief explanation that provides insight into this issue.

Consider the canonical FOCUSS LSM algorithm or the Figueiredo algorithm from [5] (with σ² fixed to zero, the Figueiredo algorithm is actually equivalent to the FOCUSS algorithm). These methods essentially solve the problem

$$\min_w \; \sum_{i=1}^{M} \log|w_i| \quad \text{s.t.} \quad t = \Phi w, \qquad (20)$$

where the objective function is proportional to the Gaussian entropy measure. In contrast, we can show that, up to a scale factor, any minimum of L(γ) must also be a minimum of

$$\min_{\gamma} \; \sum_{i=1}^{N} \log \lambda_i(\gamma) \quad \text{s.t.} \quad \gamma \in \Omega_{\gamma}, \qquad (21)$$

where λi(γ) is the i-th eigenvalue of ΦΓΦᵀ and Ωγ is the convex set {γ : tᵀ(ΦΓΦᵀ)⁻¹t ≤ 1, γ ≥ 0}.

In both instances, we are minimizing a Gaussian entropy measure over a convex constraint set. The crucial difference resides in the particular parameterization applied to this measure. In (20), we see that if any subset of the |wi| becomes significantly small (e.g., as we approach a basic feasible solution), we enter the basin of a local minimum because the associated log|wi| terms become enormously negative; hence the one-to-one correspondence between basic feasible solutions and local minima of the LSM algorithms.

In contrast, when working with (21), many of the γi may approach zero without becoming trapped, as long as ΦΓΦᵀ remains reasonably well conditioned. In other words, since Φ is overcomplete, up to M − N of the γi can be zero while ΦΓΦᵀ still maintains a full set of nonzero eigenvalues, so no term in the summation is driven towards minus infinity as occurred above. Thus, in many instances we can move from one basic feasible solution to another while still reducing our objective function. It is in this respect that SBL approximates the minimization of the alternative objective posed by (14).
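This contrast is simple to observe numerically. In the sketch below (sizes hypothetical), zeroing M − N of the hyperparameters leaves ΦΓΦᵀ full rank, so the objective in (21) stays finite, whereas a single shrinking weight already sends the objective in (20) toward minus infinity:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 8
Phi = rng.standard_normal((N, M))

gamma = np.zeros(M)
gamma[:N] = 1.0                             # M - N hyperparameters exactly zero
lam = np.linalg.eigvalsh(Phi @ np.diag(gamma) @ Phi.T)
print(np.sum(np.log(lam)))                  # finite: all N eigenvalues nonzero

w = np.ones(M)
w[-1] = 1e-12                               # one small weight near a BFS
print(np.sum(np.log(np.abs(w))))            # about -27.6 and diverging
```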
References

[1] S.S. Chen, D.L. Donoho, and M.A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1999.

[2] B.D. Rao and K. Kreutz-Delgado, "An affine scaling methodology for best basis selection," IEEE Transactions on Signal Processing, vol. 47, no. 1, pp. 187–200, January 1999.

[3] R.M. Leahy and B.D. Jeffs, "On the design of maximally sparse beamforming arrays," IEEE Transactions on Antennas and Propagation, vol. 39, no. 8, pp. 1178–1187, August 1991.

[4] I.F. Gorodnitsky and B.D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Transactions on Signal Processing, vol. 45, no. 3, pp. 600–616, March 1997.

[5] M.A.T. Figueiredo, "Adaptive sparseness using Jeffreys prior," Neural Information Processing Systems, vol. 14, pp. 697–704, 2002.

[6] D.P. Wipf and B.D. Rao, "Sparse Bayesian learning for basis selection," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2153–2164, 2004.

[7] D.L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proc. National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, March 2003.

[8] M.E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.

[9] D.P. Wipf and B.D. Rao, "Some results on sparse Bayesian learning," ECE Department Technical Report, University of California, San Diego, 2005.