{"title": "From Bayesian Sparsity to Gated Recurrent Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 5554, "page_last": 5564, "abstract": "The iterations of many first-order algorithms, when applied to minimizing common regularized regression functions, often resemble neural network layers with pre-specified weights.  This observation has prompted the development of learning-based approaches that purport to replace these iterations with enhanced surrogates forged as DNN models from available training data.  For example, important NP-hard sparse estimation problems have recently benefitted from this genre of upgrade, with simple feedforward or recurrent networks ousting proximal gradient-based iterations.  Analogously, this paper demonstrates that more powerful Bayesian algorithms for promoting sparsity, which rely on complex multi-loop majorization-minimization techniques, mirror the structure of more sophisticated long short-term memory (LSTM) networks, or alternative gated feedback networks previously designed for sequence prediction.  As part of this development, we examine the parallels between latent variable trajectories operating across multiple time-scales during optimization, and the activations within deep network structures designed to adaptively model such characteristic sequences.  The resulting insights lead to a novel sparse estimation system that, when granted training data, can estimate optimal solutions efficiently in regimes where other algorithms fail, including practical direction-of-arrival (DOA) and 3D geometry recovery problems.   
The underlying principles we expose are also suggestive of a learning process for a richer class of multi-loop algorithms in other domains.", "full_text": "From Bayesian Sparsity to Gated Recurrent Nets\n\nHao He\n\nMassachusetts Institute of Technology\n\nBo Xin\n\nMicrosoft Research, Beijing, China\n\nhaohe@mit.edu\n\njimxinbo@gmail.com\n\nSatoshi Ikehata\n\nNational Institute of Informatics\nsatoshi.ikehata@gmail.com\n\nDavid Wipf\n\nMicrosoft Research, Beijing, China\n\ndavidwipf@gmail.com\n\nAbstract\n\nThe iterations of many \ufb01rst-order algorithms, when applied to minimizing common\nregularized regression functions, often resemble neural network layers with pre-\nspeci\ufb01ed weights. This observation has prompted the development of learning-\nbased approaches that purport to replace these iterations with enhanced surrogates\nforged as DNN models from available training data. For example, important NP-\nhard sparse estimation problems have recently bene\ufb01tted from this genre of upgrade,\nwith simple feedforward or recurrent networks ousting proximal gradient-based\niterations. Analogously, this paper demonstrates that more powerful Bayesian\nalgorithms for promoting sparsity, which rely on complex multi-loop majorization-\nminimization techniques, mirror the structure of more sophisticated long short-term\nmemory (LSTM) networks, or alternative gated feedback networks previously\ndesigned for sequence prediction. As part of this development, we examine the\nparallels between latent variable trajectories operating across multiple time-scales\nduring optimization, and the activations within deep network structures designed\nto adaptively model such characteristic sequences. The resulting insights lead to\na novel sparse estimation system that, when granted training data, can estimate\noptimal solutions ef\ufb01ciently in regimes where other algorithms fail, including\npractical direction-of-arrival (DOA) and 3D geometry recovery problems. 
The\nunderlying principles we expose are also suggestive of a learning process for a\nricher class of multi-loop algorithms in other domains.\n\n1 Introduction\n\nMany practical iterative algorithms for minimizing an energy function Ly(x), parameterized by some\nvector y, adopt the updating prescription\n\nx(t+1) = f(Ax(t) + By),    (1)\n\nwhere t is the iteration count, A and B are \ufb01xed matrices/\ufb01lters, and f is a point-wise nonlinear\noperator. When we treat By as a bias or exogenous input, then the progression of these iterations\nthrough time resembles activations passing through the layers (indexed by t) of a deep neural network\n(DNN) [20, 30, 34, 38]. This naturally raises the question: If we have access to an ensemble of pairs\n{y, x\u2217}, where x\u2217 = arg min_x Ly(x), can we train an appropriately structured DNN to produce a\nminimum of Ly(x) when presented with an arbitrary new y as input? If A and B are \ufb01xed for all t,\nthis process can be interpreted as training a recurrent neural network (RNN), while if they vary, a\ndeep feedforward network with independent weights on each layer is a more apt description.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAlthough many of our conclusions may ultimately have broader implications, in this work we focus\non minimizing the ubiquitous sparse estimation problem\n\nLy(x) = \u2016y \u2212 \u03a6x\u2016_2^2 + \u03bb\u2016x\u2016_0,    (2)\n\nwhere \u03a6 \u2208 R^(n\u00d7m) is an overcomplete matrix of feature vectors, \u2016\u00b7\u2016_0 is the \u21130 norm equal to a count\nof the nonzero elements in a vector, and \u03bb > 0 is a trade-off parameter. Although crucial to many\napplications [2, 9, 13, 17, 23, 27], solving (2) is NP-hard, and therefore ef\ufb01cient approximations are\nsought. 
Popular examples with varying degrees of computational overhead include convex relaxations\nsuch as \u21131-norm regularization [4, 8, 32] and many \ufb02avors of iterative hard-thresholding (IHT) [5, 6].\nIn most cases, these approximate algorithms can be implemented via (1), where A and B are\nfunctions of \u03a6, and the nonlinearity f is, for example, a hard-thresholding operator for IHT or\nsoft-thresholding for convex relaxations. However, the Achilles\u2019 heel of all these approaches is that\nthey will generally not converge to good approximate minimizers of (2) if \u03a6 has columns with a high\ndegree of correlation [5, 8], which is unfortunately often the case in practice [35].\nTo mitigate the effects of such correlations, we could leverage the aforementioned correspondence\nwith common DNN structures to learn something like a correlation-invariant algorithm or update\nrules [38], although in this scenario our starting point would be an algorithmic format with known\nde\ufb01ciencies. But if our ultimate goal is to learn a new sparse estimation algorithm that ef\ufb01ciently\ncompensates for structure in \u03a6, then it seems reasonable to invoke iterative algorithms known a priori\nto handle such correlations directly as our template for learned network layers. One important example\nis sparse Bayesian learning (SBL) [33], which has been shown to solve (2) using a principled, multi-loop\nmajorization-minimization approach [22] even in cases where \u03a6 displays strong correlations\n[35]. Herein we demonstrate that, when judiciously unfolded, SBL iterations can be formed into\nvariants of long short-term memory (LSTM) cells, one of the more popular recurrent deep neural\nnetwork architectures [21], or gated extensions thereof [12]. The resulting network dramatically\noutperforms existing methods in solving (2) with a minimal computational budget. 
Our high-level\ncontributions can be summarized as follows:\n\n\u2022 Quite surprisingly, we demonstrate that the SBL objective, which explicitly compensates for\ncorrelated dictionaries, can be optimized using iteration structures that map directly to popular\nLSTM cells despite its radically different origin. This association signi\ufb01cantly broadens recent\nwork connecting elementary, one-step iterative sparsity algorithms like (1) with simple recurrent\nor feedforward deep network architectures [20, 30, 34, 38].\n\u2022 At its core, any SBL algorithm requires coordinating inner- and outer-loop computations that\nproduce expensive latent posterior variances (or related, derived quantities) and optimized\ncoef\ufb01cient estimates respectively. Although this process can in principle be accommodated via\ncanonical LSTM cells, such an implementation will enforce that computation of latent variables\nrigidly map to prede\ufb01ned subnetworks corresponding with various gating structures, ultimately\nadministering a \ufb01xed schedule of switching between loops. To provide greater \ufb02exibility in\ncoordinating inner- and outer-loops, we propose a richer gated-feedback LSTM structure for\nsparse estimation.\n\u2022 We achieve state-of-the-art performance on several empirical tasks, including direction-of-\narrival (DOA) estimation [28] and 3D geometry recovery via photometric stereo [37]. In\nthese and other cases, our approach produces higher accuracy estimates at a fraction of the\ncomputational budget. These results are facilitated by a novel online data generation process.\n\u2022 Although learning-to-learn style approaches [1, 20, 30, 34] have been commonly applied to\nrelatively simple gradient descent optimization templates, this is the \ufb01rst successful attempt\nwe are aware of to learn a complex, multi-loop, majorization-minimization algorithm [22]. 
We\nenvision that such a strategy can have wide-ranging implications beyond the sparse estimation\nproblems explored herein given that it is often not obvious how to optimally tune loop execution\nto balance both complexity and estimation accuracy in practice.\n\n2 Connecting SBL and LSTM Networks\n\nThis section \ufb01rst reviews the basic SBL model, followed by an algorithmic characterization of how\ncorrelation structure can be handled during sparse estimation. Later we derive specialized SBL update\nrules that reveal a close association with LSTM cells.\n\n2.1 Original SBL Model\n\nGiven an observed vector y \u2208 R^n and feature dictionary \u03a6 \u2208 R^(n\u00d7m), SBL assumes the Gaussian\nlikelihood model and a parameterized zero-mean Gaussian prior for the unknown coef\ufb01cients x \u2208 R^m\ngiven by\n\np(y|x) \u221d exp[\u2212(1/(2\u03bb)) \u2016y \u2212 \u03a6x\u2016_2^2]  and  p(x; \u03b3) \u221d exp[\u2212(1/2) x\u22a4\u0393^(\u22121)x],  \u0393 \u225c diag[\u03b3],    (3)\n\nwhere \u03bb > 0 is a \ufb01xed variance factor and \u03b3 denotes a vector of unknown hyperparameters [33].\nBecause both likelihood and prior are Gaussian, the posterior p(x|y; \u03b3) is also Gaussian, with mean\n\u02c6x satisfying\n\n\u02c6x = \u0393\u03a6\u22a4\u03a3_y^(\u22121)y,  with  \u03a3_y \u225c \u03a6\u0393\u03a6\u22a4 + \u03bbI.    (4)\n\nGiven the left-hand-side multiplication by \u0393 in (4), \u02c6x will have the same sparsity pro\ufb01le or support\npattern as \u03b3, meaning that the locations of zero-valued elements will align or supp[\u02c6x] = supp[\u03b3].\nUltimately then, the SBL strategy shifts from directly searching for some optimally sparse \u02c6x, to\nan optimally sparse \u03b3. 
For this purpose we marginalize over x (treating it initially as hidden or\nnuisance data) and then maximize the resulting type-II likelihood function with respect to \u03b3 [26].\nConveniently, the resulting convolution-of-Gaussians integral is available in closed-form [33] such\nthat we can equivalently minimize the negative log-likelihood\n\nL(\u03b3) = \u2212 log \u222b p(y|x) p(x; \u03b3) dx \u2261 y\u22a4\u03a3_y^(\u22121)y + log |\u03a3_y|.    (5)\n\nGiven an optimal \u03b3 so obtained, we can compute the posterior mean estimator \u02c6x via (4). Equivalently,\nthis same posterior mean estimator can be obtained by an iterative reweighted \u21131 process described\nnext that exposes subtle yet potent sparsity-promotion mechanisms.\n\n2.2 Iterative Reweighted \u21131 Implementation\n\nAlthough not originally derived this way, SBL can be implemented using a modi\ufb01ed form of iterative\nreweighted \u21131-norm optimization that exposes its agency for producing sparse estimates. In general,\nif we replace the \u21130 norm from (2) with any smooth approximation g(|x|), where g is a concave,\nnon-decreasing function and | \u00b7 | applies elementwise, then cost function descent^1 can be guaranteed\nusing iterations of the form [36]\n\nx(t+1) \u2190 arg min_x (1/2)\u2016y \u2212 \u03a6x\u2016_2^2 + \u03bb \u2211_i w_i(t) |x_i|,    w_i(t+1) \u2190 \u2202g(u)/\u2202u_i evaluated at u_i = |x_i(t+1)|, \u2200i.    (6)\n\n^1 Or global convergence to some stationary point with mild additional assumptions [31].\n\n
This process can be viewed as a multi-loop, majorization-minimization algorithm [22] (a generalization\nof the EM algorithm [15]), whereby the inner-loop involves computing x(t+1) by minimizing a\n\ufb01rst-order, upper-bounding approximation (1/2)\u2016y \u2212 \u03a6x\u2016_2^2 + \u03bb \u2211_i w_i(t) |x_i|, while the outer-loop updates\nthe bound/majorizer itself as parameterized by the weights w(t+1). Obviously, if g(u) = u, then\nw(t) = 1 for all t, and (6) reduces to the Lasso objective for \u21131 norm regularized sparse regression\n[32], and only a single iteration is required. However, one popular non-trivial instantiation of this\napproach assumes g(u) = \u2211_i log(u_i + \u03b5) with \u03b5 > 0 a user-de\ufb01ned parameter [10]. The corresponding\nweights then become w_i(t+1) = (|x_i(t+1)| + \u03b5)^(\u22121), and we observe that once any particular\nx_i(t+1) becomes large, the corresponding weight becomes small and at the next iteration a weaker\npenalty will be applied. This prevents the overshrinkage of large coef\ufb01cients, a well-known criticism\nof \u21131 norm penalties [16].\nIn the context of SBL, there is no closed-form w_i(t+1) update except in special cases. However, if we\nallow for additional latent structure, which we later show is akin to the memory unit of LSTM cells, a\nviable recurrency emerges for computing these weights and elucidating their effectiveness in dealing\nwith correlated dictionaries. In particular we have:\n\nProposition 1. 
If weights w(t+1) satisfy\n\n(w_i(t+1))^2 = min_{z : supp[z] \u2286 supp[\u03b3(t)]} [ (1/\u03bb)\u2016\u03c6_i \u2212 \u03a6z\u2016_2^2 + \u2211_{j \u2208 supp[\u03b3(t)]} z_j^2 / \u03b3_j(t+1) ]    (7)\n\nfor all i, then the iterations (6), with \u03b3_j(t+1) = [w_j(t)]^(\u22121) |x_j(t+1)|, are guaranteed to reduce or leave\nunchanged the SBL objective (5). Also, at each iteration, \u03b3(t+1) and x(t+1) will satisfy (4).\n\nUnlike the traditional sparsity penalty mentioned above, with SBL we see that the i-th weight\nw_i(t+1) is not dependent solely on the value of the i-th coef\ufb01cient x_i(t+1), but rather on all the latent\nhyperparameters \u03b3(t+1) and therefore ultimately prior-iteration weights w(t) as well. Moreover,\nbecause the fate of each sparse coef\ufb01cient is linked together, correlation structure can be properly\naccounted for in a progressive fashion.\nMore concretely, from (7) it is immediately apparent that if \u03c6_i \u2248 \u03c6_i\u2032 for some indices i and i\u2032\n(meaning a large degree of correlation), then it is highly likely that w_i(t+1) \u2248 w_i\u2032(t+1). This is simply\nbecause the regularized residual error that emerges from solving (7) will tend to be quite similar\nwhen \u03c6_i \u2248 \u03c6_i\u2032. In this situation, a suboptimal solution will not be prematurely enforced by weights\nwith large, spurious variance across a correlated group of basis vectors. 
Instead, weights will differ\nsubstantially only when the corresponding columns have meaningful differences relative to the\ndictionary as a whole, in which case such differences can help to avoid overshrinkage as before.\nA crucial exception to this perspective occurs when \u03b3(t+1) is highly sparse, or nearly so, in which\ncase there are limited degrees of freedom with which to model even small differences between some\n\u03c6_i and \u03c6_i\u2032. However, such cases can generally only occur when we are in the neighborhood of ideal,\nmaximally sparse solutions by de\ufb01nition [35], when different weights are actually desirable even\namong correlated columns for resolving the \ufb01nal sparse estimates.\n\n2.3 Revised SBL Iterations\n\nAlthough presumably there are multiple ways such an architecture could be developed, in this section\nwe derive specialized SBL iterations that will directly map to one of the most common RNN structures,\nnamely LSTM networks. With this in mind, the notation we adopt has been intentionally chosen to\nfacilitate later association with LSTM cells. We \ufb01rst de\ufb01ne\n\nw(t) \u225c diag[\u03a6\u22a4(\u03bbI + \u03a6\u0393(t)\u03a6\u22a4)^(\u22121)\u03a6]^(1/2)  and  \u03bd(t) \u225c u(t) + \u00b5\u03a6\u22a4(y \u2212 \u03a6u(t)),    (8)\n\nwhere \u0393(t) \u225c diag[\u03b3(t)], u(t) \u225c \u0393(t)\u03a6\u22a4(\u03bbI + \u03a6\u0393(t)\u03a6\u22a4)^(\u22121)y, and \u00b5 > 0 is a constant. As\nwill be discussed further below, w(t) serves the exact same role as the weights from (7), hence the\nidentical notation. 
We then partition our revised SBL iterations as so-called gate updates\n\n\u03c3_in(t) \u2190 [\u03b1(\u03b3(t)) \u2299 (|\u03bd(t)| \u2212 2\u03bbw(t))]_+,    \u03c3_f(t) \u2190 \u03b2(\u03b3(t)),    \u03c3_out(t) \u2190 (w(t))^(\u22121),    (9)\n\ncell updates\n\n\u00afx(t+1) \u2190 sign[\u03bd(t)],    x(t+1) \u2190 \u03c3_f(t) \u2299 x(t) + \u03c3_in(t) \u2299 \u00afx(t+1),    (10)\n\nand output updates\n\n\u03b3(t+1) \u2190 \u03c3_out(t) \u2299 |x(t+1)|,    (11)\n\nwhere the inverse and absolute-value operators are applied element-wise when a vector is the argument,\nand at least for now, \u03b1 and \u03b2 de\ufb01ne arbitrary functions. Moreover, \u2299 denotes the Hadamard product\nand [\u00b7]_+ sets negative values to zero and leaves positive quantities unchanged, also in an element-wise\nfashion, i.e., it acts just like a recti\ufb01ed linear (ReLU) unit [29]. Note also that the gate and cell updates\nin isolation can be viewed as computing a \ufb01rst-order, partial solution to the inner-loop weighted \u21131\noptimization problem from (6).\nStarting from some initial \u03b3(0) and x(0), we will demonstrate in the next section that these computations\nclosely mirror a canonical LSTM network unfolded in time with y acting as a constant input\napplied at each step. Before doing so however, we must \ufb01rst demonstrate that (8)\u2212(11) indeed serve\nto reduce the SBL objective. For this purpose we require the following de\ufb01nition:\n\nDe\ufb01nition 2. We say that the iterations (8)\u2212(11) satisfy the monotone cell update property if\n\n\u2016y \u2212 \u03a6u(t)\u2016_2^2 + 2\u03bb \u2211_i w_i(t) |u_i(t)| \u2265 \u2016y \u2212 \u03a6x(t+1)\u2016_2^2 + 2\u03bb \u2211_i w_i(t) |x_i(t+1)|, \u2200t.    (12)\n\nNote that for rather inconsequential technical reasons this de\ufb01nition involves u(t), which can be\nviewed as a proxy for x(t). We then have the following:\n\nProposition 3. 
The iterations (8)\u2212(11) will reduce or leave unchanged (5) for all t provided that\n\u00b5 \u2208 (0, \u03bb/\u2016\u03a6\u22a4\u03a6\u2016] and \u03b1 and \u03b2 are chosen such that the monotone cell update property holds.\n\nIn practical terms, the simple selections \u03b1(\u03b3) = 1 and \u03b2(\u03b3) = 0 will provably satisfy the monotone\ncell update property (see proof details in the supplementary). However, for additional \ufb02exibility, \u03b1\nand \u03b2 could be selected to implement various forms of momentum, ultimately leading to cell updates\nakin to the popular FISTA [4] or monotonic FISTA [3] algorithms. In both cases, old values x(t) are\nprecisely mixed with new factors \u00afx(t+1) to speed convergence (in the present circumstances, \u03c3_f(t)\nand \u03c3_in(t) respectively modulate this mixing process via (10)). Of course the whole point of casting\nthe SBL iterations as an RNN structure to begin with is so that we may ultimately learn these types\nof functions, without the need for hand-crafting suboptimal iterations up front.\n\n2.4 Correspondences with LSTM Components\n\nWe will now \ufb02esh out how the SBL iterations presented in Section 2.3 display the same structure as a\ncanonical LSTM cell, the only differences being the shape of the nonlinearities, and the exact details\nof the gate subnetworks. To facilitate this objective, Figure 1 contains a canonical LSTM network\nstructure annotated with SBL-derived quantities. We now walk through these correspondences.\nFirst, the exogenous input to the network is the observation vector y, which does not change from\ntime-step to time-step. This is much like the strategy used by feedback networks for obtaining\nincrementally re\ufb01ned representations [40]. The output at time-step t is \u03b3(t), which serves as the\ncurrent estimate of the SBL hyperparameters. 
In contrast, we treat x(t) as the internal LSTM memory\ncell, or the latent cell state.^2 This deference to \u03b3(t) directly mirrors the emphasis SBL places on\nlearning variances per the marginalized cost from (5) while treating x(t) as hidden data, and in some\nsense \ufb02ips the coef\ufb01cient-centric script used in producing (6).^3\nProceeding further, \u03b3(t) is fed to four separate layers/subnetworks (represented by yellow boxes in\nFigure 1): (i) the forget gate \u03c3_f(t), (ii) the input gate \u03c3_in(t), (iii) the output gate \u03c3_out(t), and (iv) the\ncandidate input update \u00afx(t). The forget gate computes scaling factors for each element of x(t), with\nsmall values of the gate output suggesting that we \u2018forget\u2019 the corresponding old cell state elements.\nSimilarly the input gate determines how large we rescale signals from the candidate input update \u00afx(t).\nThese two re-weighted quantities are then mixed together to form the new cell state x(t+1). Finally,\nthe output gate modulates how new \u03b3(t+1) are created as scaled versions of the updated cell state.\nRegarding details of these four subnetworks, based on the update templates from (9) and (10), we\nimmediately observe that the required quantities depend directly on (8). Fortunately, both \u03bd(t) and\nw(t) can be naturally computed using simple feedforward subnetwork structures.^4 These values can\neither be computed in full (ideal case), or partially to reduce the computational burden. In any event,\nonce obtained, the respective gates and candidate cell input updates can be computed by applying\n\ufb01nal non-linearities. 
Note that \u03b1 and \u03b2 are treated as arbitrary subnetwork structures at this point\nthat can be learned.\n\n^2 If we allow for peephole connections [18], it is possible to reverse these roles; however, for simplicity and\nthe most direct mapping to LSTM cells we do not pursue this alternative here.\n\n^3 Incidentally, this association also suggests that the role of hidden cell updates in LSTM networks can be\nreinterpreted as an analog to the expectation step (or E-step) for estimating hidden data in a suitably structured\nEM algorithm.\n\n^4 For w(t) the result of Proposition 1 suggests that these weights can be computed as the solution of a simple\nregularized regression problem, which can easily be replaced with a small network analogous to that used in\n[18]; similarly for \u03bd(t).\n\nA few cosmetic differences remain between this SBL implementation and a canonical LSTM network.\nFirst, the \ufb01nal non-linearity for LSTM gating subnetworks is often a sigmoidal activation, whereas\nSBL is \ufb02exible with the forget gate (via \u03b2), while effectively using a ReLU unit for the input gate\nand an inverse function for the output gate. Moreover, for the candidate cell update subnetwork, SBL\nreplaces the typical tanh nonlinearity with a quantized version, the sign function, and likewise, for the\noutput nonlinearity an absolute value operator (abs) is used. Finally, in terms of internal subnetwork\nstructure, there is some parameter sharing since \u03c3_in(t), \u03c3_out(t), and \u00afx(t) are connected via \u03bd(t) and w(t).\nOf course in all cases we need not necessarily share parameters nor abide by these exact structures.\nIn fact there is nothing inherently optimal about the particular choices used by SBL; rather it is\nmerely that these structures happen to reproduce the successful, yet hand-crafted SBL iterations. 
But\ncertainly there is potential in replacing such iterations with learned LSTM-like surrogates, at least\nwhen provided with access to suf\ufb01cient training data as in prior attempts to learn sparse estimation\nalgorithms [20, 34, 38].\n\nFigure 1: LSTM/SBL Network (legend: subnetwork, pointwise operation, vector transfer, concatenate, copy)\n\nFigure 2: SBL Dynamics (top: w magnitudes versus iteration number; bottom: x magnitudes versus iteration number)\n\n3 Extension to Gated Feedback Networks\n\nAlthough SBL iterations can be molded into an LSTM structure as we have shown, there remain hints\nthat the full potential of this association may be presently undercooked. Here we \ufb01rst empirically\nexamine the trajectories of SBL iterations produced via the LSTM-like rules derived in Section 2.3.\nThis process will later serve to unmask certain characteristic dynamics operating across different time\nscales that are suggestive of a richer class of gated recurrent network structures inspired by sequence\nprediction tasks [12].\n\n3.1 Trajectory Analysis of SBL Iterations\n\nTo begin, Figure 2 displays sample trajectories of w(t) \u2208 R^100 (top) and x(t) \u2208 R^100 (bottom) during\nexecution of (8)\u2212(11) on a simple representative problem, where each colored line represents a\ndifferent element w_i(t) or |x_i(t)| respectively. All details of the data generation process, as well as\ncomprehensive attendant analyses, are deferred to the supplementary. To summarize here though, in\nthe top plot the elements of w(t), which represent the non-negative weights forming the outer-loop\nmajorization step from (6) and re\ufb02ect coarse correlation structure in \u03a6, converge very quickly (\u223c3-5\niterations). 
Moreover, the observed bifurcation of magnitudes ultimately helps to screen many (but\nnot necessarily all) elements of x(t) that are the most likely to be zero in the maximally sparse\nrepresentation (i.e., a stable, higher weighting value w_i(t) is likely to eventually cause x_i(t) \u2192 0). In\ncontrast, the actual coef\ufb01cients x(t) themselves converge much more slowly, with \ufb01nal destinations\nstill unclear even after 50+ iterations. 
Hence w(t) need not be continuously updated after rapid initial\nconvergence, provided that we retain a memory of the optimal value during periods when it is static.\nThis discrepancy in convergence rates occurs in part because, as mentioned previously, the gate and\ncell updates do not fully solve the inner-loop weighted \u21131 optimization needed to compute a globally\noptimal x(t+1) given w(t). Varying the number of inner-loop iterations, meaning additional executions\nof (8)\u2212(11) with w(t) \ufb01xed, is one heuristic for normalizing across different trajectory frequencies,\nbut this requires additional computational overhead, and prior knowledge is needed to micro-manage\niteration counts for either ef\ufb01ciency or \ufb01nal estimation quality. With respect to the latter, we conduct\nadditional experiments in the supplementary which reveal that indeed the number of inner-loop\nupdates per outer-loop cycle can affect the quality of sparse solutions, with no discernible rule of\nthumb for enhancing solution quality.^5 For example, navigating around suboptimal local minima\ncould require adaptively adjusting the number of inner-loop iterations in subtle, non-obvious ways. We\ntherefore arrive at an unresolved state of affairs:\n\n1. The latent variables which de\ufb01ne SBL iterations can potentially follow optimization trajectories\nwith radically different time scales, or both long- and short-term dependencies.\n\n2. 
But there is no intrinsic mechanism within the SBL framework itself (or most multi-loop\noptimization problems in general either) for automatically calibrating the differing time scales\nfor optimal performance.\n\nThese same issues are likely to arise in other non-convex multi-loop optimization algorithms as well.\nIt therefore behooves us to consider a broader family of model structures that can adapt to these\nscales in a data-dependent fashion.\n\n3.2 Modeling via Gated Feedback Nets\nIn addressing this fundamental problem, we make the following key observation: If the trajectories of\nvarious latent variables can be interpreted as activations passing through an RNN with both long- and\nshort-term dependencies, then in developing a pipeline for optimizing such trajectories it makes sense\nto consider learning deep architectures explicitly designed to adaptively model such characteristic\nsequences. Interestingly, in the context of sequence prediction, the clockwork RNN (CW-RNN) has\nbeen proposed to cope with temporal dependencies engaged across multiple scales [25]. As shown in\nthe supplementary however, the CW-RNN enforces dynamics synced to pre-determined clock rates\nexactly analogous to the \ufb01xed, manual schedule for terminating inner-loops in existing multi-loop\niterative algorithms such as SBL. So we are back at our starting point.\nFortunately though, the gated feedback RNN (GF-RNN) [12] was recently developed to update the\nCW-RNN with an additional set of gated connections that, in effect, allow the network to learn\nits own clock rates. In brief, the GF-RNN involves stacked LSTM layers (or somewhat simpler\ngated recurrent unit (GRU) layers [11]), that are permitted to communicate bilaterally via additional,\ndata-dependent gates that can open and close on different time-scales. In the context of SBL, this\nmeans that we no longer need strain a specialized LSTM structure with the burden of coordinating\ntrajectory dynamics. 
Instead, we can stack layers that are, at least from a conceptual standpoint,\ndesigned to re\ufb02ect the different dynamics of disparate variable sets such as w(t) or x(t). In doing\nso, we are then positioned to learn new SBL update rules from training pairs {y, x\u2217} as described\npreviously. At the very least, this structure should include SBL-like iterations within its capacity, but\nof course it is also free to explore something even better.\n\n3.3 Network Design and Training Protocol\n\nWe stack two gated recurrent layers loosely designed to mimic the relatively fast SBL adaptation to\nbasic correlation structure, as well as the slower resolution of \ufb01nal support patterns and coef\ufb01cient\nestimates. These layers are formed from either LSTM or GRU base architectures. For the \ufb01nal output\nlayer we adopt a multi-label classi\ufb01cation loss for predicting supp[x\u2217], which is the well-known \u2018NP-hard\u2019\npart of sparse estimation (determining \ufb01nal coef\ufb01cient amplitudes just requires a simple least\nsquares \ufb01t given the correct support pattern). Full network details are deferred to the supplementary,\nincluding special modi\ufb01cations to handle complex data as required by DOA applications.\nFor a given dictionary \u03a6 a separate network must be trained via SGD, to which we add a unique\nextra dimension of randomness via an online stochastic data-generation strategy. In particular, to\ncreate samples in each mini-batch, we \ufb01rst generate a vector x\u2217 with random support pattern and\nnonzero amplitudes. We then compute y = \u03a6x\u2217 + \u03b5, where \u03b5 is a small Gaussian noise component.\nThis y forms a training input sample, while supp[x\u2217] represents the corresponding labels. 
For all\nmini-batches, novel samples are drawn, which we have found boosts performance considerably over\nthe fixed training sets used by current DNN approaches to sparse estimation (see supplementary).\n\n[Footnote 5] In brief, these experiments demonstrate a situation where executing either 1, 10, or 1000 inner-loop iterations\nper outer loop fails to produce the optimal solution, while 100 inner-loop iterations is successful.\n\n4 Experiments\nThis section presents experiments involving synthetic data and two applications.\n\n(a) Strict Accuracy\n\n(b) Loose Accuracy\n\n(c) Architecture Comparisons\n\n(d) DOA\n\nFigure 3: Plots (a), (b), and (c) show sparse recovery results involving synthetic correlated dictionaries.\nPlot (d) shows Chamfer distance-based errors [7] from the direction-of-arrival (DOA) experiment.\n\n4.1 Evaluations via Synthetic Correlated Dictionaries\nTo reproduce experiments from [38], we generate correlated synthetic features via\n\u03a6 = \u2211_{i=1}^{n} (1/i^2) u_i v_i^\u22a4, where u_i \u2208 R^n and v_i \u2208 R^m are drawn iid from a unit Gaussian distribution,\nand each column of \u03a6 is subsequently rescaled to unit \u2113_2 norm. Ground truth samples x\u2217 have\nd nonzero elements drawn randomly from U[\u22120.5, 0.5] excluding the interval [\u22120.1, 0.1]. We use\nn=20, m=100, and vary d, with larger values producing a much harder combinatorial estimation\nproblem (exhaustive search is not feasible here). All algorithms are presented with y and attempt\nto estimate supp[x\u2217]. 
We evaluate using strict accuracy, meaning the percentage of trials with exact\nsupport recovery, and loose accuracy, which quantifies the percentage of true positives among the top\nn \u2018guesses\u2019 (i.e., largest predicted outputs).\nFigures 3(a) and 3(b) evaluate our model, averaged across 10^5 trials, against an array of optimization-based\napproaches: SBL [33], \u2113_1 norm minimization [4], and IHT [5]; and existing learning-based\nDNN models: an ISTA-inspired network [20], an IHT-inspired network [34], and the best maximal\nsparsity net (MaxSparseNet) from [38] (detailed settings in the supplementary). With regard to strict\naccuracy, only SBL is somewhat competitive with our approach and other learning-based models\nare much worse; however, using loose accuracy our method is far superior to all others. Note that\nthis is the first approach we are aware of in the literature that can convincingly outperform SBL\nin recovering sparse solutions when a heavily correlated dictionary is present, and we hypothesize that\nthis is largely possible because our design principles were directly inspired by SBL itself.\nTo isolate architectural factors affecting performance we conducted ablation studies: (i) with or\nwithout gated feedback, (ii) LSTM or GRU cells, and (iii) small or large (4\u00d7) model size; for each\nmodel type, the small and respectively large versions have roughly the same number of parameters.\nThe supplementary also contains a much broader set of self-comparison tests. Figure 3(c), which\nshows strict accuracy results with d/n = 0.4, indicates the importance of gated feedback and to a\nlesser degree network size, while LSTM and GRU cells perform similarly as expected.\n\n4.2 Practical Application I: Direction-of-Arrival (DOA) Estimation\nDOA estimation is a fundamental problem in sonar/radar processing [28]. 
Given an array of n\nomnidirectional sensors with d signal waves impinging upon them, the objective is to estimate the\nangular direction of the wave sources with respect to the sensors. For certain array geometries and\nknown propagation mediums, estimation of these angles can be mapped directly to solving (2) in the\ncomplex domain. In this scenario, the i-th column of \u03a6 represents the sensor array output (a point in\nC^n) from a hypothetical source with unit strength at angular location \u03b8_i, and can be computed using\nthe wave propagation formula [28]. The entire dictionary can be constructed by concatenating columns\nassociated with angles forming some spacing of interest, e.g., every 1\u00b0 across a half circle, and will\nbe highly correlated. Given measurements y \u2208 C^n, we can solve (2), with \u03bb reflecting the noise level.\nThe indexes of nonzero elements of x\u2217 will then reveal the angular locations/directions of putative\nsources.\n\n[Figure 3 plot data omitted. Axes: (a) d/n vs. correct support recovery; (b) d/n vs. support recovery rate; (c) model index vs. correct support recovery; (d) SNR (dB) vs. Chamfer distance. Legend notes: in (a), IHT, ISTA-Net, and IHT-Net are all zero; in (b), they are all below 0.6.]\n\nRecently SBL-based algorithms have produced state-of-the-art results solving the DOA problem\n[14, 19, 39], and we compare our approach against SBL here. We apply a typical experimental\ndesign from the literature involving a uniform linear array with n = 10 sensors; see supplementary\nfor background and details on how to compute \u03a6, as well as specifics on how to adapt and train\nour GFLSTM using complex data. 
Four sources are then placed in random angular locations, with\nnonzero coefficients at {\u00b11 \u00b1 i}, and we compute measurements y = \u03a6x\u2217 + \u03b5, with \u03b5 chosen from\na complex Gaussian distribution to produce different SNR levels. Because the nonzero positions in x\u2217 now\nhave physical meaning, we apply the Chamfer distance [7] as the error metric, which quantifies how\nclose we are to true source locations (lower is better). Figure 3(d) displays the results, where our\nlearned network outperforms SBL across a range of SNR values.\n\nTable 1: Photometric stereo results\n\nAlgorithm\n\nMaxSparseNet\n\nSBL\n\nOurs\n\nAverage angular error (degrees)\nBunny\nr=20\n1.86\n1.95\n1.55\n\nCaesar\nr=20\n2.07\n2.51\n1.80\n\nr=40\n0.50\n1.20\n1.12\n\nr=10\n4.79\n3.51\n2.39\n\nr=10\n4.02\n1.48\n1.35\n\nr=40\n0.34\n1.18\n0.60\n\nRuntime (sec.)\n\nBunny\nr=20\n\nCaesar\nr=20\n\nr=40\n\nr=10\n\nr=10\nr=40\n35.46 22.66 32.20 86.96 64.67 90.48\n0.90\n2.20\n2.08\n0.63\n\n2.13\n1.48\n\n2.12\n1.70\n\n0.92\n0.85\n\n0.87\n0.67\n\n4.3 Practical Application II: 3D Geometry Recovery via Photometric Stereo\nPhotometric stereo represents another application domain whereby approximately solving (2) using\nSBL has recently produced state-of-the-art results [24]. The objective here is to recover the 3D\nsurface normals of a given scene using r images taken from a single camera but with different lighting\nconditions. Under the assumption that these images can be approximately decomposed into a diffuse\nLambertian component and sparse corruptions such as shadows and specular highlights, surface\nnormals at each pixel can be recovered using (2) to isolate these sparse factors followed by a final\nleast squares post-processing step [24]. 
In this context, \u03a6 is constructed using the known camera and\nlighting geometry, and y represents intensity measurements for a given pixel across images projected\nonto the nullspace of a special transposed lighting matrix (see supplementary for more details and\nour full experimental design). However, because a sparse regression problem must be solved for\nevery pixel to recover the full scene geometry, a fast, efficient solver is paramount.\nWe compare our GFLSTM model against both SBL and the MaxSparseNet [38] (both of which\noutperform other existing methods). Tests are performed using the 32-bit HDR gray-scale images\nof objects \u2018Bunny\u2019 (256 \u00d7 256) and \u2018Caesar\u2019 (300 \u00d7 400) as in [24]. For (very) weakly-supervised\ntraining data, we apply the same approach as before, only we use nonzero magnitudes drawn from a\nGaussian, with mean and variance loosely tuned to the photometric stereo data, consistent with [38].\nResults are shown in Table 1, where we observe that in all cases the DNN models are faster by a wide\nmargin, and in the hard cases (smaller r) our approach produces the lowest angular error. The\nonly exception is with r = 40; however, this is a quite easy scenario with so many images that\nSBL can readily find a near optimal solution, albeit at a high computational cost. See supplementary\nfor error surface visualizations.\n\n5 Conclusion\nIn this paper we have examined the structural similarities between multi-loop iterative algorithms\nand multi-scale sequence prediction neural networks. This association is suggestive of a learning\nprocess for a richer class of algorithms that employ multiple loops and latent states, such as the EM\nalgorithm or general majorization-minimization approaches. 
For example, in a narrower sense, we\nhave demonstrated that specialized gated recurrent nets carefully patterned to reflect the multi-scale\noptimization trajectories of multi-loop SBL iterations can lead to a considerable boost in both accuracy\nand efficiency. Note that simpler first-order, gradient descent-style algorithms can be ineffective\nwhen applied to sparsity-promoting energy functions with a combinatorial number of bad local\noptima and highly concave or non-differentiable surfaces in the neighborhood of minima. Moreover,\nimplementing smoother approximations such as SBL with gradient descent is impractical since each\ngradient calculation would be prohibitively expensive. Therefore, recent learning-to-learn approaches\nsuch as [1] that explicitly rely on gradient calculations are difficult to apply in the present setting.\n\nAcknowledgments\nThis work was accomplished while Hao He was an intern at Microsoft Research, Beijing.\n\nReferences\n[1] M. Andrychowicz, M. Denil, S. Gomez, M.W. Hoffman, D. Pfau, T. Schaul, B. Shillingford,\nand N. de Freitas. Learning to learn by gradient descent by gradient descent. arXiv:1606.04474,\n2016.\n\n[2] S. Baillet, J.C. Mosher, and R.M. Leahy. Electromagnetic brain mapping. IEEE Signal\nProcessing Magazine, pages 14\u201330, Nov. 2001.\n\n[3] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image\ndenoising and deblurring problems. IEEE Trans. Image Processing, 18(11), 2009.\n\n[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse\nproblems. SIAM J. Imaging Sciences, 2(1), 2009.\n\n[5] T. Blumensath and M.E. Davies. Iterative hard thresholding for compressed sensing. Applied\nand Computational Harmonic Analysis, 27(3), 2009.\n\n[6] T. Blumensath and M.E. Davies. Normalized iterative hard thresholding: Guaranteed stability\nand performance. IEEE J. 
Selected Topics Signal Processing, 4(2), 2010.\n\n[7] G. Borgefors. Distance transformations in arbitrary dimensions. Computer Vision, Graphics,\nand Image Processing, 27(3):321\u2013345, 1984.\n\n[8] E. Cand\u00e8s, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction\nfrom highly incomplete frequency information. IEEE Trans. Information Theory, 52(2):489\u2013509,\nFeb. 2006.\n\n[9] E. Cand\u00e8s and T. Tao. Decoding by linear programming. IEEE Trans. Information Theory,\n51(12), 2005.\n\n[10] E. Cand\u00e8s, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted \u2113_1 minimization. J.\nFourier Anal. Appl., 14(5):877\u2013905, 2008.\n\n[11] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning\nphrase representations using RNN encoder-decoder for statistical machine translation. Conference\non Empirical Methods in Natural Language Processing, 2014.\n\n[12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. In\nInternational Conference on Machine Learning, 2015.\n\n[13] S.F. Cotter and B.D. Rao. Sparse channel estimation via matching pursuit with application to\nequalization. IEEE Trans. on Communications, 50(3), 2002.\n\n[14] J. Dai, X. Bao, W. Xu, and C. Chang. Root sparse Bayesian learning for off-grid DOA estimation.\nIEEE Signal Processing Letters, 24(1), 2017.\n\n[15] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the\nEM algorithm. J. Royal Statistical Society, Series B (Methodological), 39(1):1\u201338, 1977.\n\n[16] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties.\nJ. American Statistical Assoc., 96, 2001.\n\n[17] M.A.T. Figueiredo. Adaptive sparseness using Jeffreys prior. NIPS, 2002.\n\n[18] F.A. Gers and J. Schmidhuber. Recurrent nets that time and count. International Joint Conference\non Neural Networks, 2000.\n\n[19] P. 
Gerstoft, C.F. Mecklenbrauker, A. Xenaki, and S. Nannuru. Multi snapshot sparse Bayesian\nlearning for DOA. IEEE Signal Processing Letters, 23(20), 2016.\n\n[20] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, 2010.\n\n[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8), 1997.\n\n[22] D.R. Hunter and K. Lange. A tutorial on MM algorithms. American Statistician, 58(1), 2004.\n\n[23] S. Ikehata, D.P. Wipf, Y. Matsushita, and K. Aizawa. Robust photometric stereo using sparse\nregression. In Computer Vision and Pattern Recognition, 2012.\n\n[24] S. Ikehata, D.P. Wipf, Y. Matsushita, and K. Aizawa. Photometric stereo using sparse Bayesian\nregression for general diffuse surfaces. IEEE Trans. Pattern Analysis and Machine Intelligence,\n36(9):1816\u20131831, 2014.\n\n[25] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork RNN. International\nConference on Machine Learning, 2014.\n\n[26] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1992.\n\n[27] D.M. Malioutov, M. \u00c7etin, and A.S. Willsky. Sparse signal reconstruction perspective for source\nlocalization with sensor arrays. IEEE Trans. Signal Processing, 53(8), 2005.\n\n[28] D.G. Manolakis, V.K. Ingle, and S.M. Kogon. Statistical and Adaptive Signal Processing.\nMcGraw-Hill, Boston, 2000.\n\n[29] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. International\nConference on Machine Learning, 2010.\n\n[30] P. Sprechmann, A.M. Bronstein, and G. Sapiro. Learning efficient sparse and low rank models.\nIEEE Trans. Pattern Analysis and Machine Intelligence, 37(9), 2015.\n\n[31] B.K. Sriperumbudur and G.R.G. Lanckriet. A proof of convergence of the concave-convex\nprocedure using Zangwill\u2019s theory. Neural computation, 24, 2012.\n\n[32] R. Tibshirani. Regression shrinkage and selection via the lasso. J. 
of the Royal Statistical\nSociety, 1996.\n\n[33] M.E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine\nLearning Research, 1, 2001.\n\n[34] Z. Wang, Q. Ling, and T. Huang. Learning deep \u2113_0 encoders. AAAI Conference on Artificial\nIntelligence, 2016.\n\n[35] D.P. Wipf. Sparse estimation with structured dictionaries. Advances in Neural Information\nProcessing 24, 2012.\n\n[36] D.P. Wipf and S. Nagarajan. Iterative reweighted \u2113_1 and \u2113_2 methods for finding sparse solutions.\nJournal of Selected Topics in Signal Processing (Special Issue on Compressive Sensing), 4(2),\nApril 2010.\n\n[37] R.J. Woodham. Photometric method for determining surface orientation from multiple images.\nOptical Engineering, 19(1), 1980.\n\n[38] B. Xin, Y. Wang, W. Gao, and D.P. Wipf. Maximal sparsity with deep networks? Advances in\nNeural Information Processing Systems 29, 2016.\n\n[39] Z. Yang, L. Xie, and C. Zhang. Off-grid direction of arrival estimation using sparse Bayesian\ninference. IEEE Trans. Signal Processing, 61(1):38\u201343, 2013.\n\n[40] A.R. Zamir, T.L. Wu, L. Sun, W. Shen, J. Malik, and S. Savarese. Feedback networks.\narXiv:1612.09508, 2016.\n", "award": [], "sourceid": 2863, "authors": [{"given_name": "Hao", "family_name": "He", "institution": "MIT"}, {"given_name": "Bo", "family_name": "Xin", "institution": "Microsoft Research"}, {"given_name": "Satoshi", "family_name": "Ikehata", "institution": "National Institute of Informatics"}, {"given_name": "David", "family_name": "Wipf", "institution": "Microsoft Research"}]}