{"title": "Higher-Order Total Variation Classes on Grids: Minimax Theory and Trend Filtering Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 5800, "page_last": 5810, "abstract": "We consider the problem of estimating the values of a function over $n$ nodes of a $d$-dimensional grid graph (having equal side lengths $n^{1/d}$) from noisy observations. The function is assumed to be smooth, but is allowed to exhibit different amounts of smoothness at different regions in the grid. Such heterogeneity eludes classical measures of smoothness from nonparametric statistics, such as Holder smoothness. Meanwhile, total variation (TV) smoothness classes allow for heterogeneity, but are restrictive in another sense: only constant functions count as perfectly smooth (achieve zero TV). To move past this, we define two new higher-order TV classes, based on two ways of compiling the discrete derivatives of a parameter across the nodes. We relate these two new classes to Holder classes, and derive lower bounds on their minimax errors. We also analyze two naturally associated trend filtering methods; when $d=2$, each is seen to be rate optimal over the appropriate class.", "full_text": "Higher-Order Total Variation Classes on Grids:\nMinimax Theory and Trend Filtering Methods\n\nVeeranjaneyulu Sadhanala\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nvsadhana@cs.cmu.edu\n\nYu-Xiang Wang\n\nCarnegie Mellon University/Amazon AI\nPittsburgh, PA 15213/Palo Alto, CA 94303\n\nyuxiangw@amazon.com\n\nJames Sharpnack\n\nUniversity of California, Davis\n\nDavis, CA 95616\n\njsharpna@ucdavis.edu\n\nRyan J. 
Tibshirani\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nryantibs@stat.cmu.edu\n\nAbstract\n\n
We consider the problem of estimating the values of a function over n nodes of a d-dimensional grid graph (having equal side lengths n^{1/d}) from noisy observations. The function is assumed to be smooth, but is allowed to exhibit different amounts of smoothness at different regions in the grid. Such heterogeneity eludes classical measures of smoothness from nonparametric statistics, such as Holder smoothness. Meanwhile, total variation (TV) smoothness classes allow for heterogeneity, but are restrictive in another sense: only constant functions count as perfectly smooth (achieve zero TV). To move past this, we define two new higher-order TV classes, based on two ways of compiling the discrete derivatives of a parameter across the nodes. We relate these two new classes to Holder classes, and derive lower bounds on their minimax errors. We also analyze two naturally associated trend filtering methods; when d = 2, each is seen to be rate optimal over the appropriate class.\n\n
1 Introduction\n\n
In this work, we focus on estimation of a mean parameter defined over the nodes of a d-dimensional grid graph G = (V, E), with equal side lengths N = n^{1/d}. Let us enumerate V = {1, . . . , n} and E = {e_1, . . . , e_m}, and consider data y = (y_1, . . . , y_n) ∈ R^n observed over V, distributed as\n\ny_i ∼ N(θ_{0,i}, σ^2), independently, for i = 1, . . . , n,    (1)\n\nwhere θ_0 = (θ_{0,1}, . . . , θ_{0,n}) ∈ R^n is the mean parameter to be estimated, and σ^2 > 0 the common noise variance. We will assume that θ_0 displays some kind of regularity or smoothness over G, and are specifically interested in notions of regularity built around the total variation (TV) operator\n\n‖Dθ‖_1 = ∑_{(i,j) ∈ E} |θ_i − θ_j|,    (2)\n\ndefined with respect to G, where D ∈ R^{m×n} is the edge incidence matrix of G, which has ℓth row D_ℓ = (0, . . . , −1, . . . , 1, . . . , 0), with −1 in location i and 1 in location j, provided that the ℓth edge is e_ℓ = (i, j) with i < j. There is an extensive literature on estimators based on TV regularization, both in Euclidean spaces and over graphs. Higher-order TV regularization, which, loosely speaking, considers the TV of derivatives of the parameter, is much less understood, especially over graphs. In this paper, we develop statistical theory for higher-order TV smoothness classes, and we analyze associated trend filtering methods, which are seen to achieve the minimax optimal estimation error rate over such classes. This can be viewed as an extension of the work in [22] for the zeroth-order TV case, where by “zeroth-order”, we refer to the usual TV operator as defined in (2).\n\n
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f
Motivation. TV denoising over grid graphs, specifically 1d and 2d grid graphs, is a well-studied problem in signal processing, statistics, and machine learning, some key references being [20, 5, 26]. Given data y ∈ R^n as per the setup described above, the TV denoising or fused lasso estimator over the grid G is defined as\n\nθ̂ = argmin_{θ ∈ R^n} (1/2)‖y − θ‖_2^2 + λ‖Dθ‖_1,    (3)\n\nwhere λ ≥ 0 is a tuning parameter. The TV denoising estimator generalizes seamlessly to arbitrary graphs. 
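To make (3) concrete, here is a minimal dense Python sketch that builds the grid edge incidence matrix D of (2) and solves (3) by ADMM. The helper names `grid_incidence` and `tv_denoise` are ours, and this is an illustration for small grids, not one of the specialized solvers from the references above.

```python
import numpy as np

def grid_incidence(N):
    """Edge incidence matrix D, as in (2), of an N x N grid graph."""
    n = N * N
    rows = []
    for i in range(N):
        for j in range(N):
            v = i * N + j
            if j + 1 < N:  # horizontal edge (v, v + 1)
                r = np.zeros(n); r[v] = -1.0; r[v + 1] = 1.0; rows.append(r)
            if i + 1 < N:  # vertical edge (v, v + N)
                r = np.zeros(n); r[v] = -1.0; r[v + N] = 1.0; rows.append(r)
    return np.array(rows)

def tv_denoise(y, D, lam, rho=1.0, iters=2000):
    """ADMM for min_theta 0.5 * ||y - theta||_2^2 + lam * ||D theta||_1."""
    m, n = D.shape
    theta, z, u = y.copy(), D @ y, np.zeros(m)
    inv = np.linalg.inv(np.eye(n) + rho * D.T @ D)  # cache the theta-update solve
    for _ in range(iters):
        theta = inv @ (y + rho * D.T @ (z - u))
        v = D @ theta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft-threshold
        u = u + D @ theta - z
    return theta
```

At large λ the fit collapses toward the constant signal ȳ1, the extreme case of the piecewise constant structure that TV regularization induces.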
The problem of denoising over grids, the setting we focus on, is of particular relevance to a number of important applications, e.g., in time series analysis, and image and video processing.\n\n
A strength of the nonlinear TV denoising estimator in (3)—where by “nonlinear”, we mean that θ̂ is nonlinear as a function of y—is that it can adapt to heterogeneity in the local level of smoothness of the underlying signal θ_0. Moreover, it adapts to such heterogeneity to an extent that is beyond what linear estimators are capable of capturing. This principle is widely evident in practice and has been championed by many authors in the signal processing literature. It is also backed by statistical theory, i.e., [8, 16, 27] in the 1d setting, and most recently [22] in the general d-dimensional setting.\n\n
Note that the TV denoising estimator θ̂ in (3) takes a piecewise constant structure by design, i.e., at many adjacent pairs (i, j) ∈ E we will have θ̂_i = θ̂_j, and this will generally be more common for larger λ. For some problems, this structure may not be ideal and we might instead seek a piecewise smooth estimator, that is still able to cope with local changes in the underlying level of smoothness, but offers a richer structure (beyond a simple constant structure) for the base trend. In a 1d setting, this is accomplished by trend filtering methods, which move from piecewise constant to piecewise polynomial structure, via TV regularization of discrete derivatives of the parameter [24, 13, 27]. An extension of trend filtering to general graphs was developed in [31]. In what follows, we study the statistical properties of this graph trend filtering method over grids, and we propose and analyze a more specialized trend filtering estimator for grids based on the idea that something like a Euclidean coordinate system is available at any (interior) node. 
See Figure 1 for a motivating illustration.\n\n
Related work. The literature on TV denoising is enormous and we cannot give a comprehensive review, but only some brief highlights. Important methodological and computational contributions are found in [20, 5, 26, 4, 10, 6, 28, 15, 7, 12, 1, 25], and notable theoretical contributions are found in [16, 19, 9, 23, 11, 22, 17]. The literature on higher-order TV-based methods is sparser and more concentrated on the 1d setting. Trend filtering methods in 1d were pioneered in [24, 13], and analyzed statistically in [27], where they were also shown to be asymptotically equivalent to the locally adaptive regression splines of [16]. An extension of trend filtering to additive models was given in [21]. A generalization of trend filtering that operates over an arbitrary graph structure was given in [31]. Trend filtering is not the only avenue for higher-order TV regularization: the signal processing community has also studied higher-order variants of TV, see, e.g., [18, 3]. The construction of the discrete versions of these higher-order TV operators is somewhat similar to that in [31], as well as to our Kronecker trend filtering proposal; however, the focus of that work is quite different.\n\n
Summary of contributions. An overview of our contributions is given below.\n\n
• We propose a new method for trend filtering over grid graphs that we call Kronecker trend filtering (KTF), and compare its properties to the more general graph trend filtering (GTF) proposal of [31].\n\n
• For 2d grids, we derive estimation error rates for GTF and KTF, each of these rates being a function of the regularizer evaluated at the mean θ_0.\n\n
• For d-dimensional grids, we derive minimax lower bounds for estimation over two higher-order TV classes defined using the operators from GTF and KTF. 
When d = 2, these lower bounds match the upper bounds in rate (apart from log factors) derived for GTF and KTF, ensuring that each method is minimax rate optimal (modulo log factors) for its own notion of regularity. Also, the KTF class contains a Holder class of an appropriate order, and KTF is seen to be rate optimal (modulo log factors) for this more homogeneous class as well.\n\n\f
Figure 1: Top left: an underlying signal θ_0 and associated data y (shown as black points). Top middle and top right: Laplacian smoothing fit to y, at large and small tuning parameter values, respectively. Bottom left, middle, and right: TV denoising (3), graph trend filtering (5), and Kronecker trend filtering (5) fit to y, respectively (the latter two are of order k = 2, with penalty operators as described in Section 2). In order to capture the larger of the two peaks, Laplacian smoothing must significantly undersmooth throughout; with more regularization, it oversmooths both peaks. TV denoising is able to adapt to heterogeneity in the smoothness of the underlying signal, but exhibits “staircasing” artifacts, as it is restricted to fitting piecewise constant functions. Graph and Kronecker trend filtering overcome this, while maintaining local adaptivity.\n\n
Notation. For deterministic sequences a_n, b_n we write a_n = O(b_n) to denote that a_n/b_n is upper bounded for large enough n, and a_n ≍ b_n to denote that both a_n = O(b_n) and a_n^{−1} = O(b_n^{−1}). For random sequences A_n, B_n, we write A_n = O_P(B_n) to denote that A_n/B_n is bounded in probability. Given a d-dimensional grid G = (V, E), where V = {1, . . . , n}, as before, we will sometimes index a parameter θ ∈ R^n defined over the nodes in the following convenient way. Letting N = n^{1/d} and Z_d = {(i_1/N, . . . , i_d/N) : i_1, . . . , i_d ∈ {1, . . . , N}} ⊆ [0, 1]^d, we will index the components of θ by their lattice positions, denoted θ(x), x ∈ Z_d. Further, for each j = 1, . . . , d, we will define the discrete derivative of θ in the jth coordinate direction at a location x by\n\n(D_{x_j} θ)(x) = θ(x + e_j/N) − θ(x) if x, x + e_j/N ∈ Z_d, and (D_{x_j} θ)(x) = 0 else.    (4)\n\nNaturally, we denote by D_{x_j} θ ∈ R^n the vector with components (D_{x_j} θ)(x), x ∈ Z_d. Higher-order discrete derivatives are simply defined by repeated application of the above definition. We use abbreviations (D_{x_j^2} θ)(x) = (D_{x_j}(D_{x_j} θ))(x), for j = 1, . . . , d, and (D_{x_j, x_ℓ} θ)(x) = (D_{x_j}(D_{x_ℓ} θ))(x), for j, ℓ = 1, . . . , d, and so on.\n\n
Given an estimator θ̂ of the mean parameter θ_0 in (1), and K ⊆ R^n, two quantities of interest are:\n\nMSE(θ̂, θ_0) = (1/n)‖θ̂ − θ_0‖_2^2 and R(K) = inf_{θ̂} sup_{θ_0 ∈ K} E[MSE(θ̂, θ_0)].\n\nThe first quantity here is called the mean squared error (MSE) of θ̂; we will also call E[MSE(θ̂, θ_0)] the risk of θ̂. The second quantity is called the minimax risk over K (the infimum being taken over all estimators θ̂).\n\n\f
2 Trend filtering methods\n\nReview: graph trend filtering. 
To review the family of estimators developed in [31], we start by introducing a general-form estimator called the generalized lasso signal approximator [28],\n\nθ̂ = argmin_{θ ∈ R^n} (1/2)‖y − θ‖_2^2 + λ‖∆θ‖_1,    (5)\n\nfor a matrix ∆ ∈ R^{r×n}, referred to as the penalty operator. For an integer k ≥ 0, the authors [31] defined the graph trend filtering (GTF) estimator of order k by (5), with the penalty operator being\n\n∆^{(k+1)} = D L^{k/2} for k even, and ∆^{(k+1)} = L^{(k+1)/2} for k odd.    (6)\n\nHere, as before, we use D for the edge incidence matrix of G. We also use L = D^T D for the graph Laplacian matrix of G. The intuition behind the above definition is that ∆^{(k+1)}θ gives something roughly like the (k + 1)st order discrete derivatives of θ over the graph G.\n\nNote that the GTF estimator reduces to TV denoising in (3) when k = 0. Also, like TV denoising, GTF applies to arbitrary graph structures; see [31] for more details and for the study of GTF over general graphs. 
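To make (6) concrete, the operator can be formed directly from D and L = D^T D; a minimal dense Python sketch follows (the helper name `gtf_operator` is ours), with a check that constant signals incur zero penalty:

```python
import numpy as np

def gtf_operator(D, k):
    """GTF penalty operator of order k, as in (6): D L^{k/2} for k even, L^{(k+1)/2} for k odd."""
    L = D.T @ D  # graph Laplacian
    if k % 2 == 0:
        return D @ np.linalg.matrix_power(L, k // 2)
    return np.linalg.matrix_power(L, (k + 1) // 2)
```

For k = 0 this returns D itself, recovering the TV denoising penalty in (3); for k = 1 it returns the graph Laplacian L. Since L1 = 0 and D1 = 0, every order annihilates constant signals.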
Our interest is of course its behavior over grids, and we will now use the notation introduced in (4), to shed more light on the GTF penalty operator in (6) over a d-dimensional grid. For any signal θ ∈ R^n, we can write ‖∆^{(k+1)}θ‖_1 = ∑_{x ∈ Z_d} d_x, where at all points x ∈ Z_d (except for those close to the boundary),\n\nd_x = ∑_{j_1=1}^d |∑_{j_2,...,j_q=1}^d (D_{x_{j_1}, x_{j_2}^2, ..., x_{j_q}^2} θ)(x)|, for k even, where q = k/2 + 1;\nd_x = |∑_{j_1,...,j_q=1}^d (D_{x_{j_1}^2, ..., x_{j_q}^2} θ)(x)|, for k odd, where q = (k + 1)/2.    (7)\n\nWritten in this form, it appears that the GTF operator ∆^{(k+1)} aggregates derivatives in somewhat of an unnatural way. But we must remember that for a general graph structure, only first derivatives and divergences have obvious discrete analogs—given by application of D and L, respectively. Hence, GTF, which was originally designed for general graphs, relies on combinations of D and L to produce something like higher-order discrete derivatives. This explains the form of the aggregated derivatives in (6), which is entirely based on divergences.\n\n
Kronecker trend filtering. There is a natural alternative to the GTF penalty operator that takes advantage of the Euclidean-like structure available at the (interior) nodes of a grid graph. At a point x ∈ Z_d (not close to the boundary), consider using\n\n∑_{j=1}^d |(D_{x_j^{k+1}} θ)(x)|    (8)\n\nas a basic building block for penalizing derivatives, rather than (7). This gives rise to a method we call Kronecker trend filtering (KTF), which for an integer order k ≥ 0 is defined by (5), but now with the choice of penalty operator\n\n∆̃^{(k+1)} = [ D^{(k+1)}_{1d} ⊗ I ⊗ ··· ⊗ I;  I ⊗ D^{(k+1)}_{1d} ⊗ ··· ⊗ I;  . . . ;  I ⊗ I ⊗ ··· ⊗ D^{(k+1)}_{1d} ],    (9)\n\nwith the d row blocks stacked vertically. Here, D^{(k+1)}_{1d} ∈ R^{(N−k−1)×N} is the 1d discrete derivative operator of order k + 1 (e.g., as used in univariate trend filtering, see [27]), I ∈ R^{N×N} is the identity matrix, and A ⊗ B is the Kronecker product of matrices A, B. Each group of rows in (9) features a total of d − 1 Kronecker products.\n\n
KTF reduces to TV denoising in (3) when k = 0, and thus also to GTF with k = 0. But for k ≥ 1, GTF and KTF are different estimators. A look at the action of their penalty operators, as displayed in (7), (8), reveals some of their differences. For example, we see that GTF considers mixed derivatives of total order k + 1, but KTF only considers directional derivatives of order k + 1 that are parallel to the coordinate axes. Also, GTF penalizes aggregate derivatives (i.e., sums of derivatives), whereas KTF penalizes individual ones.\n\n
More subtle differences between GTF and KTF have to do with the structure of their estimates, as we discuss next. Another subtle difference lies in how the GTF and KTF operators (6), (9) relate to more classical notions of smoothness, particularly, Holder smoothness. This is covered in Section 4.\n\n
Structure of estimates. It is straightforward to see that the GTF operator (6) has a 1-dimensional null space, spanned by 1 = (1, . . . , 1) ∈ R^n. 
This means that GTF lets constant signals pass through unpenalized, but nothing else; or, in other words, it preserves the projection of y onto the space of constant signals, ȳ1, but nothing else. The KTF operator, meanwhile, has a much richer null space.\n\n
Lemma 1. The null space of the KTF operator (9) has dimension (k + 1)^d, and it is spanned by a polynomial basis made up of elements\n\np(x) = x_1^{a_1} x_2^{a_2} ··· x_d^{a_d}, x ∈ Z_d, where a_1, . . . , a_d ∈ {0, . . . , k}.\n\n
The proof is elementary and (as with all proofs in this paper) is given in the supplement. The lemma shows that KTF preserves the projection of y onto the space of polynomials of max degree k, i.e., lets much more than just constant signals pass through unpenalized.\n\n
Beyond the differences in these base trends (represented by their null spaces), GTF and KTF admit estimates with similar but generally different structures. KTF has the advantage that this structure is more transparent: its estimates are piecewise polynomial functions of max degree k, with generally fewer pieces for larger λ. This is demonstrated by a functional representation for KTF, given next.\n\n
Lemma 2. Let h_i : [0, 1] → R, i = 1, . . . , N be the (univariate) falling factorial functions [27, 30] of order k, defined over the knots t_ℓ = ℓ/N, ℓ = 1, . . . , N:\n\nh_i(t) = ∏_{ℓ=1}^{i−1} (t − t_ℓ), t ∈ [0, 1], i = 1, . . . , k + 1,\nh_{i+k+1}(t) = ∏_{ℓ=1}^{k} (t − (i + ℓ)/N) · 1{t > (i + k)/N}, t ∈ [0, 1], i = 1, . . . , N − k − 1.    (10)\n\n(For k = 0, our convention is for the empty product to equal 1.) Let H_d be the space spanned by all d-wise tensor products of falling factorial functions, i.e., H_d contains f : [0, 1]^d → R of the form\n\nf(x) = ∑_{i_1,...,i_d=1}^N α_{i_1,...,i_d} h_{i_1}(x_1) h_{i_2}(x_2) ··· h_{i_d}(x_d), x ∈ [0, 1]^d,\n\nfor coefficients α ∈ R^n (whose components we index by α_{i_1,...,i_d}, for i_1, . . . , i_d = 1, . . . , N). Then the KTF estimator defined in (5), (9) is equivalent to the functional optimization problem\n\nf̂ = argmin_{f ∈ H_d} (1/2) ∑_{x ∈ Z_d} (y(x) − f(x))^2 + λ ∑_{j=1}^d ∑_{x_{−j} ∈ Z_{d−1}} TV(∂^k f(·, x_{−j}) / ∂x_j^k),    (11)\n\nwhere f(·, x_{−j}) denotes f as a function of the jth dimension with all other dimensions fixed at x_{−j}, ∂^k/∂x_j^k (·) denotes the kth partial weak derivative operator with respect to x_j, for j = 1, . . . , d, and TV(·) denotes the total variation operator. The discrete (5), (9) and functional (11) representations are equivalent in that f̂ and θ̂ match at all grid locations x ∈ Z_d.\n\n
Aside from shedding light on the structure of KTF solutions, the functional optimization problem in (11) is of practical importance: the function f̂ is defined over all of [0, 1]^d (as opposed to θ̂, which is of course only defined on the grid Z_d) and thus we may use it to interpolate the KTF estimate to non-grid locations. It is not clear to us that a functional representation as in (11) (or even a sensible interpolation strategy) is available for GTF on d-dimensional grids.\n\n\f
3 Upper bounds on estimation error\n\nIn this section, we assume that d = 2, and derive upper bounds on the estimation error of GTF and KTF for 2d grids. 
Upper bounds for generalized lasso estimators were studied in [31], and we will leverage one of their key results, which is based on what these authors call incoherence of the left singular vectors of the penalty operator ∆. A slightly refined version of this result is stated below.\n\n
Theorem 1 (Theorem 6 in [31]). Suppose that ∆ ∈ R^{r×n} has rank q, and denote by ξ_1 ≤ . . . ≤ ξ_q its nonzero singular values. Also let u_1, . . . , u_q be the corresponding left singular vectors. Assume that these vectors, except for the first i_0, are incoherent, meaning that for a constant µ ≥ 1,\n\n‖u_i‖_∞ ≤ µ/√n, i = i_0 + 1, . . . , q.\n\nThen for λ ≍ µ √((log r / n) ∑_{i=i_0+1}^q ξ_i^{−2}), the generalized lasso estimator θ̂ in (5) satisfies\n\nMSE(θ̂, θ_0) = O_P( nullity(∆)/n + i_0/n + (µ/n) √((log r / n) ∑_{i=i_0+1}^q 1/ξ_i^2) · ‖∆θ_0‖_1 ).\n\n
For GTF and KTF, we will apply this result, balancing an appropriate choice of i_0 with the partial sum of reciprocal squared singular values ∑_{i=i_0+1}^q ξ_i^{−2}. The main challenge, as we will see, is in establishing incoherence of the singular vectors.\n\n
Error bounds for graph trend filtering. The authors in [31] have already used Theorem 1 (their Theorem 6) in order to derive error rates for GTF on 2d grids. However, their results (specifically, their Corollary 8) can be refined using a tighter upper bound for the partial sum term ∑_{i=i_0+1}^q ξ_i^{−2}. No real further tightening is possible, since, as we show later, the results below match the minimax lower bound in rate, up to log factors.\n\n
Theorem 2. Assume that d = 2. For k = 0, C_n = ‖∆^{(1)}θ_0‖_1 (i.e., C_n equal to the TV of θ_0, as in (2)), and λ ≍ log n, the GTF estimator in (5), (6) (i.e., the TV denoising estimator in (3)) satisfies\n\nMSE(θ̂, θ_0) = O_P(1/n + (log n / n) C_n).\n\nFor any integer k ≥ 1, C_n = ‖∆^{(k+1)}θ_0‖_1 and λ ≍ n^{k/(k+2)} (log n)^{1/(k+2)} C_n^{−k/(k+2)}, GTF satisfies\n\nMSE(θ̂, θ_0) = O_P(1/n + n^{−2/(k+2)} (log n)^{1/(k+2)} C_n^{2/(k+2)}).\n\n
Remark 1. The result for k = 0 in Theorem 2 was essentially already established by [11] (a small difference is that the above rate is sharper by a factor of log n; though to be fair, [11] also take into account ℓ_0 sparsity). It is interesting to note that the case k = 0 appears to be quite special, in that the GTF estimator, i.e., TV denoising estimator, is adaptive to the underlying smoothness parameter C_n (the prescribed choice of tuning parameter λ ≍ log n does not depend on C_n).\n\n
The technique for upper bounding ∑_{i=i_0+1}^q ξ_i^{−2} in the proof of Theorem 2 can be roughly explained as follows. The GTF operator ∆^{(k+1)} on a 2d grid has squared singular values\n\n(4 sin^2(π(i_1 − 1)/(2N)) + 4 sin^2(π(i_2 − 1)/(2N)))^{k+1}, i_1, i_2 = 1, . . . , N.\n\nWe can upper bound the sum of reciprocal squared singular values with an integral over [0, 1]^2, make use of the inequality sin x ≥ x/2 for small enough x, and then switch to polar coordinates to calculate the integral (similar to [11], in analyzing TV denoising). The arguments to verify incoherence of the left singular vectors of ∆^{(k+1)} are themselves somewhat delicate, but were already given in [31].\n\n
Error bounds for Kronecker trend filtering. In comparison to the GTF case, the application of Theorem 1 to KTF is a much more difficult task, because (to the best of our knowledge) the KTF operator ∆̃^{(k+1)} does not admit closed-form expressions for its singular values and vectors. This is true in any dimension (i.e., even for d = 1, where KTF reduces to univariate trend filtering). As it turns out, the singular values can be handled with a relatively straightforward application of the Cauchy interlacing theorem. It is establishing the incoherence of the singular vectors that proves to be the real challenge. This is accomplished by leveraging specialized approximation bounds for the eigenvectors of Toeplitz matrices from [2].\n\n\f
Theorem 3. Assume that d = 2. For k = 0, since KTF reduces to GTF with k = 0 (and to TV denoising), it satisfies the result stated in the first part of Theorem 2.\n\nFor any integer k ≥ 1, C_n = ‖∆̃^{(k+1)}θ_0‖_1 and λ ≍ n^{k/(k+2)} (log n)^{1/(k+2)} C_n^{−k/(k+2)}, the KTF estimator in (5), (9) satisfies\n\nMSE(θ̂, θ_0) = O_P(1/n + n^{−2/(k+2)} (log n)^{1/(k+2)} C_n^{2/(k+2)}).\n\n
The results in Theorems 2 and 3 match, in terms of their dependence on n, k, d and the smoothness parameter C_n. As we will see in the next section, the smoothness classes defined by the GTF and KTF operators are similar, though not exactly the same, and each of GTF and KTF is minimax rate optimal with respect to its own smoothness class, up to log factors.\n\n
Beyond 2d? To analyze GTF and KTF on grids of dimension d ≥ 3, we would need to establish incoherence of the left singular vectors of the GTF and KTF operators. This should be possible by
This should be possible by\nextending the arguments given in [31] (for GTF) and in the proof of Theorem 3 (for KTF), and is left\nto future work.\n\n4 Lower bounds on estimation error\n\nWe present lower bounds on the minimax estimation error over smoothness classes de\ufb01ned by the\noperators from GTF (6) and KTF (9), denoted\n\n(cid:101)T k\nd (Cn) = {\u03b8 \u2208 Rn : (cid:107)(cid:101)\u2206(k+1)\u03b8(cid:107)1 \u2264 Cn},\nd (Cn) = {\u03b8 \u2208 Rn : (cid:107)\u2206(k+1)\u03b8(cid:107)1 \u2264 Cn},\nT k\n\n(12)\n(13)\nrespectively (where the subscripts mark the dependence on the dimension d of the underlying grid\ngraph). Before we derive such lower bounds, we examine embeddings of (the discretization of) the\nclass of Holder smooth functions into the GTF and KTF classes, both to understand the nature of\nthese new classes, and to de\ufb01ne what we call a \u201ccanonical\u201d scaling for the radius parameter Cn.\nEmbedding of Holder spaces and canonical scaling. Given an integer k \u2265 0 and L > 0, recall\nthat the Holder class H(k + 1, L; [0, 1]d) contains k times differentiable functions f : [0, 1]d \u2192 R,\nsuch that for all integers \u03b11, . . . , \u03b1d \u2265 0 with \u03b11 + \u00b7\u00b7\u00b7 + \u03b1d = k,\n\n\u2202kf (x)\n1 \u00b7\u00b7\u00b7 \u2202x\u03b1d\n\n(cid:12)(cid:12)(cid:12)(cid:12) \u2264 L(cid:107)x \u2212 z(cid:107)2,\n\n(cid:12)(cid:12)(cid:12)(cid:12)\n(L) =(cid:8)\u03b8 \u2208 Rn : \u03b8(x) = f (x), x \u2208 Zd, for some f \u2208 H(k + 1, L; [0, 1]d)(cid:9).\n\nfor all x, z \u2208 [0, 1]d.\n\n\u2202kf (z)\n1 \u00b7\u00b7\u00b7 \u2202x\u03b1d\n\nTo compare Holder smoothness with the GTF and KTF classes de\ufb01ned in (12), (13), we discretize\nthe class H(k + 1, L; [0, 1]d) by considering function evaluations over the grid Zd, de\ufb01ning\n\n(14)\nNow we ask: how does the (discretized) Holder class in (14) compare to the GTF and KTF classes\nin (12), (13)? 
Beginning with a comparison to KTF, \ufb01x \u03b8 \u2208 Hk+1\n(L), corresponding to evaluations\nof f \u2208 H(k + 1, L; [0, 1]d), and consider a point x \u2208 Zd that is away from the boundary. Then the\n\nHk+1\n\n\u2202x\u03b11\n\n\u2202x\u03b11\n\n\u2212\n\nd\n\nd\n\nd\n\nd\n\nKTF penalty at x is(cid:12)(cid:12)(cid:0)Dxk+1\n\nj\n\n\u03b8(cid:1)(x)(cid:12)(cid:12) =(cid:12)(cid:12)(cid:0)Dxk\n\u03b8(cid:1)(x + ej/N ) \u2212(cid:0)Dxk\n(cid:12)(cid:12)(cid:12)(cid:12) \u2202k\n\nf (x + ej/N ) \u2212 \u2202k\n\u2202xk\nj\n\n\u2264 N k\n\u2264 LN k\u22121 + cLN k\u22121.\n\n\u2202xk\nj\n\nj\n\nj\n\n\u03b8(cid:1)(x)(cid:12)(cid:12)\n(cid:12)(cid:12)(cid:12)(cid:12) + N k\u03b4(N )\n\nf (x)\n\n(15)\nIn the second line above, we de\ufb01ne \u03b4(N ) to be the sum of absolute errors in the discrete approxima-\ntions to the partial derivatives (i.e., the error in approximating \u2202kf (x)/\u2202xk\n\u03b8)(x)/N k, and\nsimilarly at x + ej/N). In the third line, we use Holder smoothness to upper bound the \ufb01rst term,\nand we use standard numerical analysis (details in the supplement) for the second term to ensure that\n\u03b4(N ) \u2264 cL/N for a constant c > 0 depending only on k. Summing the bound in (15) over x \u2208 Zd\nas appropriate gives a uniform bound on the KTF penalty at \u03b8, and leads to the next result.\n\nj by (Dxk\n\nj\n\n7\n\n\f(L) \u2286 (cid:101)T k\n\nd\n\nd\n\n(1) \u2286 (cid:101)T k\n\nd (cLn1\u2212(k+1)/d), where c > 0 is a constant depending only on k.\n\nLemma 3. For any integers k \u2265 0, d \u2265 1, the (discretized) Holder and KTF classes de\ufb01ned in (14),\n(13) satisfy Hk+1\nThis lemma has three purposes. First, it provides some supporting evidence that the KTF class is an\ninteresting smoothness class to study, as it shows the KTF class contains (discretizations of) Holder\nsmooth functions, which are a cornerstone of classical nonparametric regression theory. 
In fact, this containment is strict, and the KTF class contains more heterogeneous functions as well. Second, it leads us to define what we call the canonical scaling $C_n \asymp n^{1-(k+1)/d}$ for the radius of the KTF class (13). This will be helpful for interpreting our minimax lower bounds in what follows; at this scaling, note that we have $H^{k+1}_d(1) \subseteq \widetilde{T}^k_d(C_n)$. Third and finally, it gives us an easy way to establish lower bounds on the minimax estimation error over KTF classes, by invoking well-known results on minimax rates for Hölder classes. This will be described shortly.

As for GTF, calculations similar to (15) are possible, but complications ensue for $x$ on the boundary of the grid $Z_d$. Importantly, unlike the KTF penalty, the GTF penalty includes discrete derivatives at the boundary, and these complications have serious consequences, as stated next.

Lemma 4. For any integers $k, d \ge 1$, there are elements in the (discretized) Hölder class $H^{k+1}_d(1)$ in (14) that do not lie in the GTF class $T^k_d(C_n)$ in (12) for arbitrarily large $C_n$.

This lemma reveals a very subtle drawback of GTF caused by the use of discrete derivatives at the boundary of the grid. The fact that GTF classes do not contain (discretized) Hölder classes makes them seem less natural (and perhaps, in a sense, less interesting) than KTF classes. In addition, it means that we cannot use standard minimax theory for Hölder classes to establish lower bounds for the estimation error over GTF classes. However, as we will see next, we can construct lower bounds for GTF classes via another (more purely geometric) embedding strategy; interestingly, the resulting rates match the Hölder rates, suggesting that, while GTF classes do not contain all (discretized) Hölder functions, they do contain "enough" of these functions to admit the same lower bound rates.

Minimax rates for GTF and KTF classes.
Following from classical minimax theory for Hölder classes [14, 29], and Lemma 3, we have the following result for the minimax rates over KTF classes.

Theorem 4. For any integers $k \ge 0$, $d \ge 1$, the KTF class defined in (13) has minimax estimation error satisfying

$$R\big(\widetilde{T}^k_d(C_n)\big) = \Omega\big(n^{-\frac{2d}{2k+2+d}} C_n^{\frac{2d}{2k+2+d}}\big).$$

For GTF classes, we use a different strategy. We embed an ellipse, then rotate the parameter space and embed a hypercube, leading to the following result.

Theorem 5. For any integers $k \ge 0$, $d \ge 1$, the GTF class defined in (12) has minimax estimation error satisfying

$$R\big(T^k_d(C_n)\big) = \Omega\big(n^{-\frac{2d}{2k+2+d}} C_n^{\frac{2d}{2k+2+d}}\big).$$

Several remarks are in order.

Remark 2. Plugging the canonical scaling $C_n \asymp n^{1-(k+1)/d}$ into Theorems 4 and 5, and noting that $n^{-\frac{2d}{2k+2+d}} \cdot n^{(1-(k+1)/d)\frac{2d}{2k+2+d}} = n^{-\frac{2k+2}{2k+2+d}}$, we see that

$$R\big(\widetilde{T}^k_d(C_n)\big) = \Omega\big(n^{-\frac{2k+2}{2k+2+d}}\big) \quad \text{and} \quad R\big(T^k_d(C_n)\big) = \Omega\big(n^{-\frac{2k+2}{2k+2+d}}\big),$$

both matching the usual rate for the Hölder class $H^{k+1}_d(1)$. For KTF, this should be expected, as its lower bound is constructed via the Hölder embedding given in Lemma 3. But for GTF, it may come as somewhat of a surprise: despite the fact that it does not embed a Hölder class, as per Lemma 4, the GTF class shares the same rate, suggesting it still contains something like the "hardest" Hölder smooth signals.

Remark 3. For d = 2 and all $k \ge 0$, we can certify that the lower bound rate in Theorem 4 is tight, modulo log factors, by comparing it to the upper bound in Theorem 3. Likewise, we can certify that the lower bound rate in Theorem 5 is tight, up to log factors, by comparing it to the upper bound in Theorem 2. For $d \ge 3$, the lower bound rates in Theorems 4 and 5 will not be tight for some values of k.
For example, when k = 0, at the canonical scaling $C_n \asymp n^{1-1/d}$, the lower bound rate (given by either theorem) is $n^{-2/(2+d)}$; however, [22] prove that the minimax error of the TV class scales (up to log factors) as $n^{-1/d}$ for $d \ge 2$, so we see there is a departure in the rates for $d \ge 3$.

Figure 2: Illustration of the two higher-order TV classes, namely the GTF and KTF classes, as they relate to the (discretized) Hölder class. The horizontally/vertically checkered region denotes the part of the Hölder class not contained in the GTF class. As explained in Section 4, this is due to the fact that the GTF operator penalizes discrete derivatives on the boundary of the grid graph. The diagonally checkered region (also colored in blue) denotes the part of the Hölder class contained in the GTF class. The minimax lower bound rates we derive for the GTF class in Theorem 5 match the well-known Hölder rates, suggesting that this region is actually sizeable and contains the "hardest" Hölder smooth signals.

In general, we conjecture that the Hölder embedding for the KTF class (and the ellipse embedding for GTF) will deliver tight lower bound rates, up to log factors, when k is large enough compared to d. This would have interesting implications for adaptivity to smoother signals (see the next remark); a precise study will be left to future work, along with tight minimax lower bounds for all k, d.

Remark 4. Again by comparing Theorems 3 and 4, as well as Theorems 2 and 5, we find that, for d = 2 and all $k \ge 0$, KTF is rate optimal for the KTF smoothness class and GTF is rate optimal for the GTF smoothness class, modulo log factors. We conjecture that this will continue to hold for all $d \ge 3$, which will be examined in future work.
Moreover, an immediate consequence of Theorem 3 and the Hölder embedding in Lemma 3 is that KTF adapts automatically to Hölder smooth signals, i.e., it achieves a rate (up to log factors) of $n^{-(k+1)/(k+2)}$ over $H^{k+1}_2(1)$ (the general rate $n^{-(2k+2)/(2k+2+d)}$ evaluated at d = 2), matching the well-known minimax rate for the more homogeneous Hölder class. It is not clear that GTF shares this property.

5 Discussion

In this paper, we studied two natural higher-order extensions of the TV estimator on d-dimensional grid graphs. The first was graph trend filtering (GTF) as defined in [31], applied to grids; the second was a new Kronecker trend filtering (KTF) method, which was built with the special (Euclidean-like) structure of grids in mind. GTF and KTF exhibit some similarities, but differ in important ways. Notably, the notion of smoothness defined using the KTF operator is somewhat more natural, and is a strict generalization of the standard notion of Hölder smoothness (in the sense that the KTF smoothness class strictly contains a Hölder class of an appropriate order). This is not true for the notion of smoothness defined using the GTF operator. Figure 2 gives an illustration.

When d = 2, we derived tight upper bounds for the estimation error achieved by the GTF and KTF estimators, tight in the sense that these upper bounds match in rate (modulo log factors) the lower bounds on the minimax estimation errors for the GTF and KTF classes. We constructed the lower bound for the KTF class by leveraging the fact that it embeds a Hölder class; for the GTF class, we used a different (more geometric) embedding. While these constructions proved to be tight for d = 2 and all $k \ge 0$, we suspect this will no longer be the case in general, when d is large enough relative to k.
We will examine this in future work, along with upper bounds for GTF and KTF when $d \ge 3$. Another important direction for future work is the study of minimax linear rates over GTF and KTF classes, i.e., minimax rates when we restrict our attention to linear estimators. We anticipate that a gap will exist between minimax linear and nonlinear rates for all k, d (as it does for k = 0, as shown in [22]). This would, e.g., provide some rigorous backing to the claim that the KTF class is larger than its embedded Hölder class (the latter having matching minimax linear and nonlinear rates).

Acknowledgements. We thank Sivaraman Balakrishnan for helpful discussions regarding minimax rates for Hölder classes on grids. JS was supported by NSF Grant DMS-1712996. VS, YW, and RT were supported by NSF Grants DMS-1309174 and DMS-1554123.

References

[1] Alvaro Barbero and Suvrit Sra. Modular proximal optimization for multidimensional total-variation regularization. arXiv: 1411.0589, 2014.

[2] Johan M. Bogoya, Albrecht Böttcher, Sergei M. Grudsky, and Egor A. Maximenko. Eigenvectors of Hermitian Toeplitz matrices with smooth simple-loop symbols. Linear Algebra and its Applications, 493:606-637, 2016.

[3] Kristian Bredies, Karl Kunisch, and Thomas Pock. Total generalized variation. SIAM Journal on Imaging Sciences, 3(3):492-526, 2010.

[4] Antonin Chambolle and Jerome Darbon. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision, 84:288-307, 2009.

[5] Antonin Chambolle and Pierre-Louis Lions. Image recovery via total variation minimization and related problems. Numerische Mathematik, 76(2):167-188, 1997.

[6] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.
Journal of Mathematical Imaging and Vision, 40:120-145, 2011.

[7] Laurent Condat. A direct algorithm for 1d total variation denoising. HAL: 00675043, 2012.

[8] David L. Donoho and Iain M. Johnstone. Minimax estimation via wavelet shrinkage. Annals of Statistics, 26(8):879-921, 1998.

[9] Zaid Harchaoui and Celine Levy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480-1493, 2010.

[10] Holger Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4):984-1006, 2010.

[11] Jan-Christian Hutter and Philippe Rigollet. Optimal rates for total variation denoising. Annual Conference on Learning Theory, 29:1115-1146, 2016.

[12] Nicholas Johnson. A dynamic programming algorithm for the fused lasso and l0-segmentation. Journal of Computational and Graphical Statistics, 22(2):246-260, 2013.

[13] Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339-360, 2009.

[14] Aleksandr P. Korostelev and Alexandre B. Tsybakov. Minimax Theory of Image Reconstruction. Springer, 2003.

[15] Arne Kovac and Andrew Smith. Nonparametric regression on a graph. Journal of Computational and Graphical Statistics, 20(2):432-447, 2011.

[16] Enno Mammen and Sara van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387-413, 1997.

[17] Oscar Hernan Madrid Padilla, James Sharpnack, James Scott, and Ryan J. Tibshirani. The DFS fused lasso: Linear-time denoising over general graphs. arXiv: 1608.03384, 2016.

[18] Christiane Pöschl and Otmar Scherzer. Characterization of minimizers of convex regularization functionals. In Frames and Operator Theory in Analysis and Signal Processing, volume 451, pages 219-248.
AMS eBook Collections, 2008.

[19] Alessandro Rinaldo. Properties and refinements of the fused lasso. Annals of Statistics, 37(5):2922-2952, 2009.

[20] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259-268, 1992.

[21] Veeranjaneyulu Sadhanala and Ryan J. Tibshirani. Additive models via trend filtering. arXiv: 1702.05037, 2017.

[22] Veeranjaneyulu Sadhanala, Yu-Xiang Wang, and Ryan J. Tibshirani. Total variation classes beyond 1d: Minimax rates, and the limitations of linear smoothers. Advances in Neural Information Processing Systems, 29, 2016.

[23] James Sharpnack, Alessandro Rinaldo, and Aarti Singh. Sparsistency via the edge lasso. International Conference on Artificial Intelligence and Statistics, 15, 2012.

[24] Gabriel Steidl, Stephan Didas, and Julia Neumann. Splines in higher order TV regularization. International Journal of Computer Vision, 70(3):214-255, 2006.

[25] Wesley Tansey and James Scott. A fast and flexible algorithm for the graph-fused lasso. arXiv: 1505.06475, 2015.

[26] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91-108, 2005.

[27] Ryan J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285-323, 2014.

[28] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335-1371, 2011.

[29] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[30] Yu-Xiang Wang, Alexander Smola, and Ryan J. Tibshirani. The falling factorial basis and its statistical applications.
International Conference on Machine Learning, 31, 2014.

[31] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan J. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17(105):1-41, 2016.