{"title": "Stochastic Chebyshev Gradient Descent for Spectral Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 7386, "page_last": 7396, "abstract": "A large class of machine learning techniques requires the solution of optimization problems involving spectral functions of parametric matrices, e.g. log-determinant and nuclear norm. Unfortunately, computing the gradient of a spectral function is generally of cubic complexity, as such gradient descent methods are rather expensive for optimizing objectives involving the spectral function. Thus, one naturally turns to stochastic gradient methods in hope that they will provide a way to reduce or altogether avoid the computation of full gradients. However, here a new challenge appears: there is no straightforward way to compute unbiased stochastic gradients for spectral functions. In this paper, we develop unbiased stochastic gradients for spectral-sums, an important subclass of spectral functions. Our unbiased stochastic gradients are based on combining randomized trace estimators with stochastic truncation of the Chebyshev expansions. A careful design of the truncation distribution allows us to offer distributions that are variance-optimal, which is crucial for fast and stable convergence of stochastic gradient methods. We further leverage our proposed stochastic gradients to devise stochastic methods for objective functions involving spectral-sums, and rigorously analyze their convergence rate. 
The utility of our methods is demonstrated in numerical experiments.", "full_text": "Stochastic Chebyshev Gradient Descent\n\nfor Spectral Optimization\n\nInsu Han1, Haim Avron2 and Jinwoo Shin1,3\n\n1School of Electrical Engineering, Korea Advanced Institute of Science and Technology\n\n2Department of Applied Mathematics, Tel Aviv University\n\n3AItrics\n{insu.han,jinwoos}@kaist.ac.kr\n\nhaimav@post.tau.ac.il\n\nAbstract\n\nA large class of machine learning techniques requires the solution of optimization\nproblems involving spectral functions of parametric matrices, e.g. log-determinant\nand nuclear norm. Unfortunately, computing the gradient of a spectral function\nis generally of cubic complexity, as such gradient descent methods are rather\nexpensive for optimizing objectives involving the spectral function. Thus, one\nnaturally turns to stochastic gradient methods in hope that they will provide a way\nto reduce or altogether avoid the computation of full gradients. However, here\na new challenge appears: there is no straightforward way to compute unbiased\nstochastic gradients for spectral functions. In this paper, we develop unbiased\nstochastic gradients for spectral-sums, an important subclass of spectral functions.\nOur unbiased stochastic gradients are based on combining randomized trace esti-\nmators with stochastic truncation of the Chebyshev expansions. A careful design\nof the truncation distribution allows us to offer distributions that are variance-\noptimal, which is crucial for fast and stable convergence of stochastic gradient\nmethods. We further leverage our proposed stochastic gradients to devise stochastic\nmethods for objective functions involving spectral-sums, and rigorously analyze\ntheir convergence rate. 
The utility of our methods is demonstrated in numerical experiments.

1 Introduction

A large class of machine learning techniques involves spectral optimization problems of the form

min_{θ∈C} F(A(θ)) + g(θ),  (1)

where C is some finite-dimensional parameter space, A is a function that maps a parameter vector θ to a symmetric matrix A(θ), F is a spectral function (i.e., a real-valued function on symmetric matrices that depends only on the eigenvalues of the input matrix), and g : C → R. Examples include hyperparameter learning in Gaussian process regression with F(X) = log det X [22], nuclear norm regularization with F(X) = tr(X^{1/2}) [20], phase retrieval with F(X) = tr(X) [8], and quantum state tomography with F(X) = tr(X log X) [15]. In the aforementioned applications, the main difficulty in solving problems of the form (1) is in efficiently addressing the spectral component F(A(·)). While explicit formulas for the gradients of spectral functions can be derived [17], they are typically computationally expensive. For example, for F(X) = log det X and A(θ) ∈ R^{d×d}, the exact computation of ∇_θ F(A(θ)) can take as much as O(d^3 k) time, where k is the number of parameters in θ. Therefore, it is desirable to avoid computing, or at the very least reduce the number of times we compute, the gradient of F(A(θ)) exactly.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

It is now well appreciated in the machine learning literature that the use of stochastic gradients is effective in alleviating costs associated with expensive exact gradient computations. Using cheap stochastic gradients, one can avoid computing full gradients altogether by using Stochastic Gradient Descent (SGD). The cost is, naturally, a reduced rate of convergence.
Nevertheless, many machine learning applications require only mild suboptimality, in which case cheap iterations often outweigh the reduced convergence rate. When nearly optimal solutions are sought, more recent variance reduced methods (e.g. SVRG [14]) are effective in reducing the number of full gradient computations to O(1). For non-convex objectives, stochastic methods are even more attractive, as they may allow one to avoid bad local optima. However, closed-form formulas for computing the full gradients of spectral functions do not lead to efficient stochastic gradients in a straightforward manner.

Contribution. In this paper, we propose stochastic methods for solving (1) when the spectral function F is a spectral-sum. Formally, spectral-sums are spectral functions that can be expressed as F(X) = tr(f(X)), where f is a real-valued function lifted to the symmetric matrix domain by applying it to the eigenvalues. They constitute an important subclass of spectral functions; e.g., in all of the aforementioned applications of spectral optimization, the spectral function F is a spectral-sum. Our algorithms are based on recent biased estimators for spectral-sums that combine stochastic trace estimation with Chebyshev expansion [11]. The technique used to derive these estimators can also be used to derive stochastic estimators for the gradient of spectral-sums (e.g., see [7]), but the resulting estimator is biased. To address this issue, we propose an unbiased estimator for spectral-sums, and use it to derive unbiased stochastic gradients. Our unbiased estimator is based on randomly selecting the truncation degree in the Chebyshev expansion, i.e., the truncated polynomial degree is drawn under some distribution.
We remark that similar ideas of sampling unbiased polynomials have been studied in the literature, but for different setups [4, 16, 28, 25], none of which is suitable for use in our setup.

While deriving unbiased estimators is very useful for ensuring stable convergence of stochastic gradient methods, it is not sufficient: convergence rates of stochastic gradient descent methods depend on the variance of the stochastic gradients, and this can be rather large for naïve choices of degree distributions. Thus, our main contribution is in establishing a provably optimal degree distribution minimizing the estimators' variances with respect to the Chebyshev series. The proposed distribution gives order-of-magnitude smaller variances compared to other popular ones (Figure 1), which leads to improved convergence of the downstream optimization (Figure 2(c)).

We leverage our proposed unbiased estimators to design two stochastic gradient descent methods, one using the SGD framework and the other using the SVRG one. We rigorously analyze their convergence rates, showing sublinear and linear rates for SGD and SVRG, respectively. It is important to stress that our fast convergence results crucially depend on the proposed optimal degree distributions. Finally, we apply our algorithms to two machine learning tasks that involve spectral optimization: matrix completion and learning Gaussian processes. Our experimental results confirm that the proposed algorithms are significantly faster than other competitors under large-scale real-world instances. In particular, for learning a Gaussian process under the Szeged humid dataset, our generic method runs up to six times faster than the state-of-the-art method [7] specialized for the purpose.

2 Preliminaries

We denote the family of real symmetric matrices of dimension d by S^{d×d}.
For A ∈ S^{d×d}, we use ‖A‖_mv to denote the time-complexity of multiplying A with a vector, i.e., ‖A‖_mv = O(d^2). For some structured matrices, e.g. low-rank, sparse or Toeplitz matrices, it is possible to have ‖A‖_mv = o(d^2).

2.1 Chebyshev expansion

Let f : R → R be an analytic function on [a, b] for a, b ∈ R. Then, the Chebyshev series of f is given by

f(x) = \sum_{j=0}^{\infty} b_j T_j\Big(\frac{2}{b-a} x - \frac{b+a}{b-a}\Big),  b_j = \frac{2 - 1_{j=0}}{\pi} \int_{-1}^{1} \frac{f\big(\frac{b-a}{2} x + \frac{b+a}{2}\big) T_j(x)}{\sqrt{1-x^2}} dx.

In the above, 1_{j=0} = 1 if j = 0 and 0 otherwise, and T_j(x) is the Chebyshev polynomial (of the first kind) of degree j. An important property of the Chebyshev polynomials is the following recursive formula: T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x), T_1(x) = x, T_0(x) = 1. The Chebyshev series can be used to approximate f(x) by simply truncating the higher order terms, i.e., f(x) ≈ p_n(x) := \sum_{j=0}^{n} b_j T_j(\frac{2}{b-a} x - \frac{b+a}{b-a}). We call p_n(x) the truncated Chebyshev series of degree n. For analytic functions, the approximation error (in the uniform norm) is known to decay exponentially [26]. Specifically, if f is analytic with |f(\frac{b-a}{2} z + \frac{b+a}{2})| ≤ U for some U > 0 in the region bounded by the ellipse with foci +1, -1 and sum of major and minor semi-axis lengths equal to ρ > 1, then

|b_j| ≤ \frac{2U}{ρ^j}  ∀ j ≥ 0,    \sup_{x∈[a,b]} |f(x) - p_n(x)| ≤ \frac{4U}{(ρ - 1) ρ^n}.  (2)

2.2 Spectral-sums and their Chebyshev approximations

Given a matrix A ∈ S^{d×d} and a function f : R → R, the spectral-sum of A with respect to f is

Σ_f(A) := tr(f(A)) = \sum_{i=1}^{d} f(λ_i),

where tr(·) is the matrix trace and λ_1, λ_2, ..., λ_d are the eigenvalues of A. Spectral-sums constitute an important subclass of spectral functions, and many applications of spectral optimization involve spectral-sums. This is fortunate since spectral-sums can be well approximated using Chebyshev approximations.

For a general f, one needs all eigenvalues to compute Σ_f(A), while for some functions, simpler types of decomposition might suffice (e.g., log det A = Σ_log(A) can be computed using the Cholesky decomposition). Therefore, the general complexity of computing spectral-sums is O(d^3), which is clearly not feasible when d is very large, as is common in many machine learning applications.
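The truncated series p_n and the decay bound (2) can be checked numerically. Below is a minimal sketch (the helper names `cheb_coeffs` and `cheb_eval` are our own; Gauss-Chebyshev quadrature is one standard way to evaluate the integral defining b_j):

```python
import numpy as np

def cheb_coeffs(f, a, b, n, m=200):
    """Coefficients b_0..b_n of the Chebyshev series of f on [a, b], via
    m-point Gauss-Chebyshev quadrature applied to the integral defining b_j."""
    theta = (2 * np.arange(1, m + 1) - 1) * np.pi / (2 * m)  # quadrature angles
    fx = f(0.5 * (b - a) * np.cos(theta) + 0.5 * (b + a))    # f at mapped nodes
    j = np.arange(n + 1)[:, None]
    coeffs = (2.0 / m) * (np.cos(j * theta) * fx).sum(axis=1)
    coeffs[0] /= 2.0                                         # b_0 has weight 1, not 2
    return coeffs

def cheb_eval(coeffs, x, a, b):
    """Evaluate p_n(x) with the recursion T_{j+1}(y) = 2y T_j(y) - T_{j-1}(y)."""
    y = (2 * x - (b + a)) / (b - a)                          # affine map to [-1, 1]
    t_prev, t_cur = np.ones_like(y), y
    out = coeffs[0] * t_prev + coeffs[1] * t_cur
    for bj in coeffs[2:]:
        t_prev, t_cur = t_cur, 2 * y * t_cur - t_prev
        out = out + bj * t_cur
    return out

a, b = 0.05, 0.95
xs = np.linspace(a, b, 1000)
errs = [np.max(np.abs(cheb_eval(cheb_coeffs(np.log, a, b, n), xs, a, b) - np.log(xs)))
        for n in (5, 10, 20)]
print(errs)  # uniform errors shrink roughly geometrically in n, as (2) predicts
```

Here f(x) = log x on [0.05, 0.95] plays the role of the analytic function; the singularity at 0 determines the ellipse parameter ρ and hence the geometric rate.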
Hence, it is not surprising that recent literature proposed methods to approximate large-scale spectral-sums; e.g., [11] recently suggested a fast randomized algorithm for approximating spectral-sums based on Chebyshev series and Monte-Carlo trace estimators (i.e., Hutchinson's method [13]):

Σ_f(A) = tr(f(A)) ≈ tr(p_n(A)) = E_v[v^T p_n(A) v] ≈ \frac{1}{M} \sum_{k=1}^{M} v^{(k)T} \Big( \sum_{j=0}^{n} b_j w_j^{(k)} \Big),  (3)

where w_0^{(k)} = v^{(k)}, w_1^{(k)} = (\frac{2}{b-a} A - \frac{b+a}{b-a} I) v^{(k)}, w_{j+1}^{(k)} = 2 (\frac{2}{b-a} A - \frac{b+a}{b-a} I) w_j^{(k)} - w_{j-1}^{(k)}, and {v^{(k)}}_{k=1}^{M} are Rademacher random vectors, i.e., each coordinate of v^{(k)} is an i.i.d. random variable in {-1, 1} with equal probability 1/2 [13, 2, 24]. The approximation (3) can be computed using only matrix-vector multiplications, vector-vector inner-products and vector-vector additions, O(Mn) times each. Thus, the time-complexity becomes O(Mn‖A‖_mv + Mnd) = O(Mn‖A‖_mv). In particular, when Mn ≪ d and ‖A‖_mv = o(d^2), the cost can be significantly cheaper than the O(d^3) of exact computation. We further note that to apply the approximation (3), a bound on the eigenvalues is necessary. For an upper bound, one can use fast power methods [6]; this does not hurt the total algorithm complexity (see [10]). A lower bound can typically be enforced by substituting A with A + εI for some small ε > 0. We use these techniques in our numerical experiments.

We remark that one may consider other polynomial approximation schemes, e.g. Taylor, but we focus on the Chebyshev approximations since they are nearly optimal in approximation among polynomial series [19].
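For concreteness, here is a small sketch of the estimator (3) applied to Σ_log(A) = log det A (the function and variable names are our own; numpy's `chebinterpolate` is used as a convenient stand-in for computing truncated-series coefficients):

```python
import numpy as np
from numpy.polynomial import chebyshev

def hutch_cheb_trace(A, coeffs, a, b, M=100, seed=None):
    """Estimate tr(p_n(A)) = E_v[v^T p_n(A) v] as in (3): Rademacher probes plus the
    Chebyshev recursion w_{j+1} = 2*At@w_j - w_{j-1}, with At = (2A - (b+a)I)/(b-a)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    V = rng.choice([-1.0, 1.0], size=(d, M))          # M Rademacher probe vectors
    scale, shift = 2.0 / (b - a), (b + a) / (b - a)
    w_prev, w_cur = V, scale * (A @ V) - shift * V    # w_0 = v, w_1 = At v
    acc = coeffs[0] * w_prev + coeffs[1] * w_cur
    for bj in coeffs[2:]:
        w_prev, w_cur = w_cur, 2.0 * (scale * (A @ w_cur) - shift * w_cur) - w_prev
        acc += bj * w_cur
    return float(np.sum(V * acc)) / M                 # (1/M) sum_k v^(k)T p_n(A) v^(k)

# toy check: symmetric A with known spectrum inside [a, b]
rng = np.random.default_rng(0)
a, b, d = 0.1, 1.9, 200
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = rng.uniform(a + 0.05, b - 0.05, size=d)
A = (Q * lam) @ Q.T
coeffs = chebyshev.chebinterpolate(lambda y: np.log(0.5 * (b - a) * y + 0.5 * (b + a)), 25)
est = hutch_cheb_trace(A, coeffs, a, b, M=400, seed=1)
exact = float(np.sum(np.log(lam)))                    # log det A
print(est, exact)
```

Only matrix-vector products with A are used, matching the O(Mn‖A‖_mv) cost discussed above.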
Another recently suggested powerful technique is stochastic Lanczos quadrature [27]; however, it is not suitable for our needs (our bias removal technique is not applicable to it).

3 Stochastic Chebyshev gradients of spectral-sums

Our main goal is to develop scalable methods for solving the following optimization problem:

min_{θ ∈ C ⊆ R^{d'}}  Σ_f(A(θ)) + g(θ),  (4)

where C ⊆ R^{d'} is a non-empty, closed and convex domain, A : R^{d'} → S^{d×d} is a function of the parameter θ = [θ_i] ∈ R^{d'}, and g : R^{d'} → R is some function whose derivative with respect to any parameter θ_i is computationally easy to obtain. Gradient-descent type methods are natural candidates for tackling such problems. However, while it is usually possible to compute the gradient of Σ_f(A(θ)), this is typically very expensive. Thus, we turn to stochastic methods, like (projected) SGD [3, 31] and SVRG [14, 30]. In order to apply stochastic methods, one needs unbiased estimators of the gradient. The goal of this section is to propose a computationally efficient method to generate unbiased stochastic gradients of small variance for Σ_f(A(θ)).

3.1 Stochastic Chebyshev gradients

Biased stochastic gradients.
We begin by observing that if f is a polynomial itself or the Chebyshev approximation is exact, i.e., f(x) = p_n(x) = \sum_{j=0}^{n} b_j T_j(\frac{2}{b-a} x - \frac{b+a}{b-a}), we have

\frac{∂}{∂θ_i} Σ_{p_n}(A) = \frac{∂}{∂θ_i} tr(p_n(A)) = \frac{∂}{∂θ_i} E_v[v^T p_n(A) v] = E_v\Big[\frac{∂}{∂θ_i} v^T p_n(A) v\Big] ≈ \frac{1}{M} \sum_{k=1}^{M} v^{(k)T} \Big( \sum_{j=0}^{n} b_j \frac{∂w_j^{(k)}}{∂θ_i} \Big),  (5)

where {v^{(k)}}_{k=1}^{M} are i.i.d. Rademacher random vectors and ∂w_j^{(k)}/∂θ_i are given by the following recursive formula:

\frac{∂w_0^{(k)}}{∂θ_i} = 0,  \frac{∂w_1^{(k)}}{∂θ_i} = \frac{2}{b-a} \frac{∂A}{∂θ_i} v^{(k)},  \frac{∂w_{j+1}^{(k)}}{∂θ_i} = \frac{4}{b-a} \frac{∂A}{∂θ_i} w_j^{(k)} + 2\tilde{A} \frac{∂w_j^{(k)}}{∂θ_i} - \frac{∂w_{j-1}^{(k)}}{∂θ_i},  (6)

where \tilde{A} = \frac{2}{b-a} A - \frac{b+a}{b-a} I. We note that in order to compute (6), only matrix-vector products with A and ∂A/∂θ_i are needed. Thus, stochastic gradients of spectral-sums involving polynomials of degree n can be computed in O(Mn(‖A‖_mv d' + \sum_{i=1}^{d'} ‖∂A/∂θ_i‖_mv)) time. As we shall see in Section 5, the complexity can be further reduced in certain cases. The above estimator can be leveraged to approximate gradients for spectral-sums of analytic functions via the truncated Chebyshev series: ∇_θ Σ_f(A(θ)) ≈ ∇_θ Σ_{p_n}(A(θ)). Indeed, [7] recently explored this in the context of Gaussian process kernel learning. However, if f is not a polynomial, the truncated Chebyshev series p_n is not equal to f, so the above estimator is biased, i.e., ∇_θ Σ_f(A) ≠ E[∇_θ v^T p_n(A) v]. Biased stochastic gradients might hurt iterative stochastic optimization as the bias errors accumulate over iterations.

Unbiased stochastic gradients. The estimators (3) and (5) are biased since they approximate an analytic function f via a polynomial p_n of fixed degree. Unless f is a polynomial itself, there exists an x_0 (usually uncountably many) for which f(x_0) ≠ p_n(x_0), so if A has an eigenvalue at x_0 we have Σ_f(A) ≠ Σ_{p_n}(A). Thus, one cannot hope for the estimator (3), let alone the gradient estimator (5), to be unbiased for all matrices A. To avoid deterministic truncation errors, we simply randomize the degree, i.e., design some distribution D on polynomials such that for every x we have E_{p∼D}[p(x)] = f(x). This guarantees E_{p∼D}[tr(p(A))] = Σ_f(A) from the linearity of expectation. We propose to build such a distribution on polynomials by using truncated Chebyshev expansions where the truncation degree is stochastic. Let {q_i}_{i=0}^{∞} ⊆ [0, 1] be a set of numbers such that \sum_{i=0}^{∞} q_i = 1 and \sum_{i=r}^{∞} q_i > 0 for all r ≥ 0. We now define for r = 0, 1, ...

\hat{p}_r(x) := \sum_{j=0}^{r} \frac{b_j}{1 - \sum_{i=0}^{j-1} q_i} T_j\Big(\frac{2}{b-a} x - \frac{b+a}{b-a}\Big).  (7)

Note that \hat{p}_r(x) can be obtained from p_r(x) by re-weighting each coefficient according to {q_i}_{i=0}^{∞}. Next, let n be a random variable taking non-negative integer values, defined according to Pr(n = r) = q_r. Under certain conditions on {q_i}, \hat{p}_n(·) can be used to derive unbiased estimators of Σ_f(A) and ∇_θ Σ_f(A), as stated in the following lemma.

Lemma 1 Suppose that f is an analytic function and \hat{p}_n is the randomized Chebyshev series of f in (7).
Assume that the entries of A are differentiable for θ ∈ C', where C' is an open set containing C, and that for a, b ∈ R all the eigenvalues of A(θ) for θ ∈ C' are in [a, b].^1 For any degree distribution on non-negative integers {q_i ∈ (0, 1) : \sum_{i=0}^{∞} q_i = 1, \sum_{r=i}^{∞} q_r > 0 ∀ i ≥ 0} satisfying lim_{n→∞} (\sum_{i=n+1}^{∞} q_i) \hat{p}_n(x) = 0 for all x ∈ [a, b], it holds that

E_{v,n}[v^T \hat{p}_n(A) v] = Σ_f(A),    E_{v,n}[∇_θ v^T \hat{p}_n(A) v] = ∇_θ Σ_f(A),  (8)

where the expectations are taken over the joint distribution of the random degree n and the Rademacher random vector v (other randomized probing vectors can be used as well).

^1 We assume that all partial derivatives ∂A_{j,k}/∂θ_i for j, k = 1, ..., d, i = 1, ..., d' exist and are continuous.

The proof of Lemma 1 is given in the supplementary material. We emphasize that (8) holds for any distribution {q_i}_{i=0}^{∞} on non-negative integers for which the conditions stated in Lemma 1 hold, e.g., the geometric, Poisson or negative binomial distributions.

3.2 Main result: optimal unbiased Chebyshev gradients

It is a well-known fact that stochastic gradient methods converge faster when the gradients have smaller variances. The variance of our proposed unbiased estimators crucially depends on the choice of the degree distribution, i.e., {q_i}_{i=0}^{∞}. In this section, we design a degree distribution that is variance-optimal in a formal sense. The variance of our proposed degree distribution decays exponentially with the expected degree, and this is crucial for the convergence analysis (Section 4).

The degrees of freedom in choosing {q_i}_{i=0}^{∞} are infinite, which poses a challenge for devising low-variance distributions.
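The randomized-truncation idea of (7)-(8) can be seen already in the scalar case. The sketch below is our own illustration; a geometric truncation distribution q_i = (1 - p)p^i is used purely for convenience, for which the tail weight 1 - \sum_{i<j} q_i equals p^j:

```python
import numpy as np
from numpy.polynomial import chebyshev

rng = np.random.default_rng(0)
a, b = 0.05, 0.95
deg_hi = 60                       # high-order reference expansion of f = log
bj = chebyshev.chebinterpolate(lambda y: np.log(0.5*(b - a)*y + 0.5*(b + a)), deg_hi)

p = 0.5                           # geometric degrees: Pr(n = i) = (1 - p) p^i
w = bj / p ** np.arange(deg_hi + 1)   # reweighted coefficients b_j / (1 - sum_{i<j} q_i)

def p_hat(n, x):
    """Randomized series (7): truncate the reweighted expansion at random degree n."""
    y = (2*x - (b + a)) / (b - a)
    return chebyshev.chebval(y, w[:min(n, deg_hi) + 1])

x = 0.3
ns = rng.geometric(1 - p, size=100000) - 1     # degree support {0, 1, 2, ...}
est = np.mean([p_hat(n, x) for n in ns])
print(est, np.log(x))   # the sample mean approaches f(x): no deterministic truncation bias
```

Each individual draw is a low-degree polynomial evaluation, yet averaging over the random degree recovers f(x) itself rather than any fixed truncation p_n(x).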
Our approach is based on a simplified analysis of the scalar function f, designed so that one can naturally expect the resulting distribution {q_i}_{i=0}^{∞} to also provide low variance for the matrix case (8). We begin by defining the variance of the randomized Chebyshev expansion (7) via the Chebyshev weighted norm as

Var_C(\hat{p}_n) := E_n[‖\hat{p}_n - f‖_C^2],  where  ‖g‖_C^2 := \int_{-1}^{1} \frac{g(\frac{b-a}{2} x + \frac{b+a}{2})^2}{\sqrt{1-x^2}} dx.  (9)

The primary reason why we consider the above variance is that, by utilizing the orthogonality of Chebyshev polynomials, we can derive an analytic expression for it.

Lemma 2 Suppose {b_j}_{j=0}^{∞} are the coefficients of the Chebyshev series for an analytic function f and \hat{p}_n is its randomized Chebyshev expansion (7). Then, it holds that

Var_C(\hat{p}_n) = \frac{π}{2} \sum_{j=1}^{∞} b_j^2 \Big( \frac{\sum_{i=0}^{j-1} q_i}{1 - \sum_{i=0}^{j-1} q_i} \Big).

The proof of Lemma 2 is given in the supplementary material. One can observe from this result that the variance reduces as we assign larger masses to high degrees (due to the exponentially decaying property of b_j in (2)). However, using large degrees increases the computational complexity of computing the estimators. Hence, we aim to design a good distribution given some target complexity, i.e., the expected polynomial degree N. Namely, the minimization of Var_C(\hat{p}_n) should be constrained by \sum_{i=1}^{∞} i q_i = N for some parameter N ≥ 0.

However, minimizing Var_C(\hat{p}_n) subject to the aforementioned constraints might be intractable in general, as the number of variables {q_i}_{i=0}^{∞} is infinite and the algebraic structure of {b_j}_{j=0}^{∞} is arbitrary.
Hence, in order to derive an analytic or closed-form solution, we relax the optimization. In particular, we suggest minimizing the following upper bound on the variance, obtained by utilizing |b_j| ≤ 2U ρ^{-j} from (2):

min_{ {q_i}_{i=0}^{∞} }  \sum_{j=1}^{∞} ρ^{-2j} \Big( \frac{\sum_{i=0}^{j-1} q_i}{1 - \sum_{i=0}^{j-1} q_i} \Big)  subject to  \sum_{i=1}^{∞} i q_i = N,  \sum_{i=0}^{∞} q_i = 1  and  q_i ≥ 0.  (10)

[Figure 1: Chebyshev weighted variance for three degree distributions: negative binomial (neg), Poisson (pois) and the optimal distribution (11) (opt) with the same mean N under (a) f(x) = log x, (b) f(x) = x^{0.5} on [0.05, 0.95] and (c) f(x) = exp(x) on [-1, 1], respectively. Observe that "opt" has the smallest variance among all distributions. (d) Comparison between b_j^2 and c ρ^{-2j} for some constant c > 0 under f(x) = log x.]

Figure 1(d) empirically demonstrates that b_j^2 ≈ c ρ^{-2j} for a constant c > 0 under f(x) = log x, in which case the above relaxed optimization (10) is nearly tight. The next theorem establishes that (10) has a closed-form solution, despite having infinitely many degrees of freedom. The theorem is applicable when knowing a ρ > 1 and a bound U such that the function f is analytic with |f(\frac{b-a}{2} z + \frac{b+a}{2})| ≤ U in the complex region bounded by the ellipse with foci +1, -1 whose sum of major and minor semi-axis lengths equals ρ.

Theorem 3 Let K = max{0, N - ⌊ρ/(ρ-1)⌋}. The optimal solution {q*_i}_{i=0}^{∞} of (10) is

q*_i = 0 for i < K,    q*_K = 1 - (N - K)(ρ - 1)ρ^{-1},    q*_i = (N - K)(ρ - 1)^2 ρ^{-(i-K)-1} for i > K,  (11)

and it satisfies the unbiasedness condition in Lemma 1, i.e., lim_{n→∞} (\sum_{i=n+1}^{∞} q*_i) \hat{p}_n(x) = 0.

The proof of Theorem 3 is given in the supplementary material. Observe that a degree smaller than K is never sampled under {q*_i}, which means that the corresponding unbiased estimator (7) combines a deterministic series of degree K with randomized terms of higher degrees. Due to the geometric decay of {q*_i}, large degrees are sampled with exponentially small probability.

The optimality of the proposed distribution (11) (labeled opt) is illustrated by comparing it numerically to other distributions, negative binomial (labeled neg) and Poisson (labeled pois), on three analytic functions: log x, √x and exp(x). Figures 1(a) to 1(c) show the weighted variance (9) of these distributions, with means commonly set from N = 5 to 100. Observe that the proposed distribution has order-of-magnitude smaller variance compared to the other tested distributions.

4 Stochastic Chebyshev gradient descent algorithms

In this section, we leverage unbiased gradient estimators based on (8), in conjunction with our optimal degree distribution (11), to design computationally efficient methods for solving (4).
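The closed-form optimal distribution (11) can be tabulated and sampled directly. A minimal sketch (our own code; we write the exponent for i > K as -(i-K)-1, under which the masses sum to one with mean N, and `tail` truncates the infinite support):

```python
import numpy as np

def optimal_degree_dist(N, rho, tail=300):
    """Tabulate q*_0..q*_tail from (11): zero below K, a point mass at K,
    and a geometrically decaying tail with ratio 1/rho above K."""
    K = max(0, N - int(np.floor(rho / (rho - 1))))
    q = np.zeros(tail + 1)
    q[K] = 1.0 - (N - K) * (rho - 1.0) / rho
    i = np.arange(K + 1, tail + 1)
    q[i] = (N - K) * (rho - 1.0) ** 2 * rho ** (-(i - K) - 1.0)
    return q

q = optimal_degree_dist(N=20, rho=1.6)
mean = float(np.arange(len(q)) @ q)
print(float(q.sum()), mean)          # ~1 and ~20 (up to the truncated tail mass)
rng = np.random.default_rng(0)
degree = int(rng.choice(len(q), p=q / q.sum()))   # one random degree for the estimator
```

With N = 20 and ρ = 1.6, for instance, K = 18, so every sampled degree is at least 18 and degrees beyond K become exponentially unlikely.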
In particular, we propose to randomly sample a degree n from (11) and estimate the gradient via the Monte-Carlo method:

\frac{∂}{∂θ_i} Σ_f(A) = E\Big[\frac{∂}{∂θ_i} v^T \hat{p}_n(A) v\Big] ≈ \frac{1}{M} \sum_{k=1}^{M} v^{(k)T} \Big( \sum_{j=0}^{n} \frac{b_j}{1 - \sum_{i=0}^{j-1} q*_i} \frac{∂w_j^{(k)}}{∂θ_i} \Big),  (12)

where ∂w_j^{(k)}/∂θ_i can be computed using a Rademacher vector v^{(k)} and the recursive relation (6).

4.1 Stochastic Gradient Descent (SGD)

In this section, we consider the use of projected SGD in conjunction with (12) to numerically solve the optimization (4). In the following, we provide a pseudo-code description of our proposed algorithm.

Algorithm 1 SGD for solving (4)
1: Input: number of iterations T, number of Rademacher vectors M, expected degree N and initial parameter θ^(0)
2: for t = 0 to T - 1 do
3:   Draw M Rademacher random vectors {v^(k)}_{k=1}^{M} and a random degree n from (11) given N
4:   Compute ψ^(t) from (12) at θ^(t) using {v^(k)}_{k=1}^{M} and n
5:   Obtain a proper step-size η_t
6:   θ^(t+1) ← Π_C(θ^(t) - η_t (ψ^(t) + ∇g(θ^(t)))), where Π_C(·) is the projection mapping onto C
7: end for

In order to analyze the convergence rate, we assume that (A0) all eigenvalues of A(θ) for θ ∈ C' are in the interval [a, b] for some open C' ⊇ C, (A1) Σ_f(A(θ)) + g(θ) is continuous and α-strongly convex with respect to θ, and (A2) A(θ) is L_A-Lipschitz for ‖·‖_F, and g(θ) is L_g-Lipschitz and β_g-smooth. The formal definitions of the assumptions are in the supplementary material. These assumptions hold for many target applications, including the ones explored in Section 5. In particular, we note that assumption (A0) can often be satisfied with a careful choice of C. It has been shown that (projected) SGD has a sublinear convergence rate for a smooth strongly-convex objective if the variance of the gradient estimates is uniformly bounded [23, 21]. Motivated by this, we first derive the following upper bound on the variance of the gradient estimators under the optimal degree distribution (11).

Lemma 4 Suppose that assumptions (A0)-(A2) hold and A(θ) is L_nuc-Lipschitz for ‖·‖_nuc. Let ψ be the gradient estimator (12) at θ ∈ C using Rademacher vectors {v^(k)}_{k=1}^{M} and a degree n drawn from the optimal distribution (11). Then,

E_{v,n}[‖ψ‖_2^2] ≤ \Big( \frac{2L_A^2}{M} + d'L_nuc^2 \Big)\big( C_1 + C_2 N^4 ρ^{-2N} \big),

where C_1, C_2 > 0 are constants independent of M, N.

The above lemma allows us to provide a sublinear convergence rate for Algorithm 1.

Theorem 5 Suppose that assumptions (A0)-(A2) hold and A(θ) is L_nuc-Lipschitz for ‖·‖_nuc. If one chooses the step-size η_t = 1/(αt), then it holds that

E[‖θ^(T) - θ*‖_2^2] ≤ \frac{4}{α^2 T} max\Big( L_g^2, \Big( \frac{2L_A^2}{M} + d'L_nuc^2 \Big)\Big( C_1 + \frac{C_2 N^4}{ρ^{2N}} \Big) \Big),

where C_1, C_2 > 0 are constants independent of M, N, and θ* ∈ C is the global optimum of (4).

The proofs of Lemma 4 and Theorem 5 are given in the supplementary material. Note that larger M, N provide better convergence but increase the computational complexity.
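To illustrate the unbiased gradient estimator (12) together with the recursion (6) end-to-end, the sketch below estimates d/dθ Σ_exp(A(θ)) for a toy single-parameter family A(θ) = θB and checks it against the exact value tr(B exp(θB)). This is entirely our own toy setup; a geometric degree distribution is used in place of (11) purely to keep the sketch short (Lemma 1 admits any distribution satisfying its conditions, while (11) would further reduce the variance):

```python
import numpy as np
from numpy.polynomial import chebyshev

rng = np.random.default_rng(0)
d, theta = 50, 0.7
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam_B = rng.uniform(0.1, 1.0, size=d)
B = (Q * lam_B) @ Q.T                    # fixed symmetric B; A(theta) = theta * B

a, b = 0.0, 1.0                          # eigenvalues of theta*B stay inside [a, b]
deg_hi = 40
bj = chebyshev.chebinterpolate(lambda y: np.exp(0.5*(b - a)*y + 0.5*(b + a)), deg_hi)
p = 0.6                                  # geometric truncation: Pr(n = i) = (1-p) p^i
c = bj / p ** np.arange(deg_hi + 1)      # reweighted coefficients b_j / (1 - Q_{j-1})

def grad_estimate(theta, M=100):
    """One sample of (12) via the recursion (6), with dA/dtheta = B."""
    A = theta * B
    scale, shift = 2.0/(b - a), (b + a)/(b - a)
    At = lambda X: scale * (A @ X) - shift * X
    n = min(int(rng.geometric(1 - p)) - 1, deg_hi)
    V = rng.choice([-1.0, 1.0], size=(d, M))
    w_prev, w_cur = V, At(V)                              # w_0, w_1
    dw_prev, dw_cur = np.zeros_like(V), scale * (B @ V)   # dw_0 = 0, dw_1
    acc = c[0] * dw_prev
    if n >= 1:
        acc = acc + c[1] * dw_cur
    for j in range(1, n):                                 # build the degree-(j+1) terms
        w_next = 2.0 * At(w_cur) - w_prev
        dw_next = 2.0 * scale * (B @ w_cur) + 2.0 * At(dw_cur) - dw_prev
        acc += c[j + 1] * dw_next
        w_prev, w_cur, dw_prev, dw_cur = w_cur, w_next, dw_cur, dw_next
    return float(np.sum(V * acc)) / M

est = float(np.mean([grad_estimate(theta) for _ in range(300)]))
exact = float(np.sum(lam_B * np.exp(theta * lam_B)))      # tr(B exp(theta B))
print(est, exact)
```

Averaged over many draws, the estimate matches the exact gradient, while each individual draw only needs a handful of matrix-vector products.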
The convergence is also faster with smaller d', which is also evident in our experiments (see Section 5).

4.2 Stochastic Variance Reduced Gradient (SVRG)

In this section, we introduce a more advanced stochastic method using a further variance reduction technique, inspired by the stochastic variance reduced gradient method (SVRG) [14]. The full description of the proposed SVRG scheme for solving the optimization (4) is given below.

Algorithm 2 SVRG for solving (4)
1: Input: number of inner/outer iterations T, S, number of Rademacher vectors M, expected degree N, step-size η and initial parameter θ^(0) ∈ C
2: θ̃^(1) ← θ^(0)
3: for s = 1 to S do
4:   μ̃^(s) ← ∇Σ_f(A(θ̃^(s))) and θ^(0) ← θ̃^(s)
5:   for t = 0 to T - 1 do
6:     Draw M Rademacher random vectors {v^(k)}_{k=1}^{M} and a random degree n from (11)
7:     Compute ψ^(t), ψ̃^(s) from (12) at θ^(t) and θ̃^(s), respectively, using {v^(k)}_{k=1}^{M} and n
8:     θ^(t+1) ← Π_C(θ^(t) - η(ψ^(t) - ψ̃^(s) + μ̃^(s) + ∇g(θ^(t))))
9:   end for
10:  θ̃^(s+1) ← (1/T) \sum_{t=1}^{T} θ^(t)
11: end for

The main idea of SVRG is to subtract a mean-zero random variable from the original stochastic gradient estimator, where the randomness between them is shared. The SVRG algorithm was originally designed for optimizing finite-sum objectives, i.e., \sum_i f_i(x), whose randomness comes from the index i. In our case, on the other hand, the randomness comes from the polynomial degrees and trace probing vectors used for optimizing objectives involving spectral-sums. This leads us to use the same randomness in {v^(k)}_{k=1}^{M} and n for estimating both ψ^(t) and ψ̃^(s) in line 7 of Algorithm 2. We remark that, unlike SGD, Algorithm 2 requires the expensive computation of exact gradients every T iterations.
The next theorem establishes that if one sets T appropriately, only O(1) exact gradient computations are required (for a fixed suboptimality), since we have a linear convergence rate.

[Figure 2: Matrix completion results under (a) MovieLens 1M and (b) MovieLens 10M. (c) Algorithm 1 (SGD) in MovieLens 1M under other distributions such as negative binomial (neg) and Poisson (pois). (d) SGD and SGD-DET under N = 10, 30.]

Theorem 6 Suppose that assumptions (A0)-(A2) hold and A(θ) is β_A-smooth for ‖·‖_F. Let

β^2 = 2β_g^2 + \Big( \frac{L_A^4 + β_A^2}{M} + L_A^4 \Big)\Big( D_1 + \frac{D_2 N^8}{ρ^{2N}} \Big)

for some constants D_1, D_2 > 0 independent of M, N. Choose η = α/(7β^2) and T ≥ 25β^2/α^2. Then, it holds that

E[‖θ̃^(S) - θ*‖_2^2] ≤ r^S E[‖θ^(0) - θ*‖_2^2],

where 0 < r < 1 is some constant and θ* ∈ C is the global optimum of (4).

The proof of the above theorem is given in the supplementary material, where we utilize the recent analysis of SVRG for sums of smooth non-convex objectives [9, 1]. The key additional component in our analysis is to characterize β > 0 in terms of M, N so that the unbiased gradient estimator (12) is β-smooth in expectation under the optimal degree distribution (11).

5 Applications

In this section, we apply the proposed methods to two machine learning tasks: matrix completion and learning Gaussian processes. These correspond to minimizing spectral-sums Σ_f with f(x) = x^{1/2} and f(x) = log x, respectively.
We evaluate our methods on real-world datasets for both experiments.

5.1 Matrix completion

The goal is to recover a low-rank matrix θ ∈ [0,5]^{d×r} when only a few of its entries are given. Since the rank function is neither differentiable nor convex, relaxations such as the Schatten-p norm have been used in the respective optimization formulations. In particular, we consider the smoothed nuclear norm (i.e., Schatten-1 norm) minimization [18, 20], which corresponds to

min_{θ ∈ [0,5]^{d×r}}  tr(A^{1/2}) + λ Σ_{(i,j)∈Ω} (θ_{i,j} − R_{i,j})²

where A = θθ⊤ + εI, R ∈ [0,5]^{d×r} is a given matrix with missing entries, Ω indicates the positions of known entries, λ is a weight parameter and ε > 0 is a smoothing parameter. Observe that ‖A‖_mv = ‖θ‖_mv = O(dr), and the derivative estimation in this case can be amortized to O(dM(N² + Nr)) operations. More details on this and our experimental settings are given in the supplementary material.

We use the MovieLens 1M and 10M datasets [12] (corresponding to d = 3,706 and d = 10,677, respectively) and benchmark gradient descent (GD), Algorithm 1 (SGD) and Algorithm 2 (SVRG). We also consider a variant of SGD with a deterministic polynomial degree, referred to as SGD-DET, which uses biased gradient estimators. We report the results for MovieLens 1M in Figure 2(a) and for 10M in Figure 2(b). On both datasets, SGD-DET performs poorly due to its biased gradient estimators. On the other hand, SGD converges much faster and outperforms GD, although SGD on 10M converges much more slowly than on 1M due to the larger dimension d′ = dr (see Theorem 5). Observe that SVRG is the fastest: compared to GD, it is about 2 times faster to reach RMSE 1.5 on MovieLens 1M and up to 6 times faster to reach RMSE 1.8 on MovieLens 10M, as shown in Figure 2(b).
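To make the estimator behind these methods concrete, here is a minimal sketch combining the two ingredients the paper builds on: Hutchinson's Rademacher trace estimator and a randomly truncated Chebyshev expansion, applied to tr(A^{1/2}) for a small symmetric matrix with spectrum in [0, 1]. The truncated-geometric degree distribution below is only a simple stand-in for the paper's variance-optimal distribution (11), and all function names are our own:

```python
import numpy as np

def cheb_coeffs(f, deg):
    # Chebyshev interpolation coefficients of f on [-1, 1],
    # sampled at Chebyshev points of the first kind
    k = np.arange(deg + 1)
    theta = np.pi * (k + 0.5) / (deg + 1)
    c = 2.0 / (deg + 1) * (np.cos(np.outer(k, theta)) @ f(np.cos(theta)))
    c[0] /= 2.0
    return c

def stoch_trace(A, f, M=1000, J=50, p=0.1, rng=None):
    """Unbiased estimate of tr(f(A)) for symmetric A with spectrum in [0, 1],
    up to the (small) degree-J interpolation error.  Each sample draws a
    Rademacher probe and a random truncation degree; importance weights
    1 / P(N >= j) keep the truncated sum unbiased."""
    rng = rng or np.random.default_rng()
    d = A.shape[0]
    B = 2.0 * A - np.eye(d)                  # affine map: spectrum [0,1] -> [-1,1]
    c = cheb_coeffs(lambda t: f((t + 1.0) / 2.0), J)
    total = 0.0
    for _ in range(M):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe vector
        n = min(rng.geometric(p) - 1, J)     # truncated-geometric degree
        est = c[0] * (v @ v)                 # j = 0 term, T_0 = I
        t_prev, t_cur = v, B @ v             # T_0 v, T_1 v
        for j in range(1, n + 1):
            # P(N >= j) = (1 - p)^j for the geometric truncation
            est += c[j] * (v @ t_cur) / (1.0 - p) ** j
            t_prev, t_cur = t_cur, 2.0 * (B @ t_cur) - t_prev
        total += est
    return total / M

# sanity check against the exact value on a small synthetic matrix
rng = np.random.default_rng(1)
lam = np.linspace(0.1, 0.9, 20)
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
Amat = Q @ np.diag(lam) @ Q.T
exact = np.sum(np.sqrt(lam))                 # tr(A^{1/2})
approx = stoch_trace(Amat, np.sqrt, M=2000, J=50, p=0.1, rng=rng)
```

Since E[vᵀT_j(B)v] = tr(T_j(B)) for Rademacher v, and the importance weight cancels the probability of reaching each degree, the sample mean is an unbiased estimate of the degree-J Chebyshev approximation of tr(f(A)); the choice of truncation distribution controls only the variance, which is the quantity the paper optimizes.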
The gap between SVRG and GD is expected to increase for larger datasets. We also test SGD under other degree distributions, negative binomial (neg) and Poisson (pois), choosing their parameters so that their means equal N = 15. As reported in Figure 2(c), the other distributions have relatively large variances, so they converge more slowly than the optimal distribution (opt). In Figure 2(d), we compare SGD-DET with SGD under the optimal distribution for (mean) polynomial degrees N = 10, 30. Observe that a larger degree (N = 30) reduces the bias error of SGD-DET, while SGD achieves similar error regardless of the degree. The above results confirm that unbiased gradient estimation and our degree distribution (11) are crucial for SGD.

5.2 Learning for Gaussian process regression

Next, we apply our method to hyperparameter learning for Gaussian process (GP) regression. Given training data {x_i ∈ R^ℓ}_{i=1}^d with corresponding outputs y ∈ R^d, the goal of GP regression is to learn a hyperparameter θ for predicting the output of a new/test input. The hyperparameter θ determines the kernel matrix A(θ) ∈ S^{d×d} of the training data {x_i}_{i=1}^d (see [22]).
One can find a good hyperparameter by minimizing the negative log-marginal likelihood with respect to θ:

L := −log p(y | {x_i}_{i=1}^d) = (1/2) y⊤A(θ)⁻¹y + (1/2) log det A(θ) + (d/2) log 2π.

For handling large-scale datasets, [29] proposed the structured kernel interpolation framework, assuming θ = [θ_i] ∈ R³ and

A(θ) = W K W⊤ + θ₁² I,   K_{i,j} = θ₂² exp(−‖x_i − x_j‖₂² / 2θ₃²),

where W ∈ R^{d×r} is some sparse matrix and K ∈ R^{r×r} is a dense kernel with r ≪ d. Specifically, in [29], r “inducing” points are selected and the entries of W are computed via interpolation with the inducing points. Under this framework, matrix-vector multiplications with A can be performed even faster, requiring ‖A‖_mv = ‖W‖_mv + ‖K‖_mv = O(d + r²) operations. From ‖A‖_mv = ‖∂A/∂θ_i‖_mv and d′ = 3, the complexity of computing the gradient estimate (12) becomes O(MN(d + r²)). If we choose M, N, r = O(1), the complexity reduces to O(d). A more detailed problem description and our experimental settings are given in the supplementary material.

We benchmark GP regression on the natural sound dataset used in [29] and the Szeged humid dataset [5], corresponding to d = 35,000 and d = 16,930, respectively. Recently, [7] utilized an approximation to derivatives of the log-determinant based on stochastic Lanczos quadrature [27] (LANCZOS). We compare it with Algorithm 1 (SGD), which utilizes unbiased gradient estimators; SVRG requires the exact gradient computation at least once, which is intractable in these cases.
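For intuition, the negative log-marginal likelihood above can be evaluated exactly at small scale. The sketch below computes L via a Cholesky factorization for the plain three-parameter RBF kernel (i.e., without the interpolation matrix W, so r = d); the function name and the parameter layout theta = (θ₁, θ₂, θ₃) are our own conventions. This cubic-cost computation is precisely what the paper's stochastic estimators of the log-determinant and its gradient avoid:

```python
import numpy as np

def gp_nll(theta, X, y):
    # Exact negative log-marginal likelihood for GP regression with
    # A(theta) = K + theta_1^2 * I, K_ij = theta_2^2 * exp(-||x_i - x_j||^2 / (2 theta_3^2)).
    d = len(y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    A = theta[1] ** 2 * np.exp(-sq / (2.0 * theta[2] ** 2)) + theta[0] ** 2 * np.eye(d)
    L = np.linalg.cholesky(A)                             # A = L L^T, O(d^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # A^{-1} y via two triangular solves
    logdet = 2.0 * np.log(np.diag(L)).sum()               # log det A from the factor
    return 0.5 * (y @ alpha + logdet + d * np.log(2.0 * np.pi))

# toy usage on random data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 2)), rng.normal(size=8)
nll = gp_nll(np.array([0.5, 1.0, 1.0]), X, y)
```

The Cholesky factor yields both the quadratic term and the log-determinant in one factorization; at large d this is the bottleneck that the stochastic Chebyshev gradient estimators replace with O(MN(d + r²)) matrix-vector products.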
As reported in Figure 3, SGD converges faster than LANCZOS on both datasets: it runs 2 times faster to reach RMSE 0.0375 on the sound dataset, and on the humid dataset LANCZOS can often get stuck at a local optimum, while SGD avoids this due to its use of unbiased gradient estimators.

Figure 3: Hyperparameter learning for Gaussian processes in modeling (a) the sound dataset and (b) the Szeged humid dataset, comparing SGD to stochastic Lanczos quadrature (LANCZOS).

6 Conclusion

We proposed an optimal-variance unbiased estimator for spectral-sums and their gradients. We applied our estimator in the SGD and SVRG frameworks, and analyzed their convergence. The proposed optimal degree distribution is a crucial component of the analysis. We believe that the proposed stochastic methods are of broader interest in many machine learning tasks involving spectral-sums.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2018R1A5A1059921). Haim Avron acknowledges the support of the Israel Science Foundation (grant no. 1272/17).

References

[1] Allen-Zhu, Zeyuan and Yuan, Yang. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning (ICML), pp. 1080–1089, 2016.

[2] Avron, H. and Toledo, S. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):8, 2011.

[3] Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.

[4] Broniatowski, Michel and Celant, Giorgio. Some overview on unbiased interpolation and extrapolation designs.
arXiv preprint arXiv:1403.5113, 2014.

[5] Budincsevity, Norbert. Weather in Szeged 2006-2016. https://www.kaggle.com/budincsevity/szeged-weather/data, 2016.

[6] Davidson, Ernest R. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. Journal of Computational Physics, 17(1):87–94, 1975.

[7] Dong, Kun, Eriksson, David, Nickisch, Hannes, Bindel, David, and Wilson, Andrew G. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, pp. 6330–6340, 2017.

[8] Friedlander, Michael P. and Macêdo, Ives. Low-rank spectral optimization via gauge duality. SIAM Journal on Scientific Computing, 38(3):A1616–A1638, 2016. doi: 10.1137/15M1034283. URL https://doi.org/10.1137/15M1034283.

[9] Garber, Dan and Hazan, Elad. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.

[10] Han, Insu, Malioutov, Dmitry, and Shin, Jinwoo. Large-scale log-determinant computation through stochastic Chebyshev expansions. In International Conference on Machine Learning, pp. 908–917, 2015.

[11] Han, Insu, Malioutov, Dmitry, Avron, Haim, and Shin, Jinwoo. Approximating spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM Journal on Scientific Computing, 39(4):A1558–A1585, 2017.

[12] Harper, F. Maxwell and Konstan, Joseph A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[13] Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18(3):1059–1076, 1989.

[14] Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp.
315–323, 2013.

[15] Koltchinskii, Vladimir and Xia, Dong. Optimal estimation of low rank density matrices. Journal of Machine Learning Research, 16(1):1757–1792, January 2015. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2789272.2886806.

[16] Lee, Yin Tat, Sidford, Aaron, and Wong, Sam Chiu-wai. A faster cutting plane method and its implications for combinatorial and convex optimization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 1049–1065. IEEE, 2015.

[17] Lewis, A. S. Derivatives of spectral functions. Mathematics of Operations Research, 21(3):576–588, 1996. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/3690298.

[18] Lu, Canyi, Lin, Zhouchen, and Yan, Shuicheng. Smoothed low rank and sparse matrix recovery by iteratively reweighted least squares minimization. IEEE Transactions on Image Processing, 24(2):646–654, 2015.

[19] Mason, John C. and Handscomb, David C. Chebyshev Polynomials. CRC Press, 2002.

[20] Mohan, Karthik and Fazel, Maryam. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13(Nov):3441–3473, 2012.

[21] Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[22] Rasmussen, Carl Edward. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pp. 63–71. Springer, 2004.

[23] Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

[24] Roosta-Khorasani, Farbod and Ascher, Uri. Improved bounds on sample size for implicit matrix trace estimators.
Foundations of Computational Mathematics, 15(5):1187–1212, 2015.

[25] Adams, Ryan P., Pennington, Jeffrey, Johnson, Matthew J., Smith, Jamie, Ovadia, Yaniv, Patton, Brian, and Saunderson, James. Estimating the spectral density of large implicit matrices. arXiv preprint arXiv:1802.03451, 2018.

[26] Trefethen, Lloyd N. Approximation Theory and Approximation Practice. SIAM, 2013.

[27] Ubaru, Shashanka, Chen, Jie, and Saad, Yousef. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.

[28] Vinck, Martin, Battaglia, Francesco P., Balakirsky, Vladimir B., Vinck, A. J. Han, and Pennartz, Cyriel M. A. Estimation of the entropy based on its polynomial representation. Physical Review E, 85(5):051139, 2012.

[29] Wilson, Andrew and Nickisch, Hannes. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775–1784, 2015.

[30] Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[31] Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.