{"title": "Dimension-Free Iteration Complexity of Finite Sum Optimization Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 3540, "page_last": 3548, "abstract": "Many canonical machine learning problems boil down to a convex optimization problem with a finite sum structure. However, whereas much progress has been made in developing faster algorithms for this setting, the inherent limitations of these problems are not satisfactorily addressed by existing lower bounds. Indeed, current bounds focus on first-order optimization algorithms, and only apply in the often unrealistic regime where the number of iterations is less than $\\cO(d/n)$ (where $d$ is the dimension and $n$ is the number of samples). In this work, we extend the framework of Arjevani et al. \\cite{arjevani2015lower,arjevani2016iteration} to provide new lower bounds, which are dimension-free, and go beyond the assumptions of current bounds, thereby covering standard finite sum optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as stochastic coordinate-descent methods, such as SDCA and accelerated proximal SDCA.", "full_text": "Dimension-Free Iteration Complexity of Finite Sum\n\nOptimization Problems\n\nYossi Arjevani\n\nWeizmann Institute of Science\n\nRehovot 7610001, Israel\n\nyossi.arjevani@weizmann.ac.il\n\nOhad Shamir\n\nWeizmann Institute of Science\n\nRehovot 7610001, Israel\n\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nMany canonical machine learning problems boil down to a convex optimization\nproblem with a \ufb01nite sum structure. However, whereas much progress has been\nmade in developing faster algorithms for this setting, the inherent limitations of\nthese problems are not satisfactorily addressed by existing lower bounds. 
Indeed, current bounds focus on first-order optimization algorithms, and only apply in the often unrealistic regime where the number of iterations is less than O(d/n) (where d is the dimension and n is the number of samples). In this work, we extend the framework of Arjevani et al. [3, 5] to provide new lower bounds, which are dimension-free, and go beyond the assumptions of current bounds, thereby covering standard finite sum optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as stochastic coordinate-descent methods, such as SDCA and accelerated proximal SDCA.

1 Introduction

Many machine learning tasks reduce to Finite Sum Minimization (FSM) problems of the form
$$\min_{w\in\mathbb{R}^d} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$
where the f_i are L-smooth and µ-strongly convex. In recent years, a major breakthrough was made when a linear convergence rate was established for this setting (SAG [16] and SDCA [18]), and since then, many methods have been developed to achieve better convergence rates. However, whereas a large body of literature is devoted to upper bounds, the optimal convergence rate with respect to the problem parameters is not yet settled.

Let us discuss existing lower bounds for this setting, along with their shortcomings, in detail. One approach to obtaining lower bounds for this setting is to consider the average of carefully handcrafted functions defined on n disjoint sets of variables. This approach was taken by Agarwal and Bottou [1], who derived a lower bound for FSM under the first-order oracle model (see Nemirovsky and Yudin [12]). In this model, optimization algorithms are assumed to access a given function by issuing queries to an external first-order oracle procedure. Upon receiving a query point in the problem domain, the oracle reports the corresponding function value and gradient. 
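To make the oracle model concrete, here is a minimal sketch (our own illustration, not code from the paper) of a first-order oracle for a finite-sum quadratic objective; the class and method names are hypothetical.

```python
import numpy as np

# Minimal sketch of the first-order oracle model for a finite-sum
# objective F(w) = (1/n) sum_i f_i(w) with quadratic components
# f_i(w) = 0.5 * w @ A_i @ w - b_i @ w.  Names are illustrative only.

class FirstOrderOracle:
    def __init__(self, As, bs):
        self.As, self.bs = As, bs   # data defining each component f_i
        self.n = len(As)

    def query(self, w, i):
        """Given a query point w, report the value and gradient of f_i at w."""
        A, b = self.As[i], self.bs[i]
        return 0.5 * w @ A @ w - b @ w, A @ w - b

rng = np.random.default_rng(0)
d, n = 3, 4
As = [(i + 1.0) * np.eye(d) for i in range(n)]   # simple SPD components
bs = [rng.standard_normal(d) for _ in range(n)]
oracle = FirstOrderOracle(As, bs)
w0 = rng.standard_normal(d)
value, grad = oracle.query(w0, 2)
```

An algorithm in this model never sees `As` or `bs` directly; it only observes the (value, gradient) pairs returned by `query`.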
The construction used by Agarwal and\nBottou consisted of n different quadratic functions which are adversarially determined based on the\n\ufb01rst-order queries being issued during the optimization process. The resulting bound in this case does\nnot apply to stochastic algorithms, rendering it invalid for current state-of-the-art methods. Another\ninstantiation of this approach was made by Lan [10] who considered n disjoint copies of a quadratic\nfunction proposed by Nesterov in [13, Section 2.1.2]. This technique is based on the assumption that\nany iterate generated by the optimization algorithm lies in the span of previously acquired gradients.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fThis assumption is rather permissive and is satis\ufb01ed by many \ufb01rst-order algorithms, e.g., SAG and\nSAGA [6]. However, the lower bound stated in the paper faces limitations in a few aspects. First, the\nvalidity of the derived bound is restricted to d/n iterations. In many datasets, even if d, n are very\nlarge, d/n is quite small. Accordingly, the admissible regime of the lower bound is often not very\ninteresting. Secondly, it is not clear how the proposed construction can be expressed as a Regularized\nLoss Minimization (RLM) problem with linear predictors (see Section 4). This suggests that methods\nspecialized in dual RLM problems, such as SDCA and accelerated proximal SDCA [19], can not be\naddressed by this bound. Thirdly, at least the formal theorem requires assumptions (such as querying\nin the span of previous gradients, or sampling from a \ufb01xed distribution over the individual functions),\nwhich are not met by some state-of-the-art methods, such as coordinate descent methods, SVRG [9]\nand without-replacements sampling algorithms [15].\nAnother relevant approach in this setting is to model the functional form of the update rules. This\napproach was taken by Arjevani et al. 
[3], where new iterates are assumed to be generated by a recurrent application of some fixed linear transformation. Although this method applies to SDCA and produces a tight lower bound of $\tilde{\Omega}((n + 1/\lambda)\ln(1/\epsilon))$, its scope is rather limited. In recent work, Arjevani and Shamir [5] considerably generalized parts of this framework by introducing the class of first-order oblivious optimization algorithms, whose step sizes are scheduled regardless of the function under consideration, and deriving tight lower bounds for general smooth convex minimization problems (note that obliviousness rules out, e.g., quasi-Newton methods, where gradients obtained at each iteration are multiplied by matrices which strictly depend on the function at hand; see Definition 2 below).

In this work, building upon the framework of oblivious algorithms, we take a somewhat more abstract point of view which allows us to easily incorporate coordinate-descent methods, as well as stochastic algorithms. Our framework subsumes the vast majority of optimization methods for machine learning problems; in particular, it applies to SDCA, accelerated proximal SDCA, SDCA without duality [17], SAG, SAGA, SVRG and acceleration schemes [7, 11], as well as to a large number of methods for smooth convex optimization (i.e., FSM with n = 1), e.g., (stochastic) Gradient Descent (GD), Accelerated Gradient Descent (AGD, [13]), the Heavy-Ball method (HB, [14]) and stochastic coordinate descent.

Under this structural assumption, we derive lower bounds for FSM (1), according to which the iteration complexity, i.e., the number of iterations required to obtain an ε-optimal solution in terms of function value, is at least¹
$$\tilde{\Omega}\left(n + \sqrt{n(\kappa - 1)}\ln(1/\epsilon)\right), \qquad (2)$$
where κ denotes the condition number of F(w) (that is, the smoothness parameter over the strong convexity parameter). 
To the best of our knowledge, this is the first tight lower bound to address all the algorithms mentioned above. Moreover, our bound is dimension-free and thus applies to settings in machine learning which are not covered by the current literature (e.g., when n is Ω(d)). We also derive a dimension-free nearly-optimal lower bound for smooth convex optimization of
$$\Omega\left((L(\delta - 2)/\epsilon)^{1/\delta}\right),$$
for any δ ∈ (2, 4), which holds for any oblivious stochastic first-order algorithm. It should be noted that our lower bounds remain valid under any source of randomness which may be introduced into the optimization process (by the oracle or by the optimization algorithm). In particular, our bounds hold in cases where the variance of the iterates produced by the algorithm converges to zero, a highly desirable property of optimization algorithms in this setting.

Two implications can be readily derived from this lower bound. First, obliviousness forms a real barrier for optimization algorithms: whereas non-oblivious algorithms may achieve a super-linear convergence rate at later stages of the optimization process (e.g., quasi-Newton methods), or practically zero error after Θ(d) iterations (e.g., the Center of Gravity method, MCG), oblivious algorithms are bound to linear convergence indefinitely, as demonstrated by Figure 1. We believe this indicates that major progress can be made in solving machine learning problems by employing non-oblivious methods in settings where d ≪ n. 
It should be further noted that another major advantage of non-oblivious algorithms is their ability to obtain optimal convergence rates without an explicit specification of the problem parameters (e.g., [5, Section 4.1]).

¹Following standard conventions, here tilde notation hides logarithmic factors in the parameters of a given class of optimization problems, e.g., the smoothness parameter and the number of components.

Figure 1: Comparison of first-order methods based on the function used by Nesterov in [13, Section 2.1.2] over R^500. Whereas L-BFGS (with a memory size of 100) achieves a super-linear convergence rate after Θ(d) iterations, the convergence rate of GD, AGD and HB remains linear, as predicted by our bound.

Secondly, many practitioners have noticed that oftentimes sampling the individual functions without replacement at each iteration performs better than sampling with replacement (e.g., [18, 15]; see also [8, 20]). The fact that our lower bound holds regardless of how the individual functions are sampled, and is attained using with-replacement sampling (e.g., accelerated proximal SDCA), implies that, in terms of iteration complexity, one should expect to gain no more than log factors in the problem parameters when using one method over the other (it is noteworthy that when comparing with- and without-replacement sampling, apart from iteration complexity, other computational resources, such as limited communication in distributed settings [4], may significantly affect the overall runtime).

2 Framework

2.1 Motivation

Due to difficulties which arise when studying the complexity of general optimization problems under discrete computational models, it is common to analyze the computational hardness of optimization algorithms by modeling the way a given algorithm interacts with the problem instances (without limiting its computational resources). 
In the seminal work of Nemirovsky and Yudin [12], it is shown that algorithms which access the function at hand exclusively by querying a first-order oracle require at least
$$\tilde{\Omega}\left(\min\{d, \sqrt{\kappa}\}\ln(1/\epsilon)\right),\;\; \mu > 0; \qquad \tilde{\Omega}\left(\min\{d\ln(1/\epsilon), \sqrt{L/\epsilon}\}\right),\;\; \mu = 0 \qquad (3)$$
oracle calls to obtain an ε-optimal solution, where L and µ are the smoothness and the strong convexity parameter, respectively (note that here, and throughout this section, we refer to FSM problems with n = 1). This lower bound is tight; its dimension-free part is attained by Nesterov's well-known accelerated gradient descent, and the remaining part by MCG. The fact that this approach is based on information considerations alone is very appealing and renders it valid for any first-order algorithm. However, by discarding the resources needed to execute a given algorithm, in particular the per-iteration cost (in time and space), the complexity boundaries drawn by this approach are too crude from a computational point of view. Indeed, the per-iteration cost of MCG, the only known method with oracle complexity of O(d ln(1/ε)), is excessively high, rendering it prohibitive for high-dimensional problems.

We are thus led to the following question: how well can a given optimization algorithm perform, assuming that its per-iteration cost is constrained? Arjevani et al. [3, 5] adopted a more structural approach where, instead of modeling how information regarding the function at hand is being collected, one models the update rules according to which iterates are being generated. 
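As a toy illustration of this structural viewpoint (a sketch of ours, not code from the paper): run gradient descent with the oblivious step size 1/L on the scalar family f_η(w) = ηw²/2 − w analyzed in Section 2.3 below. Since the step size ignores η, the k-th iterate is a polynomial of degree below k in η, matching the binomial expression of Equation (6):

```python
import math

# Gradient descent with the oblivious step size 1/L on f_eta(w) = eta*w^2/2 - w
# (minimizer 1/eta).  Illustration only: the iterate w^(k) is a polynomial in
# eta of degree < k, written here with the binomial coefficients of Eq. (6).

def gd_iterate(eta, L, k):
    w = 0.0
    for _ in range(k):
        w -= (1.0 / L) * (eta * w - 1.0)   # f_eta'(w) = eta*w - 1
    return w

def poly_iterate(eta, L, k):
    # (1/L) * sum_{i=0}^{k-1} (-1)^i * C(k, i+1) * (eta/L)^i
    return sum((-1) ** i * math.comb(k, i + 1) * (eta / L) ** i
               for i in range(k)) / L

L_, eta_, k_ = 10.0, 3.0, 8
w_gd, w_poly = gd_iterate(eta_, L_, k_), poly_iterate(eta_, L_, k_)
```

Both expressions equal (1/η)(1 − (1 − η/L)^k), so the error relative to the minimizer 1/η decays geometrically in k; this is exactly the polynomial-approximation phenomenon the lower-bound technique of Section 2.3 exploits.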
Concretely, they proposed the framework of p-CLI optimization algorithms where, roughly speaking, new iterates are assumed to form linear combinations of the previous p iterates and gradients, and the coefficients of these linear combinations are assumed to be either stationary (i.e., to remain fixed throughout the optimization process) or oblivious. Based on this structural assumption, they showed that the iteration complexity of minimizing smooth and strongly convex functions is $\tilde{\Omega}(\sqrt{\kappa}\ln(1/\epsilon))$. The fact that this lower bound is stronger than (3), in the sense that it does not depend on the dimension, confirms that controlling the functional form of the update rules allows one to derive tighter lower bounds. The framework of p-CLIs forms the nucleus of our formulation below.

2.2 Definitions

When considering lower bounds, one must be very precise as to the scope of optimization algorithms to which they apply. Below, we give formal definitions for oblivious stochastic CLI optimization algorithms and their iteration complexity (which serves as a crude proxy for their computational complexity).

Definition 1 (Class of Optimization Problems). 
A class of optimization problems is an ordered triple (F, I, O_f), where F is a family of functions defined over some domain designated by dom F, I is the side-information given prior to the optimization process, and O_f is a suitable oracle which, upon receiving x ∈ dom F and θ in the parameter set Θ, returns O_f(x, θ) ⊆ dom F for a given f ∈ F (we shall omit the subscript in O_f when f is clear from the context).

For example, in FSM, F contains functions as defined in (1), and the side-information contains the smoothness parameter L, the strong convexity parameter µ and the number of components n (although it has a crucial effect on the iteration complexity, e.g., [5], in this work we shall ignore the side-information and assume that all the parameters of the class are given). We shall assume that both first-order and coordinate-descent oracles (see (10), (11) below) are allowed to be used during the optimization process. Formally, this is done by introducing an additional parameter which indicates which oracle is being addressed. This added degree of freedom does not affect the validity of our lower bounds.

We now turn to rigorously define CLI optimization algorithms. Note that, compared with the definition of first-order p-CLIs provided in [5], here, in order to handle coordinate-descent and first-order oracles in a unified manner, we base our formulation on general oracle procedures.

Definition 2 (CLI). An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems (F, I, O_f), if, given an instance f ∈ F and initialization points {w_i^{(0)}}_{i∈J} ⊆ dom F, where J is some index set, it operates by iteratively generating points such that for any i ∈ J,
$$w_i^{(k+1)} \in \sum_{j\in J} O_f\left(w_j^{(k)};\, \theta_{ij}^{(k)}\right), \qquad k = 0, 1, \dots \qquad (4)$$
holds, where θ_ij^{(k)} ∈ Θ are parameters chosen, stochastically or deterministically, by the algorithm, possibly depending on the side-information. If the parameters do not depend on previously acquired oracle answers, we say that the given algorithm is oblivious. Lastly, algorithms with |J| ≤ p, for some p ∈ N, are denoted p-CLI.

Note that assigning different weights to different terms in (4) can be done through the θ_ij^{(k)} ∈ Θ (e.g., oracle (10) below). This allows a succinct definition of obliviousness. Lastly, we define iteration complexity.

Definition 3 (Iteration Complexity). The iteration complexity of a given CLI w.r.t. a given problem class (F, I, O_f) is defined to be the minimal number of iterations K such that
$$\mathbb{E}\left[f(w_1^{(k)}) - \min_{w\in\operatorname{dom}F} f(w)\right] < \epsilon, \qquad \forall f \in F,\; k \ge K,$$
where the expectation is taken over all the randomness introduced into the optimization process (choosing w_1^{(k)} merely serves as a convention and is not necessary for our bounds to hold).

2.3 Proof Technique - Deriving Lower Bounds via Approximation Theory

Consider the following parametrized class of L-smooth and µ-strongly convex optimization problems,
$$\min_{w\in\mathbb{R}} f_\eta(w) := \frac{\eta w^2}{2} - w, \qquad \eta \in [\mu, L]. \qquad (5)$$
Clearly, the minimizer of f_η is w*(η) := 1/η, with norm bounded by 1/µ. For simplicity, we will consider a special case, namely, vanilla gradient descent (GD) with step size 1/L, which produces new iterates as follows:
$$w^{(k+1)}(\eta) = w^{(k)}(\eta) - \frac{1}{L}f'_\eta(w^{(k)}(\eta)) = \left(1 - \frac{\eta}{L}\right)w^{(k)}(\eta) + \frac{1}{L}.$$
Setting the initialization point to be w^{(0)}(η) = 0, we derive an explicit expression for w^{(k)}(η):
$$w^{(k)}(\eta) = \frac{1}{L}\sum_{i=0}^{k-1}(-1)^i\binom{k}{i+1}(\eta/L)^i. \qquad (6)$$

Figure 2: The first four iterates of GD and AGD, which form polynomials in η, the parameter of problem (5), are compared to 1/η over [1, 4].

It turns out that each w^{(k)}(η) forms a univariate polynomial whose degree is at most k. Furthermore, since f_η(w) is L-smooth and µ-strongly convex for any η ∈ [µ, L], standard convergence analysis for GD (e.g., [13, Theorem 2.1.14]) guarantees that |w^{(k)}(η) − w*(η)| ≤ (1 − 2/(1 + κ))^{k/2} |w*(η)|, where κ denotes the condition number. Substituting Equation (6) for w^{(k)}(η) yields
$$\max_{\eta\in[\mu,L]}\left|\frac{1}{L}\sum_{i=0}^{k-1}(-1)^i\binom{k}{i+1}(\eta/L)^i - 1/\eta\right| \le \frac{1}{\mu}\left(1 - \frac{2}{1+\kappa}\right)^{k/2}.$$
Thus, we see that the faster the convergence rate of a given optimization algorithm, the better the induced sequence of polynomials (w^{(k)}(η))_{k≥0} approximates 1/η w.r.t. the maximum norm ‖·‖_{L_∞([µ,L])} over [µ, L]. In Fig. 
2, we compare the first four polynomials induced by GD and AGD. Not surprisingly, the AGD polynomials approximate 1/η better than those of GD.

Now, one may ask: assuming that the iterates of a given optimization algorithm A for (5) can be expressed as polynomials s_k(η) whose degree does not exceed the iteration number, just how fast can these iterates converge to the minimizer? Since the convergence rate is bounded from below by ‖s_k(η) − 1/η‖_{L_∞([µ,L])}, we may address the following question instead:
$$\min_{s(\eta)\in\mathcal{P}_k}\|s(\eta) - 1/\eta\|_{L_\infty([\mu,L])}, \qquad (7)$$
where P_k denotes the set of univariate polynomials whose degree does not exceed k. Problem (7) and other related settings are main topics of study in approximation theory. Accordingly, our technique for proving lower bounds makes extensive use of tools borrowed from this area. Specifically, in a paper from 1899 [21], Chebyshev showed that
$$\min_{s(\eta)\in\mathcal{P}_k}\left\|s(\eta) - \frac{1}{\eta - c}\right\|_{L_\infty([-1,1])} \ge \frac{\left(c - \sqrt{c^2-1}\right)^k}{c^2 - 1}, \qquad c > 1, \qquad (8)$$
by which we derive the following theorem (see Appendix A.1 for a detailed proof).

Theorem 1. The number of iterations required by A to get an ε-optimal solution is $\tilde{\Omega}(\sqrt{\kappa}\ln(1/\epsilon))$.

In the following sections, we apply oblivious CLIs to various parameterized optimization problems so that the resulting iterates are polynomials in the problem parameters. We then apply arguments
We then apply arguments\nsimilar to the above\nA similar reduction, from optimization problems to approximation problems, was used before in a\nfew contexts to analyze the iteration complexity of deterministic CLIs (e.g., [5, Section 3], see also\nConjugate Gradient convergence analysis [14]). But, what if we allow random algorithms? should we\nexpect the same iteration complexity? To answer this, we use Yao\u2019s minimax principle according to\nwhich the performance of a given stochastic optimization algorithm w.r.t. its worst input are bounded\nfrom below by the performance of the best deterministic algorithm w.r.t. distributions over the input\nspace. Thus, following a similar reduction one can show that the convergence rate of stochastic\nalgorithms is bounded from below by\n\nmin\ns(\u03b7)\u2208Pk\n\n\u00b5\n\n|s(\u03b7) \u2212 1/\u03b7|\n\n1\n\nL \u2212 \u00b5\n\nd\u03b7.\n\n(9)\n\nThat is, a lower bound for the stochastic case can be attained by considering an approximation\nproblem w.r.t. weighted L1 with the uniform distribution over [\u00b5, L]. Other approximation problems\nconsidered in this work involve L2-norm and different distributions. We provide a schematic\ndescription of our proof technique in Scheme 2.1.\n\nSCHEME 2.1\nGIVEN\n\nCHOOSE\nCOMPUTE\nBOUND\n\nFROM OPTIMIZATION PROBLEMS TO APPROXIMATION PROBLEMS\nA CLASS OF FUNCTIONS F , A SUITABLE ORACLE O\nAND A SEQUENCE OF SETS OF FUNCTION Sk OVER SOME PARAMETERS SET H.\nA SUBSET OF FUNCTIONS {f\u03b7 \u2208 F|\u03b7 \u2208 H}, S.T. wk(\u03b7) \u2208 Sk.\nTHE MINIMIZER w\u2217(\u03b7) FOR ANY f\u03b7\nFROM BELOW THE BEST APPROXIMATION FOR w\u2217(\u03b7) W.R.T. 
Sk\nAND A NORM (cid:107) \u00b7 (cid:107), I.E., min{(cid:107)s(\u03b7) \u2212 w\u2217(\u03b7)(cid:107) | s(\u03b7) \u2208 Sk}\n\n3 Lower Bound for Finite Sums Minimization Methods\n\nHaving described our analytic approach, we now turn to present some concrete applications, starting\nwith iteration complexity lower bounds in the context of FSM problems (1). In what follows, we\nderive a lower bound on the iteration complexity of oblivious (possibly stochastic) CLI algorithms\nequipped with \ufb01rst-order and coordinate-descent oracles for FSM. Strictly speaking, we focus on\noptimization algorithms equipped with both generalized \ufb01rst order oracle,\n\nO(w; A, B, c, j) = A\u2207fj(w) + Bw + c, A, B \u2208 Rd\u00d7d, c \u2208 Rd, j \u2208 [n],\n\n(10)\n\nand steepest coordinate-descent oracle\nO(w; i, j) = w + t\u2217ei,\n\nt\u2217 \u2208 argmin\nt\u2208R\n\nfj(w1, . . . , wi\u22121, wi + t, wi+1, . . . , wd), j \u2208 [n],\n\n(11)\n\nwhere ei denotes the i\u2019th unit vector. We remark that coordinate-descent steps w.r.t. partial gradients\ncan be implemented using (10) by setting A to be some principal minor of the unit matrix. It should\nbe further noted that our results below hold for scenarios where the optimization algorithm is free to\ncall a different oracle at different iterations.\nFirst, we sketch the proof of the lower bound for deterministic oblivious CLIs. Following Scheme\n2.1, we restrict our attention to a parameterized subset of problems. We assume2 d > 1 and denote by\n2Clearly, in order to derive a lower bound for coordinate-descent algorithms, we must assume d > 1. If only\n\na \ufb01rst-order oracle is allowed, then the same lower bound as in Theorem 2 can be derived for d = 1.\n\n6\n\n\fHFSM the set of all (\u03b71, . . . , \u03b7n) \u2208 Rn such that all the entries equal \u2212(L \u2212 \u00b5)/2, except for some\nj \u2208 [n], for which \u03b7j \u2208 [\u2212(L \u2212 \u00b5)/2, (L \u2212 \u00b5)/2]. Now, given \u03b7 := (\u03b71, . . 
. , η_n) ∈ H_FSM we define
$$F_\eta(w) := \frac{1}{n}\sum_{i=1}^n\left(\frac{1}{2}w^\top Q_{\eta_i}w - q^\top w\right), \;\text{where}\; Q_{\eta_i} := \begin{pmatrix} \frac{L+\mu}{2} & \eta_i & & & \\ \eta_i & \frac{L+\mu}{2} & & & \\ & & \mu & & \\ & & & \ddots & \\ & & & & \mu \end{pmatrix}, \; q := \begin{pmatrix} \frac{R\mu}{\sqrt{2}} \\ \frac{R\mu}{\sqrt{2}} \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \qquad (12)$$
It is easy to verify that the minimizers of (12) are
$$w^*(\eta) = \left(\frac{R\mu}{\sqrt{2}\left(\frac{L+\mu}{2} + \frac{1}{n}\sum_{i=1}^n \eta_i\right)},\; \frac{R\mu}{\sqrt{2}\left(\frac{L+\mu}{2} + \frac{1}{n}\sum_{i=1}^n \eta_i\right)},\; 0, \dots, 0\right)^\top. \qquad (13)$$
We would like to show that the coordinates of the iterates of deterministic oblivious CLIs, which minimize F_η using first-order and coordinate-descent oracles, form multivariate polynomials in η of total degree (the maximal sum of powers over all the terms) which does not exceed the iteration number. Indeed, if the coordinates of w_i^{(k)}(η) are multivariate polynomials in η of total degree at most k, then the coordinates of the vectors returned by both oracles,
$$\text{First-order oracle: } O(w_i^{(k)}; A, B, c, j) = A(Q_{\eta_j}w_i^{(k)} - q) + Bw_i^{(k)} + c, \qquad (14)$$
$$\text{Coordinate-descent oracle: } O(w_i^{(k)}; i', j) = \left(I - \frac{1}{(Q_{\eta_j})_{i'i'}}\,e_{i'}(Q_{\eta_j})_{i',*}\right)w_i^{(k)} + \frac{q_{i'}}{(Q_{\eta_j})_{i'i'}}\,e_{i'},$$
are multivariate polynomials of total degree at most k + 1, as all the parameters (A, B, c, i' and j) do not depend on η (due to obliviousness), and the remaining terms (Q_{η_j}, q, I, 1/(Q_{η_j})_{i'i'}, (Q_{η_j})_{i',*}, e_{i'} and q_{i'}) are either linear in η_j or constant. Now, since the next iterates are generated simply by summing up the oracle answers, they also form multivariate polynomials of total degree at most k + 1. Thus, denoting the first coordinate of w_1^{(k)}(η) by s(η) and using Inequality (8), we get the following bound:
$$\max_{\eta\in H_{FSM}}\|w_1^{(k)}(\eta) - w^*(\eta)\| \ge \left\|s(\eta) - \frac{R\mu}{\sqrt{2}\left(\frac{L+\mu}{2} + \frac{1}{n}\sum_{i=1}^n\eta_i\right)}\right\|_{L_\infty([\mu,L])} \qquad (15)$$
$$\ge \Omega(1)\left(\frac{\sqrt{\frac{\kappa-1}{n}+1}-1}{\sqrt{\frac{\kappa-1}{n}+1}+1}\right)^{k/n}, \qquad (16)$$
where Ω(1) designates a constant which does not depend on k (but may depend on the problem parameters). Lastly, this implies that for any deterministic oblivious CLI and any iteration number, there exists some η ∈ H_FSM such that the convergence rate of the algorithm, when applied to F_η, is bounded from below by Inequality (16). We note that, as opposed to other related lower bounds, e.g., [10], our proof is non-constructive. As discussed in Subsection 2.3, this type of analysis can be extended to stochastic algorithms by considering (15) w.r.t. other norms, such as the weighted L1-norm. We now arrive at the following theorem, whose proof, including the corresponding logarithmic factors and constants, can be found in Appendix A.2.

Theorem 2. The iteration complexity of oblivious (possibly stochastic) CLIs for FSM (1), equipped with first-order (10) and coordinate-descent (11) oracles, is bounded from below by
$$\tilde{\Omega}\left(n + \sqrt{n(\kappa-1)}\ln(1/\epsilon)\right).$$

The lower bound stated in Theorem 2 is tight and is attained by, e.g., SAG combined with an acceleration scheme (e.g., [11]). 
Moreover, as mentioned earlier, our lower bound does not depend on the problem dimension (or, equivalently, holds for any number of iterations, regardless of d and n), and covers coordinate-descent methods with a stochastic or deterministic coordinate schedule (in the special case where n = 1, this gives a lower bound for minimizing smooth and strongly convex functions by performing steepest coordinate-descent steps). Also, our bound implies that using mini-batches for tackling FSM does not reduce the overall iteration complexity. Lastly, it is noteworthy that the n term in the lower bound above holds for any algorithm equipped with an incremental oracle, which grants access to at most one individual function at a time.

We also derive a nearly-optimal lower bound for smooth non-strongly convex functions in the more restricted setting of n = 1 and a first-order oracle. The parameterized subset of functions we use (see Scheme 2.1) is g_η(x) := (η/2)‖x‖² − Rη e_1^⊤ x, η ∈ (0, L]. The corresponding minimizer (as a function of η) is x*(η) = R e_1, and in this case we seek to approximate it w.r.t. the L2-norm using degree-k univariate polynomials whose constant term vanishes. The resulting bound is dimension-free and improves upon other bounds for this setting (e.g., [5]) in that it applies to deterministic as well as stochastic algorithms (see Appendix A.3 for the proof).

Theorem 3. The iteration complexity of any oblivious (possibly stochastic) CLI for L-smooth convex functions equipped with a first-order oracle is bounded from below by
$$\Omega\left((L(\delta-2)/\epsilon)^{1/\delta}\right), \qquad \delta \in (2, 4).$$

4 Lower Bound for Dual Regularized Loss Minimization with Linear Predictors

The form of the functions (12) discussed in the previous section does not readily adapt to general RLM problems with linear predictors, i.e.,
$$\min_{w\in\mathbb{R}^d} P(w) := \frac{1}{n}\sum_{i=1}^n \phi_i(\langle x_i, w\rangle) + \frac{\lambda}{2}\|w\|^2, \qquad (17)$$
where the loss functions φ_i are L-smooth and convex, the samples x_1, . . . , x_n are d-dimensional vectors in R^d, and λ is some positive constant. Thus, dual methods which exploit the added structure of this setting through the dual problem [18],
$$\min_{\alpha\in\mathbb{R}^n} D(\alpha) = \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) + \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n x_i\alpha_i\right\|^2, \qquad (18)$$
such as SDCA and accelerated proximal SDCA, are not covered by Theorem 2. Accordingly, in this section, we address the iteration complexity of oblivious (possibly stochastic) CLI algorithms equipped with dual RLM oracles:
$$O(\alpha; t, j) = \alpha + t\nabla_j D(\alpha)e_j, \qquad t \in \mathbb{R},\; j \in [n], \qquad (19)$$
$$O(\alpha; j) = \alpha + t^* e_j, \qquad t^* = \operatorname*{argmin}_{t\in\mathbb{R}} D(\alpha_1, \dots, \alpha_{j-1}, \alpha_j + t, \alpha_{j+1}, \dots, \alpha_n), \qquad j \in [n].$$
Following Scheme 2.1, we first describe the relevant parametrized subset of RLM problems. For the sake of simplicity, we assume that n is even (the proof for odd n holds mutatis mutandis). We denote by H_RLM the set of all (ψ_1, . . 
. , ψ_{n/2}) ∈ R^{n/2} such that all entries are 0, except for some j ∈ [n/2], for which ψ_j ∈ [−π/2, π/2]. Now, given ψ ∈ H_RLM, we set P_ψ (defined in (17)) as follows:
$$\phi_i(w) = \frac{1}{2}(w+1)^2, \qquad x_{\psi,i} = \begin{cases} \cos(\psi_{(i+1)/2})e_i + \sin(\psi_{(i+1)/2})e_{i+1} & i \text{ is odd,} \\ e_i & \text{o.w.} \end{cases}$$
We state below the corresponding lower bound, whose proof, including logarithmic factors and constants, can be found in Appendix A.4.

Theorem 4. The iteration complexity of oblivious (possibly stochastic) CLIs for RLM (17) equipped with dual RLM oracles (19) is bounded from below by
$$\tilde{\Omega}\left(n + \sqrt{nL/\lambda}\,\ln(1/\epsilon)\right).$$

This bound is tight w.r.t. the class of oblivious CLIs and is attained by accelerated proximal SDCA. As mentioned earlier, a tighter lower bound of $\tilde{\Omega}((n + 1/\lambda)\ln(1/\epsilon))$ is known for SDCA [3], suggesting that a tighter bound might hold for the more restricted set of stationary CLIs (for which the oracle parameters remain fixed throughout the optimization process).", "award": [], "sourceid": 1772, "authors": [{"given_name": "Yossi", "family_name": "Arjevani", "institution": "Weizmann Institute of Science"}, {"given_name": "Ohad", "family_name": "Shamir", "institution": "Weizmann Institute of Science"}]}