{"title": "Variance-Reduced Stochastic Gradient Descent on Streaming Data", "book": "Advances in Neural Information Processing Systems", "page_first": 9906, "page_last": 9915, "abstract": "We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the sub-optimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.", "full_text": "Variance-Reduced Stochastic Gradient Descent\n\non Streaming Data\n\nEllango Jothimurugesan\u2217\u2020\nCarnegie Mellon University\nejothimu@cs.cmu.edu\n\nPhillip B. Gibbons\u2020\n\nCarnegie Mellon University\n\ngibbons@cs.cmu.edu\n\nAshraf Tahmasbi\u2217\u2021\nIowa State University\n\ntahmasbi@iastate.edu\n\nSrikanta Tirthapura\u2021\nIowa State University\nsnt@iastate.edu\n\nAbstract\n\nWe present an algorithm STRSAGA that can ef\ufb01ciently maintain a machine learning\nmodel over data points that arrive over time, and quickly update the model as new\ntraining data are observed. We present a competitive analysis that compares the sub-\noptimality of the model maintained by STRSAGA with that of an of\ufb02ine algorithm\nthat is given the entire data beforehand. 
Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of an offline algorithm on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.

1 Introduction

We consider the maintenance of a model over streaming data that are arriving as an endless sequence of data points. At any point in time, the goal is to fit the model to the training data points observed so far, in order to accurately predict/label unobserved test data. Such a model is never "complete" but instead needs to be continuously updated as newer training data points arrive. Methods that recompute the model from scratch upon the arrival of new data points are infeasible due to their high computational costs, and hence we need methods that efficiently update the model as more data arrive. Such efficiency should not come at the expense of accuracy: the accuracy of the model maintained through such updates should be close to that obtained if we were to build a model from scratch, using all the training data points seen so far.
Fitting a model is usually cast as an optimization problem, where the model parameters are those that optimize an objective function. In typical cases, the objective function is the empirical or regularized risk, usually the sum of a finite number of terms, and often assumed to be convex. Consider a stream of training data points in which the set Si of points arriving before or at time i consists of ni data points. Let w denote the set of parameters characterizing the learned function. The empirical risk function R_{Si} measures the average loss of w over Si: R_{Si}(w) = (1/ni) Σ_{j=1}^{ni} f_j(w), where f_j(w) is the loss of w on data point j. The goal is to find the empirical risk minimizer (ERM), i.e., the parameters w* that minimize the empirical risk over all data points observed so far.
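The empirical risk above is just an average of per-point losses, and is straightforward to evaluate directly. A minimal sketch in Python (the squared loss, the toy points, and the helper name `empirical_risk` are illustrative placeholders, not from the paper):

```python
import numpy as np

def empirical_risk(w, S, loss):
    """R_S(w) = (1/n) * sum of loss(w, x) over the n points x in S."""
    return sum(loss(w, x) for x in S) / len(S)

# Illustrative squared loss on 1-d points: f_x(w) = (w - x)^2.
loss = lambda w, x: (w - x) ** 2
S_i = [1.0, 2.0, 3.0]

# For this loss, the ERM w* over S_i is simply the mean of the points,
# so it achieves no larger empirical risk than any other w.
w_star = np.mean(S_i)
assert empirical_risk(w_star, S_i, loss) <= empirical_risk(0.0, S_i, loss)
```

In the streaming setting of the paper, S_i grows at every time step, so the minimizer of this quantity is a moving target.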
Typically, some form of gradient descent is used in pursuit of w*.

*EJ and AT contributed equally to this work.
†Supported in part by NSF grant 1725663
‡Supported in part by NSF grants 1527541 and 1725702

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

There are two common approaches: batch learning and incremental learning (sometimes called "online learning") [BL03, Ber16]. Batch learning uses all available data points in the training set to compute the gradient for each step of gradient descent, which makes each gradient computation expensive, especially for a large dataset. In contrast, an incremental learning algorithm operates on only a single data point at each step of gradient descent, and hence a single step of an incremental algorithm is much faster than a corresponding step of a batch algorithm. Incremental algorithms, e.g., Stochastic Gradient Descent (SGD) [RM51, BL03] and variance-reduced improvements such as SVRG [JZ13] and SAGA [DBLJ14], have been found to be more effective on large datasets than batch algorithms, and are widely used.
Both batch and incremental algorithms assume that all training data points are available in advance; we refer to such algorithms as offline algorithms. However, in the setting that we consider, data points arrive over time according to an unknown arrival distribution, and neither batch nor incremental algorithms are able to update the model efficiently as more data arrives. Though incremental learning algorithms use only a single data point in each iteration, they typically select that point from the entire set of training data points. This set of training data points is constantly changing in our setting, rendering incremental algorithms inapplicable.
In the rest of the paper, we refer to an algorithm\nthat can ef\ufb01ciently update a model upon the arrival of new training data points as a streaming data\nalgorithm. Note that streaming data algorithms (which are not limited in their memory usage) are\nbroader than traditional streaming algorithms (which work in a single pass with limited memory).\nStreaming data algorithms are relevant in many practical settings given the abundance of memory\nthese days.\nThe optimization goal of a streaming data algorithm is to maintain a model using all the data points\nthat have arrived so far, such that the model\u2019s empirical risk is close to the ERM over those data\npoints. The challenges include (i) because the training data is changing at each time step, the ERM\non streaming data is a \u201cmoving target\u201d; (ii) the ERM is an optimal solution that cannot be realized\nin limited processing time, while a streaming data algorithm is not only limited in processing time,\nbut is also presented the data points only sequentially; (iii) with increasing arrival rates, it becomes\nincreasingly dif\ufb01cult for the streaming data algorithm to keep up with the ERM; and (iv) data points\nmay not arrive at a steady rate: the numbers of points arriving at different points in time can be\nhighly skewed. We present and analyze a streaming data algorithm, STRSAGA, that overcomes these\nchallenges and achieves an empirical risk close to the ERM in a variety of settings.\nContributions. We present STRSAGA, a streaming data algorithm for maintaining a model. STRSAGA\nsees data points in a sequential manner, and can ef\ufb01ciently incorporate newly arrived data points into\nthe model. Yet, its accuracy at each point in time is comparable to that of an of\ufb02ine algorithm that has\naccess to all training data points in advance. 
We prove this using a \u201ccompetitive analysis\u201d framework\nthat compares the accuracy of STRSAGA to a state-of-the-art of\ufb02ine algorithm DYNASAGA [DLH16],\nwhich is based on variance-reduced SGD. We show that given the same computational power, the\naccuracy of STRSAGA is competitive to DYNASAGA at each point in time, under certain conditions on\nthe schedule of input arrivals. Our notion of \u201crisk-competitiveness\u201d is based on the sub-optimality\nof risk with respect to the ERM (\u201csub-optimality\u201d is de\ufb01ned in Section 3). Our theoretical analysis\nrelies on a connection between the \u201ceffective sample size\u201d of the algorithm (whether a streaming data\nalgorithm or an of\ufb02ine algorithm) and its sub-optimality of risk with respect to the ERM. We show\nthat if a streaming data algorithm is \u201csample-competitive\u201d to an of\ufb02ine algorithm, i.e., its effective\nsample size is close to that of an of\ufb02ine algorithm, then it is also risk-competitive to the of\ufb02ine\nalgorithm.\nA key aspect of our work is that we carefully consider the schedule of arrivals of the data points\u2014we\ncare not only about which training data points have arrived so far, and how many of them, but\nalso about when they arrived. In our setting where the streaming data algorithm is computationally\nbounded, it is not possible to be always risk-competitive with an of\ufb02ine algorithm. However, we show\nthat it is possible to achieve risk-competitiveness, if the schedule of arrivals of training data points\nobeys certain conditions that we lay out in Section 5. We show that these conditions are satis\ufb01ed by\na number of common arrival distributions, including Poisson arrivals and many classes of skewed\narrivals. 
For all these arrival distributions, we show that STRSAGA is risk-competitive to DYNASAGA.
Our experimental results for two machine learning tasks, logistic regression and matrix factorization, on two real data sets each, support our analytical findings: the sub-optimality of STRSAGA on data points arriving over time (according to a variety of input arrival distributions) is almost always comparable to that of the offline algorithm DYNASAGA that is given all data points in advance, when each algorithm is given the same computational power. We also show that STRSAGA significantly outperforms natural streaming data versions of both SGD and SSVRG [FGKS15]. Moreover, the update time of STRSAGA is small, making it practical even for settings where the arrival rate is high.

2 Related work

Stochastic Gradient Descent (SGD) [RM51] and its extensions are used extensively in practice for learning from large datasets. While an iteration of SGD is cheap relative to an iteration of a full gradient method, its variance can be high. To control the variance, the learning rate of SGD must decay over time, resulting in a sublinear convergence rate. Newer variance-reduced versions of SGD, on the other hand, achieve linear convergence on strongly convex objective functions, generally by incorporating a correction term in each update step that approximates a full gradient, while still ensuring each iteration is efficient like SGD.
SAG [RSB12] was among the first variance reduction methods proposed and achieves a linear convergence rate for smooth and strongly convex problems. SAG requires storage of the last gradient computed for each data point and uses their average in each update. SAGA [DBLJ14] improves on SAG by eliminating a bias in the update.
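As a concrete illustration of the variance-reduced update these methods share, here is a minimal SAGA-style step in Python. This is a sketch under simplifying assumptions (a generic per-point gradient function `grad`, a toy least-squares problem), not code from any of the cited papers:

```python
import numpy as np

def saga_step(w, data, grads, avg_grad, grad, eta, rng):
    """One SAGA iteration: an unbiased, variance-reduced gradient step."""
    j = rng.integers(len(data))
    g_new = grad(w, data[j])
    # Variance-reduced direction: new gradient, minus the stale stored
    # gradient for point j, plus the running average of stored gradients.
    direction = g_new - grads[j] + avg_grad
    # Maintain the running average incrementally, then refresh the table.
    avg_grad = avg_grad + (g_new - grads[j]) / len(data)
    grads[j] = g_new
    return w - eta * direction, avg_grad

# Toy problem: grad of f_x(w) = 0.5 * (w - x)^2 is (w - x).
grad = lambda w, x: w - x
data = np.array([1.0, 2.0, 3.0])
grads = np.array([grad(0.0, x) for x in data])  # table initialized at w = 0
avg = grads.mean()
rng = np.random.default_rng(0)
w = 0.0
for _ in range(200):
    w, avg = saga_step(w, data, grads, avg, grad, 0.1, rng)
# w approaches the minimizer of the average loss, the mean of the data.
```

Note that at the optimum the stored gradients cancel the sampling noise exactly, which is why no decaying learning rate is needed, unlike plain SGD.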
Stochastic Variance-Reduced Gradient (SVRG) [JZ13]\nis another variance reduction method that does not store the computed gradients, but periodically\ncomputes a full-data gradient, requiring more computation than SAGA. Semi-Stochastic Gradient De-\nscent (S2GD) [KR13] is a variant of SVRG where the gaps between full-data gradient computations\nare of random length and follow a geometric law. CHEAPSVRG [SAKS16] is another variant of\nSVRG. In contrast with SVRG, it estimates the gradient through computing the gradient on a subset\nof training data points rather than all the data points. However, all of the above variance-reduced\nmethods require O(n log n) iterations to guarantee convergence to statistical accuracy (to yield a\ngood \ufb01t to the underlying data) for n data points. DYNASAGA [DLH16] achieves statistical accuracy\nin only O(n) iterations by using a gradually increasing sample set and running SAGA on it.\nSo far, all the algorithms we have discussed are of\ufb02ine, and assume the entire dataset is available\nbeforehand. Streaming SVRG (SSVRG) [FGKS15] is an algorithm that handles streaming data arrivals,\nand processes them in a single pass through data, using limited memory. In our experimental study,\nwe found STRSAGA to be signi\ufb01cantly more accurate than SSVRG. Further, our analysis of STRSAGA\nshows that it handles arrival distributions which allow for burstiness in the stream, while SSVRG is\nnot suited for this case. In many practical situations, restricting a streaming data algorithm to use\nlimited memory is overly restrictive and as our results show, leads to worse accuracy.\n\n3 Model and preliminaries\n\nWe consider a data stream setting in which the training data points arrive over time. For i =\n1, 2, 3, . . . , let Xi be the set of zero or more training data points arriving at time step i. 
We assume that each training data point is drawn from a fixed distribution P, which is not known to the algorithm. Dealing with distributions that change over time is beyond the scope of this paper. Let Si = ∪_{j=1}^{i} Xj denote the set of data points that have arrived in time steps 1 through i (inclusive). Let ni denote the number of data points in Si.
The model being trained/maintained is drawn from a class of functions F. A function in this class is parameterized by a vector of weights w ∈ R^d. For a function w, we define its expected risk as R(w) = E[f_x(w)], where f_x(w) is the loss of function w on input x and the expectation is taken over x drawn from distribution P. Let function w* = arg min_{w∈F} R(w) denote the optimal function with respect to R(w). Let R* = R(w*) denote the minimum expected risk possible over distribution P, within function class F. The function w* is called the distributional risk minimizer. Given a sample S of training data points drawn from P, the best we can do is minimize the empirical risk over this sample. We have analogous definitions for minimizers of empirical risk over this sample. The empirical risk of function w over a sample S of n elements is: R_S(w) = (1/n) Σ_{x∈S} f_x(w). The optimizer of the empirical risk is denoted as w*_S, defined as w*_S = arg min_{w∈F} R_S(w). The optimal empirical risk is R*_S = R_S(w*_S). We denote the optimizer of the empirical risk over Si as w*_i = w*_{Si}. Similarly, the optimal empirical risk over Si is R*_i = R_{Si}(w*_i).
Suppose a risk minimization algorithm is given a set of training examples Si, and outputs approximate solution wi. The statistical error is E[R(w*_i) − R(w*)] and the optimization error is E[R(wi) − R(w*_i)], where the expectation is taken over the randomness of Si. The total error (restricting to a fixed function class F) is the sum of the two.
Following Bottou and Bousquet [BB07], we define sub-optimality as follows.
Definition 1. The sub-optimality of an algorithm A over training data S is the difference between A's empirical risk and the optimal empirical risk:

SUBOPT_S(A) := R_S(w) − R_S(w*_S)

where w is the solution returned by A on S and w*_S is the empirical risk minimizer over S.
Let H(n) = cn^{−α}, for a constant c and 1/2 ≤ α ≤ 1, be an upper bound on the statistical error. Bottou and Bousquet [BB07] show that if ε is a bound on the sub-optimality of w on Si, then the total error is bounded by H(ni) + ε. Therefore, in designing an efficient algorithm for streaming data, we focus on reducing the sub-optimality to asymptotically balance with H(ni): it does not pay to reduce the empirical risk even further. Note that although H(ni) is only an upper bound on the statistical error, Bottou and Bousquet remark "it is often accepted that these upper bounds give a realistic idea of the actual convergence rates" [BB07], in which case balancing the sub-optimality with H(ni) asymptotically minimizes the total error.
We focus on time-efficient algorithms for maintaining a model over streaming data. We focus on a basic step used in all SGD-style algorithms (or variants such as SAGA): a random training point x is chosen from a set of training samples, and the vector w is updated through a gradient computed at point x. Let ρ ≥ 1 denote the number of such basic steps that can be performed in a single time step.

4 STRSAGA: gradient descent over streaming data

We present our algorithm STRSAGA for learning from streaming data.
Stochastic gradient descent (or\none of its variants, such as SAGA [DBLJ14]) works by repeatedly sampling a point from a training\nset T and using its gradient to determine an update direction. One option to handle streaming data\narrivals is to simply expand the set T from which further sampling is conducted, by adding all the new\narrivals. However, the problem with this approach is that the size of the training set T can change in\nan uncontrolled manner, depending on the number of arrivals. As illustrated in prior work [DLH16],\nthe optimization error of SAGA increases with the size of the training set T . With an uncontrolled\nincrease in the size of T , the corresponding sub-optimality of the algorithm over T increases, so that\nthe function that is \ufb01nally computed may have poor accuracy.\nTo handle this, we use an idea from DYNASAGA [DLH16], which increases the size of the training set\nT in a controlled manner, according to a schedule. Upon increasing the size of T , further increases\nare placed on hold until a suf\ufb01cient number of SAGA steps have been performed on the current\nstate of T . By using this idea, DYNASAGA was able to achieve statistical accuracy earlier than SAGA.\nHowever, DYNASAGA is still an of\ufb02ine algorithm that assumes that all training data is available in\nadvance.\nSTRSAGA deals with streaming arrivals as follows. Arriving points from the next set of points Xi\nare added to a buffer Buf. The effective sample set T is expanded in a controlled manner, similar to\nDYNASAGA. However, instead of choosing new points from a static training set, such as in DYNASAGA,\nSTRSAGA chooses new points from the dynamically changing buffer Buf. If Buf is empty, then\navailable CPU cycles are used to perform further steps of SAGA. 
After any time step, it is possible that STRSAGA may have trained over only a subset of the points that are available in Buf, but this is to ensure that the optimization error on the subset that has been trained is balanced with the statistical error of the effective sample size. Algorithm 1 depicts the steps taken to process the zero or more points Xi arriving at time step i. Before any input is seen, the algorithm initializes buffer Buf to empty, effective sample T0 to empty, and function w0 to random values. STRSAGA as described here uses the basic framework of DYNASAGA, of adding one training point to Ti every two steps of SAGA (the linear schedule in [DLH16]), and both algorithms borrow variance-reduction steps from SAGA (lines 8-9 in Algorithm 1 and using the running average A of all gradients).
Analysis of STRSAGA: Suppose data points Si have been seen till time step i, and ni = |Si|. We first note that the time taken to process a set of training points Xi is dominated by the time taken for ρ iterations of SAGA. Ideally, the empirical risk of the solution returned by STRSAGA is close to the empirical risk of the ERM over Si. However, this is not possible in general. Suppose the number

Algorithm 1: STRSAGA: Process a set of training points Xi that arrived in time step i, i > 0.
// w_{i−1} is the current function,
T_{i−1} is the effective sample set.

1:  Add Xi to Buf // Buf is the set of training points not yet added to Ti
2:  w̃_0 ← w_{i−1} and Ti ← T_{i−1}
3:  for j ← 1 to ρ do // do ρ steps of SAGA at each time step
4:    // Every two steps of SAGA, add one training point to Ti, if available
5:    if (Buf is non-empty) AND (j is even) then
6:      Move a single point, z, from Buf to Ti
7:      α(z) ← 0 // α(z) is the prior gradient of z, initialized to 0
8:    Sample a point p uniformly from Ti
9:    A ← (Σ_{q∈Ti} α(q)) / |Ti| // A is the average of all stored gradients, used by SAGA; it can be maintained incrementally
10:   g ← ∇f_p(w̃_{j−1}) // compute the gradient
11:   w̃_j ← w̃_{j−1} − η(g − α(p) + A) // η is the learning rate
      α(p) ← g
12: w_i ← w̃_ρ // w_i is the current function and Ti is the effective sample set

of points arriving at each time step i were much greater than ρ, the number of iterations of SAGA that can be performed at each step. Then not even an offline algorithm such as DYNASAGA that has all points at the beginning of time could be expected to match the empirical risk of the ERM within the available time. In what follows, we present a competitive analysis, where the performance of STRSAGA is compared with that of an offline algorithm that has all data available to it in advance. We consider two offline algorithms, ERM and DYNASAGA(ρ), described below.
Algorithm ERM is the empirical risk minimizer, sees all of Si at the beginning of time, and has infinite computational power to process it. A streaming data algorithm has two obstacles if it has to compete with ERM: (i) unlike ERM, a streaming data algorithm does not have all data in advance, and (ii) unlike ERM, a streaming data algorithm has limited computational power.
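A compact Python rendering of one time step of Algorithm 1 may help make the buffer and effective-sample bookkeeping concrete. This is a sketch of the logic as described above, not the authors' implementation; the per-point gradient function `grad` and the class name are assumptions:

```python
import numpy as np

class StrSaga:
    def __init__(self, dim, eta, rho, grad, seed=0):
        self.w = np.zeros(dim)   # current function w
        self.T = []              # effective sample set
        self.buf = []            # arrived points not yet added to T
        self.alphas = []         # stored gradient alpha(z) per point in T
        self.A = np.zeros(dim)   # running average of stored gradients
        self.eta, self.rho, self.grad = eta, rho, grad
        self.rng = np.random.default_rng(seed)

    def process_time_step(self, X_i):
        self.buf.extend(X_i)                   # line 1: buffer new arrivals
        for j in range(1, self.rho + 1):       # rho SAGA steps per time step
            if self.buf and j % 2 == 0:        # every two steps, grow T
                z = self.buf.pop(0)
                self.T.append(z)
                self.alphas.append(np.zeros_like(self.w))
                # New point enters the average with stored gradient 0.
                self.A = self.A * (len(self.T) - 1) / len(self.T)
            if not self.T:
                continue                        # nothing to train on yet
            p = self.rng.integers(len(self.T))
            g = self.grad(self.w, self.T[p])
            self.w -= self.eta * (g - self.alphas[p] + self.A)
            # Maintain A incrementally (line 9), then refresh the table.
            self.A += (g - self.alphas[p]) / len(self.T)
            self.alphas[p] = g
```

Maintaining A incrementally, as the comment on line 9 of Algorithm 1 notes, keeps each iteration at one gradient evaluation regardless of |Ti|.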
It is clear that no streaming data algorithm can do better than ERM. We can practically approach the performance of ERM by executing DYNASAGA until convergence is achieved.
Algorithm DYNASAGA(ρ) sees all of Si at the beginning of time, and is given ρ iterations of gradient computations in each step. The parenthetical ρ denotes that this algorithm is the extension of the original DYNASAGA [DLH16], parameterized by the available amount of processing time. The algorithm DYNASAGA performs 2ni steps of gradient computations on Si and then terminates, while DYNASAGA(ρ) performs ρi steps, where if ρi > 2ni, the additional steps are uniformly over Si. The computational power of DYNASAGA(ρ) over i time steps matches that of a streaming data algorithm. However, DYNASAGA(ρ) is still more powerful than a streaming data algorithm, because it can see all data in advance. In general, it is not possible for a streaming data algorithm to compete with DYNASAGA(ρ) either, one issue being that streaming arrivals may be very bursty. Consider the extreme case when all of Si arrives in the ith time step, and there were no arrivals in time steps 1 through i − 1. An algorithm for streaming data has only ρ gradient computation steps that it can perform on ni points, and its earlier ρ(i − 1) gradient steps had no data to use. In contrast, DYNASAGA(ρ) can perform ρi gradient steps on Si, and achieve a smaller empirical risk.
Each algorithm STRSAGA, DYNASAGA(ρ), and ERM, after seeing Si, has trained its model on a subset Ti ⊆ Si. We call this subset the "effective sample set". Let t^STR_i, t^D_i, and t^ERM_i denote the sizes of the effective sample sets of STRSAGA, DYNASAGA(ρ), and ERM, respectively, after i time steps. The following lemma shows that the expected sub-optimality of DYNASAGA(ρ) over Si is related to t^D_i.
Lemma 1 (Lemma 5 in [DLH16]). After i time steps, t^D_i = min{ni, ρi/2}, and t^ERM_i = ni. The expected sub-optimality of DYNASAGA(ρ) over Si after i time steps is O(H(t^D_i)).
Our goal is for a streaming data algorithm to achieve an empirical risk that is close to the risk of an offline algorithm. We present our notion of risk-competitiveness in Definition 2.
Definition 2. For c ≥ 1, a streaming data algorithm I is said to be c-risk-competitive to DYNASAGA(ρ) at time step i if E[SUBOPT_{Si}(I)] ≤ cH(t^D_i). Similarly, I is said to be c-risk-competitive to ERM at time step i if E[SUBOPT_{Si}(I)] ≤ cH(ni).
Note that the expected sub-optimality of I is compared with H(t^D_i) and H(ni), which are upper bounds on the statistical errors of DYNASAGA(ρ) and ERM respectively. If H() is a tight bound on the statistical error, and hence a lower bound on the total error, then c-risk-competitiveness to DYNASAGA(ρ) implies that the expected sub-optimality of the algorithm I is within a factor of c of the total risk of DYNASAGA(ρ), as illustrated in Figure 1. We next show that if a streaming data algorithm is risk-competitive with respect to DYNASAGA(ρ) then it is also risk-competitive with respect to ERM, under certain conditions.

Figure 1: The error of each algorithm.

Lemma 2. If a streaming data algorithm I is c-risk-competitive to DYNASAGA(ρ) at time step i, and the statistical risk H(n) = n^{−α}, then I is c · max((2λ̃_i/ρ)^α, 1)-risk-competitive to ERM at time step i, where λ̃_i = (ni/i) and ni is the size of Si.
Proof. From Definition 2 we have: E[SUBOPT_{Si}(I)] ≤ cH(t^D_i). We know t^D_i = min(ni, ρi/2) (Lemma 1). First consider the case when ni ≤ ρi/2. We have: E[SUBOPT_{Si}(I)] ≤ cH(t^D_i) = cH(ni). Therefore, for this case, I is c-risk-competitive to Algorithm ERM. In the other case, when ni > ρi/2, we have: t^D_i = ρi/2 = (ρ/(2λ̃_i)) ni. Further, E[SUBOPT_{Si}(I)] ≤ cH(t^D_i) = cH((ρ/(2λ̃_i)) ni) = c (2λ̃_i/ρ)^α H(ni). □
Discussion: λ̃_i = (ni/i) is the average rate of arrivals in a time step. We expect the ratio (λ̃_i/ρ) to be a small constant. If this ratio is a large number, much greater than 1, the total number of arrivals over i time steps far exceeds the number of gradient computations the algorithm can perform over i time steps. This rate of arrivals is unsustainable, because most practical algorithms such as SGD and variants, including SVRG and SAGA, require more than one gradient computation for each training point. Hence, the above lemma implies that if I is O(1)-risk-competitive to DYNASAGA(ρ), then it is also O(1)-risk-competitive to ERM, under reasonable arrival patterns.
Finally, we will bound the expected sub-optimality of STRSAGA over its effective sample set Ti in Lemma 3. The proof of this lemma is presented in the supplementary material. In Section 5, we will show how to apply the following result to establish the risk-competitiveness of STRSAGA.
Lemma 3. Suppose all f_x are convex and their gradients are L-Lipschitz continuous, and that R_{Ti} is µ-strongly convex.
At the end of each time step i, the expected sub-optimality of STRSAGA over Ti is

E[SUBOPT_{Ti}(STRSAGA)] ≤ H(t^STR_i) + 2 (R(w0) − R(w*)) (L/µ)^3 (1/t^STR_i)^2.

If we additionally assume that the condition number L/µ is bounded by a constant at each time, the above simplifies to E[SUBOPT_{Ti}(STRSAGA)] ≤ (1 + o(1)) H(t^STR_i).

5 Competitive analysis of STRSAGA on specific arrival distributions

Lemma 3 shows that the expected sub-optimality of STRSAGA over its effective sample set Ti is O(H(t^STR_i)) (note that t^STR_i is not equal to ni, the number of points so far). However, our goal is to show that STRSAGA is risk-competitive to DYNASAGA(ρ) at each time step i; i.e., the expected sub-optimality of STRSAGA over Si is within a factor of H(t^D_i). The connection between the two depends on the relation between t^STR_i and t^D_i. This relation is captured using sample-competitiveness, which is introduced in this section. Although not every arrival distribution provides sample-competitiveness, we will show a number of different patterns of arrival distributions that do provide this property. To model different arrival patterns, we consider a general arrival model where the number of points arriving in time step i is a random variable xi which is independently drawn from distribution P with a finite mean λ. We consider arrival distributions of varying degrees of generality, including Poisson arrivals, skewed arrivals, general arrivals with a bounded maximum, and general arrivals with an unbounded maximum. The proofs of results about specific distributions, as well as the full statements of prior theorems and bounds referenced below, can be found in the supplementary material.
Definition 3. At time i, STRSAGA is said to be k-sample-competitive to DYNASAGA(ρ) if t^STR_i / t^D_i ≥ k.
Lemma 4. If STRSAGA is k-sample-competitive to DYNASAGA(ρ) at time step i, then it is c-risk-competitive to DYNASAGA(ρ) at time step i with c = k^{−α}(2 + o(1)).
Proof. Let T^STR_i, T^D_i denote the effective samples that were used at iteration i for STRSAGA and DYNASAGA(ρ), respectively. We know that T^STR_i, T^D_i ⊆ Si. Using Theorem 3 from [DLH16], we have: E[SUBOPT_{Si}(STRSAGA)] ≤ E[SUBOPT_{Ti}(STRSAGA)] + ((ni − t^STR_i)/ni) H(t^STR_i). Using Lemma 3, we can rewrite the above inequality as: E[SUBOPT_{Si}(STRSAGA)] ≤ (1 + o(1)) H(t^STR_i) + ((ni − t^STR_i)/ni) H(t^STR_i) ≤ (2 + o(1)) H(t^STR_i). If STRSAGA is k-sample-competitive to DYNASAGA(ρ), then we have: E[SUBOPT_{Si}(STRSAGA)] ≤ (2 + o(1)) H(k · t^D_i) = k^{−α}(2 + o(1)) H(t^D_i), completing the proof. □
Lemma 5. At time step i, suppose the streaming arrivals satisfy: n_{i/2} ≥ k·ni. Then, STRSAGA is min{k, 1/2}-sample-competitive to DYNASAGA(ρ) at time step i.
Proof. We first bound t^STR_i. At time i/2, at least k·ni points have arrived. In Algorithm 1, at time i/2, these points are either in Buf or already in the effective sample T^STR_{i/2}. We note that for every two iterations of SAGA, the algorithm moves one point from Buf (if available) to the effective sample, thus increasing the size of the effective sample set by 1. In the i/2 time steps from i/2 + 1, . . . , i, STRSAGA can perform ρi/2 iterations of SAGA. Within these iterations, it can move ρi/4 points to T^STR_i, if available in the buffer.
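The quantities compared in Lemmas 4 and 5 are easy to track numerically. The following sketch (with a hypothetical steady arrival sequence; the bounds follow Lemma 1 and the proof of Lemma 5 as described above) checks the sample-competitiveness inequality per time step:

```python
def competitiveness(arrivals, rho):
    """For each time step i, return a lower bound on t_i^STR (proof of
    Lemma 5), the exact t_i^D (Lemma 1), and the ratio k = n_{i/2}/n_i."""
    prefix = [0]                       # prefix[i] = n_i, arrivals by time i
    for x in arrivals:
        prefix.append(prefix[-1] + x)
    rows = []
    for i in range(1, len(arrivals) + 1):
        n_i, n_half = prefix[i], prefix[i // 2]
        t_str = min(rho * i // 4, n_half)   # lower bound on t_i^STR
        t_d = min(n_i, rho * i // 2)        # t_i^D from Lemma 1
        k = n_half / n_i if n_i else 0.0    # the k of Lemma 5
        rows.append((t_str, t_d, k))
    return rows

# Steady arrivals (4 points/step, rho = 8): per Lemma 5, STRSAGA is
# min{k, 1/2}-sample-competitive, which the tracked bounds satisfy.
for t_str, t_d, k in competitiveness([4] * 100, rho=8):
    assert t_str >= min(k, 0.5) * t_d
```

Swapping in a bursty sequence (e.g., all points in one step) shows how a small n_{i/2}/n_i ratio weakens the guarantee, matching the discussion of bursty arrivals.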
Hence, the effective sample size for STRSAGA at time i\nis: tSTR\nWe consider four cases.\ni = ni\ni /2. The other three cases,\nand tSTR\n(2) \u03c1i/4 < ni/2 and ni \u2265 \u03c1i/2, (3) \u03c1i/4 \u2265 ni/2 and ni < \u03c1i/2, and (4) \u03c1i/4 \u2265 ni/2 and ni \u2265 \u03c1i/2,\n(cid:3)\ncan be handled similarly.\nSkewed Arrivals with a Bounded Maximum. We next consider an arrival distribution parameter-\nized by integer M \u2265 \u03bb, where the number of arrivals per time step can either be high (M) or zero.\nMore precisely, xi = M with prob. \u03bb\nM . Thus, E[xi] = \u03bb. For M > \u03bb,\nthis models bursty arrival distributions with a number of \u201cquiet\u201d time steps with no arrivals, combined\nwith an occasional burst of M arrivals. We have the following result for skewed arrivals.\nLemma 6. For a skewed arrival distribution with maximum M and mean \u03bb, STRSAGA is 6\u03b1(2+o(1))-\nrisk-competitive to DYNASAGA(\u03c1), with probability at least 1 \u2212 \u0001, at any time step i > 16M\nAt a high level, the proof relies on showing sample-competitiveness of STRSAGA. For a time step i\ngreater than the threshold stated in the lemma, we can prove the concentration of ni and ni/2 using a\nChernoff bound. Using Lemma 5, STRSAGA and DYNASAGA(\u03c1) are 1\n6-sample-competitive, and the\nrisk-competitiveness follows from Lemma 4. Note that as M increases, arrivals become more bursty,\nand it takes longer for the algorithm to be competitive, with a high con\ufb01dence.\nGeneral Arrivals with a Bounded Maximum. We next consider a more general arrival distribution\nwith a maximum of M arrivals, and a mean of \u03bb. xi = j with probability pj for j = 0, . . . , M, such\n\nIn the \ufb01rst case, (1) if \u03c1i/4 < ni/2 and ni < \u03c1i/2, then tD\n\ni \u2265 \u03c1i/4. In this case, we have tSTR\n\ni \u2265 \u03c1i/4 > ni/2 = tD\n\nM and xi = 0 with prob. 1 \u2212 \u03bb\n\ni \u2265 min{\u03c1i/4, kni}. 
We know tD\n\ni = min{ni, \u03c1i/2}.\n\ni\n\n\u0001 .\n\u03bb ln 1\n\nthat(cid:80)\n\nj pj = 1 and E[xi] = \u03bb, for an integer M > 0.\n\n3 ) ln 1\n\nLemma 7. For a general arrival distribution with mean \u03bb and maximum M, at any time step\n\u0001 , STRSAGA is 8\u03b1(2 + o(1))-risk-competitive to DYNASAGA(\u03c1), with probability\n\u03bb + 8\ni > ( 16M\nat least 1 \u2212 \u0001.\nThe high-level proof sketch for this case is similar to the case of skewed arrivals. The technical aspect\nis that in order to prove concentration bounds for ni and ni/2, we use Bernstein\u2019s inequality [Mas07],\nwhich lets us bound the sum of independent random variables in a more \ufb02exible manner than Chernoff\nbounds (for random variables that are not necessarily binary valued), in conjunction with a bound on\nthe variance of the distribution. Proof details are in the supplementary material.\nGeneral Arrivals with an Unbounded Maximum. More generally, the number of arrivals in a time\nstep may not have a speci\ufb01ed maximum. The arrival distribution can have a \ufb01nite mean, despite a\n\n7\n\n\fparameter b: The random variable xi has mean \u03bb, variance \u03c32, and |E(cid:2)(xi \u2212 \u03bb)k(cid:3)| \u2264 1\n\nsmall probability of reaching arbitrarily large values. We consider a sub-class of such distributions\nwhere all the polynomial moments are bounded, as in the following Bernstein\u2019s condition with\n2 k!\u03c32bk\u22122\nfor all integers k \u2265 3 [Mas07].\nLemma 8. For any arrival distribution with mean \u03bb, bounded variance \u03c32 and satisfying Bern-\nstein\u2019s condition with parameter b, STRSAGA is 8\u03b1(2 + o(1))-risk-competitive to DYNASAGA(\u03c1), with\nprobability at least 1 \u2212 \u0001, at any time step i > max((16( \u03c3\nPoisson Arrivals. We next consider the case where the number of points arriving in each time step\nfollows a Poisson distribution with mean \u03bb, i.e., Pr [xi = k] = e\u2212\u03bb\u03bbk\nLemma 9. 
For Poisson arrival distribution with mean \u03bb, STRSAGA is 8\u03b1(2 + o(1))-risk-competitive\nto DYNASAGA(\u03c1) with probability at least 1 \u2212 \u0001, at any time step i > 16\nThe proof depends on a version of the Chernoff bounds tailored to the Poisson distribution\u2014further\ndetails are in the supplementary material.\n\nfor integer k \u2265 0.\n\n\u03bb )2 + 8\n\n\u0001 .\n\u03bb ln 1\n\nk!\n\n3 ) ln 1\n\n\u0001 , 2(( \u03c3\n\n\u03bb )2 + b\n\n\u03bb ) ln 1\n\n\u0001 ).\n\n6 Experimental results\n\nWe empirically con\ufb01rm the competitiveness of STRSAGA with the of\ufb02ine algorithm DYNASAGA(\u03c1)\nthrough a set of experiments on real world datasets streamed in under various arrival distributions.\nWe consider two optimization problems that arise in supervised learning, logistic regression (con-\nvex) and matrix factorization (nonconvex). For logistic regression, we use the A9A [DKT17] and\nRCV1.binary [LYRL04] datasets, and for matrix factorization, we use two datasets of user-item rat-\nings from Movielens [HK16]. More detail on the datasets are provided in the supplementary material.\nThese static training data are converted into streams, by ordering them by a random permutation, and\nde\ufb01ning an arrival rate \u03bb dependent on the dataset size. In our experiments, the training data arrives\nover the course of 100 time steps, with skewed arrivals parameterized by M = 8\u03bb. Experiments on\nPoisson arrivals are given in the supplementary material.\nAt each time step i, a streaming data algorithm has access to \u03c1 gradient computations to update the\nmodel; we show results for \u03c1/\u03bb = 1 and \u03c1/\u03bb = 5. We compare the sub-optimality of STRSAGA\nwith the of\ufb02ine algorithm DYNASAGA(\u03c1), which is run from scratch at each time i using \u03c1i steps\non Si. 
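The skewed arrival process used in these experiments can be simulated directly. The sketch below, a minimal illustration with hypothetical function names (not the paper's code), draws a skewed stream with x_i = M w.p. λ/M (else 0) and computes the Lemma 6 time-step threshold (16M/λ) ln(1/δ), here with illustrative values λ = 100, M = 8λ, δ = 0.01:

```python
import math
import random

def skewed_arrivals(lam, M, num_steps, rng):
    """Skewed arrival process from the analysis: at each time step,
    M points arrive with probability lam/M, otherwise none (mean lam)."""
    return [M if rng.random() < lam / M else 0 for _ in range(num_steps)]

def lemma6_threshold(lam, M, delta):
    """Time step after which Lemma 6 applies: i > (16*M/lam) * ln(1/delta)."""
    return (16 * M / lam) * math.log(1 / delta)

rng = random.Random(42)
lam, M, steps = 100, 8 * 100, 100_000   # M = 8*lam, as in the experiments
xs = skewed_arrivals(lam, M, steps, rng)
print(sum(xs) / steps)                  # empirical mean, close to lam = 100
print(lemma6_threshold(lam, M, 0.01))   # 128 * ln(100), roughly 589.5
```

For M = 8λ the threshold is independent of λ itself, which matches the lemma's dependence on the burstiness ratio M/λ.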
We also compare with two streaming data algorithms: SGD and, for the case ρ/λ = 1, the single-pass algorithm SSVRG.4 In the streaming data setting, in which we are not limited in storage and the available processing time ρ may permit revisiting points, our implementation of SGD needs clarification in its sampling procedure. We tried two sampling policies. In the first, at each time step i we sample points uniformly from S_i, the set of all points received through time step i. In the second, at each time step i we first visit points in S_i that have not been seen yet, and spend any remaining processing time sampling uniformly from all of S_i. In every case, the second method was better than or indistinguishable from the first, and so all of our results are based on the second method. For our implementation of SSVRG, we have relaxed the memory limitation of the original streaming algorithm by introducing a buffer to store points that have arrived but not yet been processed. With this additional storage, we allow SSVRG to make progress during time steps even when no new points arrive, and hence make for a fairer comparison when data points do not arrive at a steady rate.

The main results are summarized in Figure 2, showing the sub-optimality of each algorithm and the sample-competitive ratio for STRSAGA. Additional plots of the test loss are given in the supplementary material. The dips in the sample-competitive ratio represent the arrival of a large group of points, and correspondingly at those times the sub-optimality spikes, because many new points have been added to S_i that have yet to be processed. We observe that the sample-competitive ratio improves over the lifetime of the stream and tends towards 1, outperforming our pessimistic theoretical analysis.
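The second sampling policy for streaming SGD described above can be sketched as a simple per-time-step schedule. This is a minimal illustration with hypothetical names, not the paper's implementation: it returns the indices into S_i (of size n_i) to take gradient steps on, visiting unseen points first and spending any leftover budget on uniform revisits.

```python
import random

def sgd_sample_schedule(n_i, first_unseen, rho, rng):
    """One time step of the second sampling policy: with a budget of rho
    gradient computations, first visit the not-yet-seen points
    (indices first_unseen .. n_i - 1), then spend any remaining budget
    sampling uniformly from all of S_i. Returns the schedule of indices
    and the updated first-unseen pointer."""
    schedule = []
    while rho > 0 and first_unseen < n_i:
        schedule.append(first_unseen)    # fresh point, processed once
        first_unseen += 1
        rho -= 1
    schedule.extend(rng.randrange(n_i) for _ in range(rho))  # uniform revisits
    return schedule, first_unseen

rng = random.Random(0)
# 10 points so far, 3 of them unseen, budget of 5 gradient steps:
# the schedule starts with 7, 8, 9, then 2 uniform draws from 0..9.
sched, nxt = sgd_sample_schedule(n_i=10, first_unseen=7, rho=5, rng=rng)
print(sched, nxt)
```

When the budget is smaller than the number of unseen points, the policy degenerates to a pure first-in-first-out pass, which matches the intuition that revisiting only helps once the backlog of fresh points is cleared.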
Furthermore, as the sample-competitive ratio increases, the risk-competitiveness of STRSAGA improves, so that the sub-optimality of STRSAGA is comparable to that of the offline DYNASAGA(ρ), which is the best we can do given limited computational power. In Figure 2, we also observe that STRSAGA outperforms both our streaming data version of SGD, due to the faster convergence rate when using SAGA steps with reduced variance, and SSVRG, showing the benefit of revisiting data points, even when the processing rate is constrained at ρ = 1λ.

4 We consider SSVRG a ρ/λ = 1 algorithm because, for most data points it receives, it uses 1 gradient computation; only for an o(1) fraction of the data points does it require 2 gradient computations.

Figure 2: Sub-optimality under skewed arrivals with M = 8λ. Top row is processing rate ρ = 1λ, and bottom row is ρ = 5λ. The median is taken over 5 runs.

To better understand the impact of the skewed arrival distribution on the performance of STRSAGA, we ran three experiments, shown in Figure 3, with the following results. (1) As M/λ increases, the arrivals become more bursty and it takes longer for STRSAGA to be sample-competitive and, as a result, risk-competitive to DYNASAGA(ρ). Note that the far left endpoint, for skewed arrivals parameterized with M = λ, is the case of constant arrivals. (2) We observe that there is an intermediate point for ρ/λ where it is more difficult to be sample-competitive, but at the extremes the ratio tends towards 1. This is because for large ρ/λ, whenever a big group of points arrives they can all be processed quickly.
On the other hand, for small ρ/λ, at any time i, both STRSAGA and the offline algorithm are still processing points that arrived at some time significantly before i, and so a large variance in the number of fresh arrivals at the tail of the stream can be tolerated. (3) The bound on sub-optimality we showed earlier depends on the number of data points processed so far. As time passes and STRSAGA sees more data points, its sub-optimality on S_i improves. Additionally, as ρ/λ increases, STRSAGA has more steps available to incorporate newly arrived data points and becomes more resilient to bursty arrivals.

Figure 3: Sensitivity analysis. The first plot varies the skew M/λ for a fixed processing rate ρ/λ, and the second two plots vary the processing rate for a fixed skew. Results are plotted for time steps i = 25, 50, 75, 100 over a stream of the RCV dataset of 100 time steps. The median is taken over 9 runs.

7 Conclusion

We considered the ongoing maintenance of a model over data points that arrive over time, according to an unknown arrival distribution. We presented STRSAGA, and showed through both analysis and experiments that, for various arrival distributions, (i) its empirical risk over the data arriving so far is close to that of the empirical risk minimizer over the same data, (ii) it is competitive with a state-of-the-art offline algorithm, DYNASAGA, and (iii) it significantly outperforms streaming data versions of both SGD and SSVRG.
We conclude that STRSAGA should be the algorithm of choice for variance-reduced SGD on streaming data in the setting where memory is not limited.

References

[BB07] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2007.

[BDR15] B. Bercu, B. Delyon, and E. Rio. Concentration inequalities for sums and martingales. Springer, 2015.

[Ber16] D. P. Bertsekas. Nonlinear Programming (3rd Ed.). Athena Scientific, 2016.

[BL03] L. Bottou and Y. LeCun. Large scale online learning. In NIPS, pages 217–224, 2003.

[DBLJ14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.

[DKT17] D. Dua and E. Karra Taniskidou. UCI machine learning repository, 2017.

[DLH16] H. Daneshmand, A. Lucchi, and T. Hofmann. Starting small: Learning with adaptive sample sizes. In ICML, pages 1463–1471, 2016.

[FGKS15] R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Competing with the empirical risk minimizer in a single pass. In COLT, pages 728–763, 2015.

[HK16] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM TIIS, 5(4):19, 2016.

[JZ13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.

[KR13] J. Konecny and P. Richtarik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.

[LYRL04] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5(Apr):361–397, 2004.

[Mas07] P. Massart. Concentration inequalities and model selection. Springer, 2007.

[MU17] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.

[RM51] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[RSB12] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671, 2012.

[SAKS16] V. Shah, M. Asteris, A. Kyrillidis, and S. Sanghavi. Trading-off variance and complexity in stochastic gradient descent. arXiv preprint arXiv:1603.06861, 2016.