{"title": "Leveraged volume sampling for linear regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2505, "page_last": 2514, "abstract": "Suppose an n x d design matrix in a linear regression problem is given, \nbut the response for each point is hidden unless explicitly requested. \nThe goal is to sample only a small number k << n of the responses, \nand then produce a weight vector whose sum of squares loss over *all* points is at most 1+epsilon times the minimum. \nWhen k is very small (e.g., k=d), jointly sampling diverse subsets of\npoints is crucial. One such method called \"volume sampling\" has\na unique and desirable property that the weight vector it produces is an unbiased\nestimate of the optimum. It is therefore natural to ask if this method\noffers the optimal unbiased estimate in terms of the number of\nresponses k needed to achieve a 1+epsilon loss approximation.\n\nSurprisingly we show that volume sampling can have poor behavior when\nwe require a very accurate approximation -- indeed worse than some\ni.i.d. sampling techniques whose estimates are biased, such as\nleverage score sampling. \nWe then develop a new rescaled variant of volume sampling that\nproduces an unbiased estimate which avoids\nthis bad behavior and has at least as good a tail bound as leverage\nscore sampling: sample size k=O(d log d + d/epsilon) suffices to\nguarantee total loss at most 1+epsilon times the minimum\nwith high probability. 
Thus, we improve on the best previously known\nsample size for an unbiased estimator, k=O(d^2/epsilon).\n\nOur rescaling procedure leads to a new efficient algorithm\nfor volume sampling which is based\non a \"determinantal rejection sampling\" technique with\npotentially broader applications to determinantal point processes.\nOther contributions include introducing the\ncombinatorics needed for rescaled volume sampling and developing tail\nbounds for sums of dependent random matrices which arise in the\nprocess.", "full_text": "Leveraged volume sampling for linear regression\n\nMicha\u0142 Derezi\u00b4nski and Manfred K. Warmuth\n\nDepartment of Computer Science\nUniversity of California, Santa Cruz\n\nmderezin@berkeley.edu, manfred@ucsc.edu\n\nDaniel Hsu\n\nComputer Science Department\nColumbia University, New York\n\ndjhsu@cs.columbia.edu\n\nAbstract\n\nSuppose an n \u21e5 d design matrix in a linear regression problem is given, but the\nresponse for each point is hidden unless explicitly requested. The goal is to sample\nonly a small number k \u2327 n of the responses, and then produce a weight vector\nwhose sum of squares loss over all points is at most 1 + \u270f times the minimum.\nWhen k is very small (e.g., k = d), jointly sampling diverse subsets of points\nis crucial. One such method called volume sampling has a unique and desirable\nproperty that the weight vector it produces is an unbiased estimate of the optimum.\nIt is therefore natural to ask if this method offers the optimal unbiased estimate in\nterms of the number of responses k needed to achieve a 1 + \u270f loss approximation.\nSurprisingly we show that volume sampling can have poor behavior when we\nrequire a very accurate approximation \u2013 indeed worse than some i.i.d. sampling\ntechniques whose estimates are biased, such as leverage score sampling. 
We then\ndevelop a new rescaled variant of volume sampling that produces an unbiased\nestimate which avoids this bad behavior and has at least as good a tail bound as\nleverage score sampling: sample size k = O(d log d + d/\u270f) suf\ufb01ces to guarantee\ntotal loss at most 1 + \u270f times the minimum with high probability. Thus we improve\non the best previously known sample size for an unbiased estimator, k = O(d2/\u270f).\nOur rescaling procedure leads to a new ef\ufb01cient algorithm for volume sampling\nwhich is based on a determinantal rejection sampling technique with potentially\nbroader applications to determinantal point processes. Other contributions include\nintroducing the combinatorics needed for rescaled volume sampling and developing\ntail bounds for sums of dependent random matrices which arise in the process.\n\n1\n\nIntroduction\n\nConsider a linear regression problem where the input points in Rd are provided, but the associated\nresponse for each point is withheld unless explicitly requested. The goal is to sample the responses\nfor just a small subset of inputs, and then produce a weight vector whose total square loss on all n\npoints is at most 1 + \u270f times that of the optimum.1 This scenario is relevant in many applications\nwhere data points are cheap to obtain but responses are expensive. Surprisingly, with the aid of\nhaving all input points available, such multiplicative loss bounds are achievable without any range\ndependence on the points or responses common in on-line learning [see, e.g., 8].\nA natural and intuitive approach to this problem is volume sampling, since it prefers \u201cdiverse\u201d sets of\npoints that will likely result in a weight vector with low total loss, regardless of what the corresponding\nresponses turn out to be [11]. 
Volume sampling is closely related to optimal design criteria [18, 26],\nwhich are appropriate under statistical models of the responses; here we study a worst-case setting\nwhere the algorithm must use randomization to guard itself against worst-case responses.\n\n1The total loss being 1 + \u270f times the optimum is the same as the regret being \u270f times the optimum.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fVolume sampling and related determinantal point processes are employed in many machine learning\nand statistical contexts, including linear regression [11, 13, 26], clustering and matrix approxima-\ntion [4, 14, 15], summarization and information retrieval [19, 23, 24], and fairness [6, 7]. The\navailability of fast algorithms for volume sampling [11, 26] has made it an important technique in the\nalgorithmic toolbox alongside i.i.d. leverage score sampling [17] and spectral sparsi\ufb01cation [5, 25].\nIt is therefore surprising that using volume sampling in the context of linear regression, as suggested\nin previous works [11, 26], may lead to suboptimal performance. We construct an example in which,\neven after sampling up to half of the responses, the loss of the weight vector from volume sampling\nis a \ufb01xed factor >1 larger than the minimum loss. Indeed, this poor behavior arises because for\nany sample size >d, the marginal probabilities from volume sampling are a mixture of uniform\nprobabilities and leverage score probabilities, and uniform sampling is well-known to be suboptimal\nwhen the leverage scores are highly non-uniform.\nA possible recourse is to abandon volume sampling in\nfavor of leverage score sampling [17, 33]. However,\nall i.i.d. 
sampling methods, including leverage score\nsampling, suffer from a coupon collector problem that\nprevents their effective use at small sample sizes [13].\nMoreover, the resulting weight vectors are biased\n(when regarded as estimators for the least squares\nsolution based on all responses). This is a nuisance\nwhen averaging multiple solutions (e.g., as produced\nin distributed settings). In contrast, volume sampling\noffers multiplicative loss bounds even with sample\nsizes as small as d and it is the only known non-trivial\nmethod that gives unbiased weight vectors [11].\nWe develop a new solution, called leveraged volume sampling, that retains the aforementioned bene\ufb01ts\nof volume sampling while avoiding its \ufb02aws. Speci\ufb01cally, we propose a variant of volume sampling\nbased on rescaling the input points to \u201ccorrect\u201d the resulting marginals. On the algorithmic side, this\nleads to a new determinantal rejection sampling procedure which offers signi\ufb01cant computational\nadvantages over existing volume sampling algorithms, while at the same time being strikingly simple\nto implement. We prove that this new sampling scheme retains the bene\ufb01ts of volume sampling (like\nunbiasedness) but avoids the bad behavior demonstrated in our lower bound example. Along the\nway, we prove a new generalization of the Cauchy-Binet formula, which is needed for the rejection\nsampling denominator. Finally, we develop a new method for proving matrix tail bounds for leveraged\nvolume sampling. Our analysis shows that the unbiased least-squares estimator constructed this way\nachieves a 1 + \u270f approximation factor from a sample of size O(d log d + d/\u270f), addressing an open\nquestion posed by [11].\n\nFigure 1: Plots of the total loss for the sam-\npling methods (averaged over 100 runs) ver-\nsus sample size (shading is standard error) for\nthe libsvm dataset cpusmall [9].\n\nExperiments. 
Figure 1 presents experimental evidence on a benchmark dataset (cpusmall from the\nlibsvm collection [9]) that the potential bad behavior of volume sampling proven in our lower bound\ndoes occur in practice. Appendix E shows more datasets and a detailed discussion of the experiments.\nIn summary, leveraged volume sampling avoids the bad behavior of standard volume sampling, and\nperforms considerably better than leverage score sampling, especially for small sample sizes k.\n\nRelated work. Despite the ubiquity of volume sampling in many contexts already mentioned above,\nit has only recently been analyzed for linear regression. Focusing on small sample sizes, [11] proved\nmultiplicative bounds for the expected loss of size k = d volume sampling. Because the estimators\nproduced by volume sampling are unbiased, averaging a number of such estimators produced an\nestimator based on a sample of size k = O(d2/\u270f) with expected loss at most 1 + \u270f times the optimum.\nIt was shown in [13] that if the responses are assumed to be linear functions of the input points\nplus white noise, then size k = O(d/\u270f) volume sampling suf\ufb01ces for obtaining the same expected\nbounds. These noise assumptions on the response vector are also central to the task of A-optimal\ndesign, where volume sampling is a key technique [2, 18, 28, 29]. All of these previous results were\nconcerned with bounds that hold in expectation; it is natural to ask if similar (or better) bounds can\nalso be shown to hold with high probability, without noise assumptions. Concentration bounds for\nvolume sampling and other strong Rayleigh measures were studied in [30], but these results are not\nsuf\ufb01cient to obtain the tail bounds for volume sampling.\n\n2\n\n\fOther techniques applicable to our linear regression problem include leverage score sampling [17]\nand spectral sparsi\ufb01cation [5, 25]. Leverage score sampling is an i.i.d. 
sampling procedure which\nachieves tail bounds matching the ones we obtain here for leveraged volume sampling, however it\nproduces biased weight vectors and experimental results (see [13] and Appendix E) show that it\nhas weaker performance for small sample sizes. A different and more elaborate sampling technique\nbased on spectral sparsi\ufb01cation [5, 25] was recently shown to be effective for linear regression [10],\nhowever this method also does not produce unbiased estimates, which is a primary concern of this\npaper and desirable in many settings. Unbiasedness seems to require delicate control of the sampling\nprobabilities, which we achieve using determinantal rejection sampling.\n\nOutline and contributions. We set up our task of subsampling for linear regression in the next\nsection and present our lower bound for standard volume sampling. A new variant of rescaled volume\nsampling is introduced in Section 3. We develop techniques for proving matrix expectation formulas\nfor this variant which show that for any rescaling the weight vector produced for the subproblem is\nunbiased.\nNext, we show that when rescaling with leverage scores, then a new algorithm based on rejection\nsampling is surprisingly ef\ufb01cient (Section 4): Other than the preprocessing step of computing leverage\nscores, the runtime does not depend on n (a major improvement over existing volume sampling\nalgorithms). Then, in Section 4.1 we prove multiplicative loss bounds for leveraged volume sampling\nby establishing two important properties which are hard to prove for joint sampling procedures. We\nconclude in Section 5 with an open problem and with a discussion of how rescaling with approximate\nleverage scores gives further time improvements for constructing an unbiased estimator.\n\n2 Volume sampling for linear regression\n\nIn this section, we describe our linear regression setting, and review the guarantees that standard\nvolume sampling offers in this context. 
Then, we present a surprising lower bound which shows that\nunder worst-case data, this method can exhibit undesirable behavior.\n\n2.1 Setting\nSuppose the learner is given $n$ input vectors $x_1, \\ldots, x_n \\in \\mathbb{R}^d$, which are arranged as the rows of an\n$n \\times d$ input matrix $X$. Each input vector $x_i$ has an associated response variable $y_i \\in \\mathbb{R}$ from the\nresponse vector $y \\in \\mathbb{R}^n$. The goal of the learner is to find a weight vector $w \\in \\mathbb{R}^d$ that minimizes\nthe square loss:\n\n$$w^* \\overset{\\text{def}}{=} \\operatorname*{argmin}_{w \\in \\mathbb{R}^d} L(w), \\quad \\text{where} \\quad L(w) \\overset{\\text{def}}{=} \\sum_{i=1}^n (x_i^\\top w - y_i)^2 = \\|Xw - y\\|^2.$$\n\nGiven both matrix $X$ and vector $y$, the least squares solution can be directly computed as $w^* = X^+ y$,\nwhere $X^+$ is the pseudo-inverse. Throughout the paper we assume w.l.o.g. that $X$ has (full) rank $d$ (see footnote 2).\nIn our setting, the learner is initially given the entire input matrix $X$, while response vector $y$ remains\nhidden. The learner is then allowed to select a subset $S$ of row indices in $[n] = \\{1, \\ldots, n\\}$ for\nwhich the corresponding responses $y_i$ are revealed. The learner next constructs an estimate $\\widehat{w}$ of $w^*$\nusing matrix $X$ and the partial vector of observed responses. Finally, the learner is evaluated by the\nloss over all rows of $X$ (including the ones with unobserved responses), and the goal is to obtain a\nmultiplicative loss bound, i.e., that for some $\\epsilon > 0$,\n\n$$L(\\widehat{w}) \\le (1 + \\epsilon)\\, L(w^*).$$\n\nFootnote 2: Otherwise just reduce $X$ to a subset of independent columns. Also assume $X$ has no rows of all zeros (every\nweight vector has the same loss on such rows, so they can be removed).\n\n2.2 Standard volume sampling\nGiven $X \\in \\mathbb{R}^{n \\times d}$ and a size $k \\ge d$, standard volume sampling jointly chooses a set $S$ of $k$ indices in\n$[n]$ with probability\n\n$$\\Pr(S) = \\frac{\\det(X_S^\\top X_S)}{\\binom{n-d}{k-d} \\det(X^\\top X)},$$\n\nwhere $X_S$ is the submatrix of the rows from $X$ indexed by the set $S$. 
The learner then obtains\nthe responses $y_i$, for $i \\in S$, and uses the optimum solution $w_S^* = (X_S)^+ y_S$ for the subproblem\n$(X_S, y_S)$ as its weight vector. The sampling procedure can be performed using reverse iterative\nsampling (shown below), which, if carefully implemented, takes $O(nd^2)$ time (see [11, 13]).\n\nReverse iterative sampling\nVolumeSample$(X, k)$:\n  $S \\leftarrow [n]$\n  while $|S| > k$\n    $\\forall i \\in S$: $q_i \\leftarrow \\det(X_{S \\setminus i}^\\top X_{S \\setminus i}) / \\det(X_S^\\top X_S)$\n    Sample $i \\propto q_i$ out of $S$\n    $S \\leftarrow S \\setminus \\{i\\}$\n  end\n  return $S$\n\nThe key property (unique to volume sampling) is that the subsampled estimator $w_S^*$ is unbiased, i.e.\n\n$$\\mathbb{E}[w_S^*] = w^*, \\quad \\text{where} \\quad w^* = \\operatorname*{argmin}_w L(w).$$\n\nAs discussed in [11], this property has important practical implications in distributed settings: Mixtures of unbiased estimators remain\nunbiased (and can conveniently be used to reduce variance). Also\nif the rows of $X$ are in general position, then for volume sampling\n\n$$\\mathbb{E}\\big[(X_S^\\top X_S)^{-1}\\big] = \\frac{n - d + 1}{k - d + 1}\\, (X^\\top X)^{-1}. \\qquad (1)$$\n\nThis is important because in A-optimal design bounding $\\operatorname{tr}((X_S^\\top X_S)^{-1})$ is the main concern. Given\nthese direct connections of volume sampling to linear regression, it is natural to ask whether this\ndistribution achieves a loss bound of $(1 + \\epsilon)$ times the optimum for small sample sizes $k$.\n\n2.3 Lower bound for standard volume sampling\nWe show that standard volume sampling cannot guarantee $1 + \\epsilon$ multiplicative loss bounds on some\ninstances, unless over half of the rows are chosen to be in the subsample.\nTheorem 1 Let $(X, y)$ be an $n \\times d$ least squares problem, such that\n\n$$X = \\begin{pmatrix} I_{d \\times d} \\\\ \\gamma\\, I_{d \\times d} \\\\ \\vdots \\\\ \\gamma\\, I_{d \\times d} \\end{pmatrix}, \\qquad y = \\begin{pmatrix} 1_d \\\\ 0_d \\\\ \\vdots \\\\ 0_d \\end{pmatrix}, \\qquad \\text{where } \\gamma > 0.$$\n\nLet $w_S^* = (X_S)^+ y_S$ be obtained from size $k$ volume sampling for $(X, y)$. 
Then,\n\n$$\\lim_{\\gamma \\to 0} \\frac{\\mathbb{E}[L(w_S^*)]}{L(w^*)} \\ge 1 + \\frac{n - k}{n - d}, \\qquad (2)$$\n\nand there is a $\\gamma > 0$ such that for any $k \\le \\frac{n}{2}$,\n\n$$\\Pr\\Big(L(w_S^*) \\ge \\big(1 + \\tfrac{1}{2}\\big) L(w^*)\\Big) > \\frac{1}{4}. \\qquad (3)$$\n\nProof In Appendix A we show (2), and that for the chosen $(X, y)$ we have $L(w^*) = \\sum_{i=1}^d (1 - l_i)$\n(see (8)), where $l_i = x_i^\\top (X^\\top X)^{-1} x_i$ is the $i$-th leverage score of $X$. Here, we show (3). The\nmarginal probability of the $i$-th row under volume sampling (as given by [12]) is\n\n$$\\Pr(i \\in S) = \\theta\\, l_i + (1 - \\theta) \\cdot 1 = 1 - \\theta\\, (1 - l_i), \\quad \\text{where} \\quad \\theta = \\frac{n - k}{n - d}. \\qquad (4)$$\n\nNext, we bound the probability that all of the first $d$ input vectors were selected by volume sampling:\n\n$$\\Pr\\big([d] \\subseteq S\\big) \\overset{(*)}{\\le} \\prod_{i=1}^d \\Pr(i \\in S) = \\prod_{i=1}^d \\Big(1 - \\frac{n - k}{n - d} (1 - l_i)\\Big) \\le \\exp\\Big(-\\frac{n - k}{n - d} \\overbrace{\\sum_{i=1}^d (1 - l_i)}^{L(w^*)}\\Big),$$\n\nwhere $(*)$ follows from negative associativity of volume sampling (see [26]). If for some $i \\in [d]$ we\nhave $i \\notin S$, then $L(w_S^*) \\ge 1$. So for $\\gamma$ such that $L(w^*) = \\frac{2}{3}$ and any $k \\le \\frac{n}{2}$:\n\n$$\\Pr\\Big(L(w_S^*) \\ge \\big(1 + \\tfrac{1}{2}\\big) \\overbrace{L(w^*)}^{2/3}\\Big) \\ge 1 - \\exp\\Big(-\\frac{n - k}{n - d} \\cdot \\frac{2}{3}\\Big) \\ge 1 - \\exp\\Big(-\\frac{1}{2} \\cdot \\frac{2}{3}\\Big) > \\frac{1}{4}.$$\n\nNote that this lower bound only makes use of the negative associativity of volume sampling and the\nform of the marginals. However the tail bounds we prove in Section 4.1 rely on more subtle properties\nof volume sampling. We begin by creating a variant of volume sampling with rescaled marginals.\n\n3 Rescaled volume sampling\nGiven any size $k \\ge d$, our goal is to jointly sample $k$ row indices $\\pi_1, \\ldots, \\pi_k$ with replacement\n(instead of a subset $S$ of $[n]$ of size $k$, we get a sequence $\\pi \\in [n]^k$). 
The second difference to standard\nvolume sampling is that we rescale the $i$-th row (and response) by $\\frac{1}{\\sqrt{q_i}}$, where $q = (q_1, \\ldots, q_n)$ is any\ndiscrete distribution over the set of row indices $[n]$, such that $\\sum_{i=1}^n q_i = 1$ and $q_i > 0$ for all $i \\in [n]$.\nWe now define $q$-rescaled size $k$ volume sampling as a joint sampling distribution over $\\pi \\in [n]^k$, s.t.\n\n$$q\\text{-rescaled size } k \\text{ volume sampling:} \\qquad \\Pr(\\pi) \\sim \\det\\Big(\\sum_{i=1}^k \\frac{1}{q_{\\pi_i}} x_{\\pi_i} x_{\\pi_i}^\\top\\Big) \\prod_{i=1}^k q_{\\pi_i}. \\qquad (5)$$\n\nUsing the following rescaling matrix $Q_\\pi \\overset{\\text{def}}{=} \\sum_{i=1}^{|\\pi|} \\frac{1}{q_{\\pi_i}} e_{\\pi_i} e_{\\pi_i}^\\top \\in \\mathbb{R}^{n \\times n}$, we rewrite the determinant\nas $\\det(X^\\top Q_\\pi X)$. As in standard volume sampling, the normalization factor in rescaled volume\nsampling can be given in a closed form through a novel extension of the Cauchy-Binet formula (proof\nin Appendix B.1).\n\nProposition 2 For any $X \\in \\mathbb{R}^{n \\times d}$, $k \\ge d$ and $q_1, \\ldots, q_n > 0$, such that $\\sum_{i=1}^n q_i = 1$, we have\n\n$$\\sum_{\\pi \\in [n]^k} \\det(X^\\top Q_\\pi X) \\prod_{i=1}^k q_{\\pi_i} = k(k-1) \\cdots (k-d+1) \\det(X^\\top X).$$\n\nGiven a matrix $X \\in \\mathbb{R}^{n \\times d}$, vector $y \\in \\mathbb{R}^n$ and a sequence $\\pi \\in [n]^k$, we are interested in a\nleast-squares problem $(Q_\\pi^{1/2} X, Q_\\pi^{1/2} y)$, which selects instances indexed by $\\pi$, and rescales each of them\nby the corresponding $1/\\sqrt{q_i}$. This leads to a natural subsampled least squares estimator\n\n$$w_\\pi^* = \\operatorname*{argmin}_w \\sum_{i=1}^k \\frac{1}{q_{\\pi_i}} \\big(x_{\\pi_i}^\\top w - y_{\\pi_i}\\big)^2 = (Q_\\pi^{1/2} X)^+ Q_\\pi^{1/2} y.$$\n\nThe key property of standard volume sampling is that the subsampled least-squares estimator is\nunbiased. Surprisingly this property is retained for any $q$-rescaled volume sampling (proof in Section\n3.1). 
As we shall see, this will give us great leeway for choosing $q$ to optimize our algorithms.\nTheorem 3 Given a full rank $X \\in \\mathbb{R}^{n \\times d}$ and a response vector $y \\in \\mathbb{R}^n$, for any $q$ as above, if $\\pi$ is\nsampled according to (5), then\n\n$$\\mathbb{E}[w_\\pi^*] = w^*, \\quad \\text{where} \\quad w^* = \\operatorname*{argmin}_w \\|Xw - y\\|^2.$$\n\nThe matrix expectation equation (1) for standard volume sampling (discussed in Section 2) has a\nnatural extension to any rescaled volume sampling, but now the equality turns into an inequality\n(proof in Appendix B.2):\nTheorem 4 Given a full rank $X \\in \\mathbb{R}^{n \\times d}$ and any $q$ as above, if $\\pi$ is sampled according to (5), then\n\n$$\\mathbb{E}\\big[(X^\\top Q_\\pi X)^{-1}\\big] \\preceq \\frac{1}{k - d + 1} (X^\\top X)^{-1}.$$\n\n3.1 Proof of Theorem 3\nWe show that the least-squares estimator $w_\\pi^* = (Q_\\pi^{1/2} X)^+ Q_\\pi^{1/2} y$ produced from any $q$-rescaled volume\nsampling is unbiased, illustrating a proof technique which is also useful for showing Theorem 4,\nas well as Propositions 2 and 5. The key idea is to apply the pseudo-inverse expectation formula for\nstandard volume sampling (see e.g., [11]) first on the subsampled estimator $w_\\pi^*$, and then again on\nthe full estimator $w^*$. In the first step, this formula states:\n\n$$\\overbrace{(Q_\\pi^{1/2} X)^+ Q_\\pi^{1/2} y}^{w_\\pi^*} = \\sum_{S \\in \\binom{[k]}{d}} \\frac{\\det(X^\\top Q_{\\pi_S} X)}{\\det(X^\\top Q_\\pi X)} \\overbrace{(Q_{\\pi_S}^{1/2} X)^+ Q_{\\pi_S}^{1/2} y}^{w_{\\pi_S}^*},$$\n\nwhere $\\binom{[k]}{d} \\overset{\\text{def}}{=} \\{S \\subseteq \\{1, \\ldots, k\\} : |S| = d\\}$ and $\\pi_S$ denotes a subsequence of $\\pi$ indexed by the\nelements of set $S$. Note that since $S$ is of size $d$, we can decompose the determinant:\n\n$$\\det(X^\\top Q_{\\pi_S} X) = \\det(X_{\\pi_S})^2 \\prod_{i \\in S} \\frac{1}{q_{\\pi_i}}.$$\n\nWhenever this determinant is non-zero, $w_{\\pi_S}^*$ is the exact solution of a system of $d$ linear equations:\n\n$$\\frac{1}{\\sqrt{q_{\\pi_i}}} x_{\\pi_i}^\\top w = \\frac{1}{\\sqrt{q_{\\pi_i}}} y_{\\pi_i}, \\quad \\text{for} \\quad i \\in S.$$\n\nThus, the rescaling of each equation by $\\frac{1}{\\sqrt{q_{\\pi_i}}}$ cancels out, and we can simply write $w_{\\pi_S}^* = (X_{\\pi_S})^+ y_{\\pi_S}$.\n(Note that this is not the case for sets larger than $d$ whenever the optimum solution\nincurs positive loss.) We now proceed with summing over all $\\pi \\in [n]^k$. Following Proposition 2,\nwe define the normalization constant as $Z = d! \\binom{k}{d} \\det(X^\\top X)$, and obtain:\n\n$$\\begin{aligned} Z\\, \\mathbb{E}[w_\\pi^*] &= \\sum_{\\pi \\in [n]^k} \\Big(\\prod_{i=1}^k q_{\\pi_i}\\Big) \\det(X^\\top Q_\\pi X)\\, w_\\pi^* = \\sum_{\\pi \\in [n]^k} \\sum_{S \\in \\binom{[k]}{d}} \\Big(\\prod_{i \\in [k] \\setminus S} q_{\\pi_i}\\Big) \\det(X_{\\pi_S})^2 (X_{\\pi_S})^+ y_{\\pi_S} \\\\ &\\overset{(1)}{=} \\binom{k}{d} \\sum_{\\bar\\pi \\in [n]^d} \\det(X_{\\bar\\pi})^2 (X_{\\bar\\pi})^+ y_{\\bar\\pi} \\sum_{\\tilde\\pi \\in [n]^{k-d}} \\prod_{i=1}^{k-d} q_{\\tilde\\pi_i} \\\\ &\\overset{(2)}{=} \\binom{k}{d} d! \\sum_{S \\in \\binom{[n]}{d}} \\det(X_S)^2 (X_S)^+ y_S \\Big(\\sum_{i=1}^n q_i\\Big)^{k-d} \\\\ &\\overset{(3)}{=} \\overbrace{\\binom{k}{d} d! \\det(X^\\top X)}^{Z}\\, w^*. \\end{aligned}$$\n\nNote that in (1) we separate $\\pi$ into two parts, $\\bar\\pi$ and $\\tilde\\pi$ (respectively, for subsets $S$ and $[k] \\setminus S$), and\nsum over them separately. The binomial coefficient $\\binom{k}{d}$ counts the number of ways that $S$ can be\n\u201cplaced into\u201d the sequence $\\pi$. In (2) we observe that whenever $\\bar\\pi$ has repetitions, determinant $\\det(X_{\\bar\\pi})$\nis zero, so we can switch to summing over sets. Finally, (3) again uses the standard size $d$ volume\nsampling unbiasedness formula, now for the least-squares task $(X, y)$, and the fact that the $q_i$ sum to 1.\n\n4 Leveraged volume sampling: a natural rescaling\n\nRescaled volume sampling can be viewed as selecting a sequence $\\pi$ of $k$ rank-1 matrices from the\ncovariance matrix $X^\\top X = \\sum_{i=1}^n x_i x_i^\\top$. If $\\pi_1, \\ldots, \\pi_k$\nare sampled i.i.d. from $q$, i.e., $\\Pr(\\pi) = \\prod_{i=1}^k q_{\\pi_i}$,\nthen matrix $\\frac{1}{k} X^\\top Q_\\pi X$ is an unbiased estimator\nof the covariance matrix because $\\mathbb{E}[q_{\\pi_i}^{-1} x_{\\pi_i} x_{\\pi_i}^\\top] = X^\\top X$.\nIn rescaled volume sampling (5), $\\Pr(\\pi) \\sim \\big(\\prod_{i=1}^k q_{\\pi_i}\\big) \\frac{\\det(X^\\top Q_\\pi X)}{\\det(X^\\top X)}$, and the latter volume ratio\nintroduces a bias to that estimator. However, we show\nthat this bias vanishes when $q$ is exactly proportional to the leverage scores (proof in Appendix B.3).\nProposition 5 For any $q$ and $X$ as before, if $\\pi \\in [n]^k$ is sampled according to (5), then\n\n$$\\mathbb{E}[Q_\\pi] = (k - d)\\, I + \\operatorname{diag}\\Big(\\frac{l_1}{q_1}, \\ldots, \\frac{l_n}{q_n}\\Big), \\quad \\text{where} \\quad l_i \\overset{\\text{def}}{=} x_i^\\top (X^\\top X)^{-1} x_i.$$\n\nIn particular, $\\mathbb{E}[\\frac{1}{k} X^\\top Q_\\pi X] = X^\\top \\mathbb{E}[\\frac{1}{k} Q_\\pi] X = X^\\top X$ if and only if $q_i = \\frac{l_i}{d} > 0$ for all $i \\in [n]$.\nThis special rescaling, which we call leveraged volume sampling, has other remarkable properties.\nMost importantly, it leads to a simple and efficient algorithm we call determinantal rejection sampling:\nRepeatedly sample $O(d^2)$ indices $\\pi_1, \\ldots, \\pi_s$ i.i.d. from $q = (\\frac{l_1}{d}, \\ldots, \\frac{l_n}{d})$, and accept the sample\nwith probability proportional to its volume ratio. Having obtained a sample, we can further reduce its\nsize via reverse iterative sampling. We show next that this procedure not only returns a $q$-rescaled\nvolume sample but also, because $q$ is proportional to the leverage scores, requires\n(surprisingly) only a constant number of iterations of rejection sampling with high probability.\n\nDeterminantal rejection sampling\n1: Input: $X \\in \\mathbb{R}^{n \\times d}$, $q = (\\frac{l_1}{d}, \\ldots, \\frac{l_n}{d})$, $k \\ge d$\n2: $s \\leftarrow \\max\\{k, 4d^2\\}$\n3: repeat\n4:   Sample $\\pi_1, \\ldots, \\pi_s$ i.i.d. $\\sim (q_1, \\ldots, q_n)$\n5:   Sample Accept $\\sim \\text{Bernoulli}\\Big(\\frac{\\det(\\frac{1}{s} X^\\top Q_\\pi X)}{\\det(X^\\top X)}\\Big)$\n6: until Accept = true\n7: $S \\leftarrow$ VolumeSample$\\big((Q_{[1..n]}^{1/2} X)_\\pi, k\\big)$\n8: return $\\pi_S$\n\n
Theorem 6 Given the leverage score distribution $q = (\\frac{l_1}{d}, \\ldots, \\frac{l_n}{d})$ and the determinant $\\det(X^\\top X)$\nfor matrix $X \\in \\mathbb{R}^{n \\times d}$, determinantal rejection sampling returns sequence $\\pi_S$ distributed according\nto leveraged volume sampling, and w.p. at least $1 - \\delta$ finishes in time $O((d^2 + k) d^2 \\ln(\\frac{1}{\\delta}))$.\nProof We use a composition property of rescaled volume sampling (proof in Appendix B.4):\nLemma 7 Consider the following sampling procedure, for $s > k$:\n\n$$\\begin{aligned} \\pi &\\overset{s}{\\sim} X && (q\\text{-rescaled size } s \\text{ volume sampling}), \\\\ S &\\overset{k}{\\sim} \\begin{pmatrix} \\frac{1}{\\sqrt{q_{\\pi_1}}} x_{\\pi_1}^\\top \\\\ \\vdots \\\\ \\frac{1}{\\sqrt{q_{\\pi_s}}} x_{\\pi_s}^\\top \\end{pmatrix} = \\big(Q_{[1..n]}^{1/2} X\\big)_\\pi && (\\text{standard size } k \\text{ volume sampling}). \\end{aligned}$$\n\nThen $\\pi_S$ is distributed according to $q$-rescaled size $k$ volume sampling from $X$.\nFirst, we show that the rejection sampling probability in line 5 of the algorithm is bounded by 1:\n\n$$\\begin{aligned} \\frac{\\det(\\frac{1}{s} X^\\top Q_\\pi X)}{\\det(X^\\top X)} &= \\det\\Big(\\frac{1}{s} X^\\top Q_\\pi X (X^\\top X)^{-1}\\Big) \\overset{(*)}{\\le} \\bigg(\\frac{1}{d} \\operatorname{tr}\\Big(\\frac{1}{s} X^\\top Q_\\pi X (X^\\top X)^{-1}\\Big)\\bigg)^d \\\\ &= \\Big(\\frac{1}{ds} \\operatorname{tr}\\big(Q_\\pi X (X^\\top X)^{-1} X^\\top\\big)\\Big)^d = \\Big(\\frac{1}{ds} \\sum_{i=1}^s \\frac{d}{l_{\\pi_i}} x_{\\pi_i}^\\top (X^\\top X)^{-1} x_{\\pi_i}\\Big)^d = 1, \\end{aligned}$$\n\nwhere $(*)$ follows from the geometric-arithmetic mean inequality for the eigenvalues of the underlying\nmatrix. This shows that sequence $\\pi$ is drawn according to $q$-rescaled volume sampling of size $s$. Now,\nLemma 7 implies correctness of the algorithm. Next, we use Proposition 2 to compute the expected\nvalue of acceptance probability from line 5 under the i.i.d. sampling of line 4:\n\n$$\\sum_{\\pi \\in [n]^s} \\Big(\\prod_{i=1}^s q_{\\pi_i}\\Big) \\frac{\\det(\\frac{1}{s} X^\\top Q_\\pi X)}{\\det(X^\\top X)} = \\frac{s(s-1) \\cdots (s-d+1)}{s^d} \\ge \\Big(1 - \\frac{d}{s}\\Big)^d \\ge 1 - \\frac{d^2}{s} \\ge \\frac{3}{4},$$\n\nwhere we also used Bernoulli\u2019s inequality and the fact that $s \\ge 4d^2$ (see line 2). 
Since the expected\nvalue of the acceptance probability is at least $\\frac{3}{4}$, an easy application of Markov\u2019s inequality shows\nthat at each trial there is at least a 50% chance of it being above $\\frac{1}{2}$. So, the probability of at least $r$\ntrials occurring is less than $(1 - \\frac{1}{4})^r$. Note that the computational cost of one trial is no more than the\ncost of SVD decomposition of matrix $X^\\top Q_\\pi X$ (for computing the determinant), which is $O(sd^2)$.\nThe cost of reverse iterative sampling (line 7) is also $O(sd^2)$ with high probability (as shown by\n[13]). Thus, the overall runtime is $O((d^2 + k) d^2 r)$, where $r \\le \\ln(\\frac{1}{\\delta}) / \\ln(\\frac{4}{3})$ w.p. at least $1 - \\delta$.\n\n4.1 Tail bounds for leveraged volume sampling\nAn analysis of leverage score sampling, essentially following [33, Section 2] which in turn draws\nfrom [31], highlights two basic sufficient conditions on the (random) subsampling matrix $Q_\\pi$ that\nlead to multiplicative tail bounds for $L(w_\\pi^*)$.\nIt is convenient to shift to an orthogonalization of the linear regression task $(X, y)$ by replacing\nmatrix $X$ with a matrix $U = X (X^\\top X)^{-1/2} \\in \\mathbb{R}^{n \\times d}$. It is easy to check that the columns of $U$ have\nunit length and are orthogonal, i.e., $U^\\top U = I$. Now, $v^* = U^\\top y$ is the least-squares solution for\nthe orthogonal problem $(U, y)$ and prediction vector $U v^* = U U^\\top y$ for $(U, y)$ is the same as the\nprediction vector $X w^* = X (X^\\top X)^{-1} X^\\top y$ for the original problem $(X, y)$. The same property\nholds for the subsampled estimators, i.e., $U v_\\pi^* = X w_\\pi^*$, where $v_\\pi^* = (Q_\\pi^{1/2} U)^+ Q_\\pi^{1/2} y$. Volume\nsampling probabilities are also preserved under this transformation, so w.l.o.g. we can work with the\northogonal problem. Now $L(v_\\pi^*)$ can be rewritten as\n\n$$L(v_\\pi^*) = \\|U v_\\pi^* - y\\|^2 \\overset{(1)}{=} \\|U v^* - y\\|^2 + \\|U (v_\\pi^* - v^*)\\|^2 \\overset{(2)}{=} L(v^*) + \\|v_\\pi^* - v^*\\|^2, \\qquad (6)$$\n\nwhere (1) follows via Pythagorean theorem from the fact that $U (v_\\pi^* - v^*)$ lies in the column span\nof $U$ and the residual vector $r = U v^* - y$ is orthogonal to all columns of $U$, and (2) follows from\n$U^\\top U = I$. By the definition of $v_\\pi^*$, we can write $\\|v_\\pi^* - v^*\\|$ as follows:\n\n$$\\|v_\\pi^* - v^*\\| = \\|(U^\\top Q_\\pi U)^{-1} U^\\top Q_\\pi (y - U v^*)\\| \\le \\|\\underbrace{(U^\\top Q_\\pi U)^{-1}}_{d \\times d}\\| \\, \\|\\underbrace{U^\\top Q_\\pi r}_{d \\times 1}\\|, \\qquad (7)$$\n\nwhere $\\|A\\|$ denotes the matrix 2-norm (i.e., the largest singular value) of $A$; when $A$ is a vector, then\n$\\|A\\|$ is its Euclidean norm. This breaks our task down to showing two key properties:\n\n1. Matrix multiplication: Upper bounding the Euclidean norm $\\|U^\\top Q_\\pi r\\|$,\n2. Subspace embedding: Upper bounding the matrix 2-norm $\\|(U^\\top Q_\\pi U)^{-1}\\|$.\n\nWe start with a theorem that implies strong guarantees for approximate matrix multiplication with\nleveraged volume sampling. Unlike with i.i.d. sampling, this result requires controlling the pairwise\ndependence between indices selected under rescaled volume sampling. Its proof is an interesting\napplication of a classical Hadamard matrix product inequality from [3] (Proof in Appendix C).\nTheorem 8 Let $U \\in \\mathbb{R}^{n \\times d}$ be a matrix s.t. $U^\\top U = I$. If sequence $\\pi \\in [n]^k$ is selected using\nleveraged volume sampling of size $k \\ge \\frac{2d}{\\epsilon}$, then for any $r \\in \\mathbb{R}^n$,\n\n$$\\mathbb{E}\\bigg[\\Big\\|\\frac{1}{k} U^\\top Q_\\pi r - U^\\top r\\Big\\|^2\\bigg] \\le \\epsilon \\|r\\|^2.$$\n\nNext, we turn to the subspace embedding property. The following result is remarkable because\nstandard matrix tail bounds used to prove this property for leverage score sampling are not applicable\nto volume sampling. 
In fact, obtaining matrix Chernoff bounds for negatively associated joint\ndistributions like volume sampling is an active area of research, as discussed in [21]. We address\nthis challenge by defining a coupling procedure for volume sampling and uniform sampling without\nreplacement, which leads to a curious reduction argument described in Appendix D.\nTheorem 9 Let $U \\in \\mathbb{R}^{n \\times d}$ be a matrix s.t. $U^\\top U = I$. There is an absolute constant $C$, s.t. if\nsequence $\\pi \\in [n]^k$ is selected using leveraged volume sampling of size $k \\ge C\\, d \\ln(\\frac{d}{\\delta})$, then\n\n$$\\Pr\\bigg(\\lambda_{\\min}\\Big(\\frac{1}{k} U^\\top Q_\\pi U\\Big) \\le \\frac{1}{8}\\bigg) \\le \\delta.$$\n\nTheorems 8 and 9 imply that the unbiased estimator $w_\\pi^*$ produced from leveraged volume sampling\nachieves multiplicative tail bounds with sample size $k = O(d \\log d + d/\\epsilon)$.\nCorollary 10 Let $X \\in \\mathbb{R}^{n \\times d}$ be a full rank matrix. There is an absolute constant $C$, s.t. if sequence\n$\\pi \\in [n]^k$ is selected using leveraged volume sampling of size $k \\ge C \\big(d \\ln(\\frac{d}{\\delta}) + \\frac{d}{\\epsilon \\delta}\\big)$, then for estimator\n\n$$w_\\pi^* = \\operatorname*{argmin}_w \\|Q_\\pi^{1/2} (X w - y)\\|^2,$$\n\nwe have $L(w_\\pi^*) \\le (1 + \\epsilon) L(w^*)$ with probability at least $1 - \\delta$.\nProof Let $U = X (X^\\top X)^{-1/2}$. Combining Theorem 8 with Markov\u2019s inequality, we have that for\nlarge enough $C$, $\\|U^\\top Q_\\pi r\\|^2 \\le \\epsilon \\frac{k^2}{8^2} \\|r\\|^2$ w.h.p., where $r = y - U v^*$. Finally following (6) and (7)\nabove, we have that w.h.p.\n\n$$L(w_\\pi^*) \\le L(w^*) + \\|(U^\\top Q_\\pi U)^{-1}\\|^2 \\, \\|U^\\top Q_\\pi r\\|^2 \\le L(w^*) + \\frac{8^2}{k^2}\\, \\epsilon\\, \\frac{k^2}{8^2} \\|r\\|^2 = (1 + \\epsilon) L(w^*).$$\n\n5 Conclusion\n\nWe developed a new variant of volume sampling which produces the first known unbiased subsampled\nleast-squares estimator with strong multiplicative loss bounds. 
In the process, we proved a novel extension of the Cauchy-Binet formula, as well as other fundamental combinatorial equalities. Moreover, we proposed an efficient algorithm called determinantal rejection sampling, which is to our knowledge the first joint determinantal sampling procedure that (after an initial $O(nd^2)$ preprocessing step for computing leverage scores) produces its $k$ samples in time $\\widetilde{O}((d^2 + k)d^2)$, independent of the data size $n$. When $n$ is very large, the preprocessing time can be reduced to $\\widetilde{O}(nd + d^5)$ by rescaling with sufficiently accurate approximations of the leverage scores. Surprisingly the estimator stays unbiased and the loss bound still holds with only slightly revised constants. For the sake of clarity we presented the algorithm based on rescaling with exact leverage scores in the main body of the paper. However we outline the changes needed when using approximate leverage scores in Appendix F.\n\nIn this paper we focused on tail bounds. However we conjecture that there are also volume sampling based unbiased estimators achieving expected loss bounds $\\mathbb{E}[L(w^*_\\pi)] \\le (1 + \\epsilon)\\, L(w^*)$ with size $O(\\frac{d}{\\epsilon})$.\n\nAcknowledgements\n\nMicha\u0142 Derezi\u0144ski and Manfred K. Warmuth were supported by NSF grant IIS-1619271. Daniel Hsu was supported by NSF grant CCF-1740833.\n\nReferences\n\n[1] Nir Ailon and Bernard Chazelle. The fast Johnson\u2013Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302\u2013322, 2009.\n\n[2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 126\u2013135, International Convention Centre, Sydney, Australia, 2017.\n\n[3] T Ando, Roger A. Horn, and Charles R. Johnson. 
The singular values of a Hadamard product: A basic inequality. Journal of Linear and Multilinear Algebra, 21(4):345\u2013365, 1987.\n\n[4] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464\u20131499, 2013.\n\n[5] Joshua Batson, Daniel A Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704\u20131721, 2012.\n\n[6] L Elisa Celis, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. How to be fair and diverse? arXiv:1610.07183, October 2016.\n\n[7] L Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. Fair and diverse DPP-based data summarization. arXiv:1802.04023, February 2018.\n\n[8] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7(3):604\u2013619, 1996. Earlier version in 6th COLT, 1993.\n\n[9] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1\u201327:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n\n[10] Xue Chen and Eric Price. Condition number-free query and active learning of linear families. CoRR, abs/1711.10051, 2017.\n\n[11] Micha\u0142 Derezi\u0144ski and Manfred K Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems 30, pages 3087\u20133096, Long Beach, CA, USA, December 2017.\n\n[12] Micha\u0142 Derezi\u0144ski and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1\u201339, 2018.\n\n[13] Micha\u0142 Derezi\u0144ski and Manfred K. Warmuth. Subsampling for ridge regression via regularized volume sampling. 
In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.\n\n[14] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS \u201910, pages 329\u2013338, Washington, DC, USA, 2010.\n\n[15] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA \u201906, pages 1117\u20131126, Philadelphia, PA, USA, 2006.\n\n[16] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13(1):3475\u20133506, December 2012.\n\n[17] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for $\\ell_2$ regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127\u20131136, 2006.\n\n[18] Valerii V. Fedorov, William J. Studden, and E. M. Klimko, editors. Theory of optimal experiments. Probability and mathematical statistics. Academic Press, New York, 1972.\n\n[19] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys \u201916, pages 349\u2013356, New York, NY, USA, 2016.\n\n[20] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv:1001.2738, January 2010.\n\n[21] Nicholas JA Harvey and Neil Olver. Pipage rounding, pessimistic estimators and matrix concentration. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 926\u2013945. SIAM, 2014.\n\n[22] Wassily Hoeffding. 
Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13\u201330, 1963.\n\n[23] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193\u20131200. Omnipress, 2011.\n\n[24] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.\n\n[25] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 250\u2013269. IEEE, 2015.\n\n[26] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5045\u20135054. 2017.\n\n[27] Michael W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123\u2013224, February 2011.\n\n[28] Zelda E. Mariet and Suvrit Sra. Elementary symmetric polynomials for optimal experimental design. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2136\u20132145. 2017.\n\n[29] Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. arXiv:1802.08318, July 2018.\n\n[30] Robin Pemantle and Yuval Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140\u2013160, 2014.\n\n[31] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. 
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS \u201906, pages 143\u2013152, Washington, DC, USA, 2006.\n\n[32] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389\u2013434, August 2012.\n\n[33] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends\u00ae in Theoretical Computer Science, 10(1\u20132):1\u2013157, 2014.", "award": [], "sourceid": 1249, "authors": [{"given_name": "Michal", "family_name": "Derezinski", "institution": "UC Berkeley"}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": "Univ. of Calif. at Santa Cruz"}, {"given_name": "Daniel", "family_name": "Hsu", "institution": "Columbia University"}]}