{"title": "Efficient Symmetric Norm Regression via Linear Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 830, "page_last": 840, "abstract": "We provide efficient algorithms for overconstrained linear regression problems with size $n \\times d$ when the loss function is a symmetric norm (a norm invariant under sign-flips and coordinate-permutations). An important class of symmetric norms are Orlicz norms, where for a function $G$ and a vector $y \\in \\mathbb{R}^n$, the corresponding Orlicz norm $\\|y\\|_G$ is defined as the unique value $\\alpha$ such that $\\sum_{i=1}^n G(|y_i|/\\alpha) = 1$. When the loss function is an Orlicz norm, our algorithm produces a $(1 + \\varepsilon)$-approximate solution for an arbitrarily small constant $\\varepsilon > 0$ in input-sparsity time, improving over the previously best-known algorithm which produces a $d \\cdot \\polylog n$-approximate solution. When the loss function is a general symmetric norm, our algorithm produces a $\\sqrt{d} \\cdot \\polylog n \\cdot \\mathrm{mmc}(\\ell)$-approximate solution in input-sparsity time, where $\\mathrm{mmc}(\\ell)$ is a quantity related to the symmetric norm under consideration. To the best of our knowledge, this is the first input-sparsity time algorithm with provable guarantees for the general class of symmetric norm regression problem. Our results shed light on resolving the universal sketching problem for linear regression, and the techniques might be of independent interest to numerical linear algebra problems more broadly.", "full_text": "Ef\ufb01cient Symmetric Norm Regression via Linear\n\nSketching\u21e4\n\nZhao Song\n\nUniversity of Washington\n\nmagic.linuxkde@gmail.com\n\nRuosong Wang\n\nCarnegie Mellon University\nruosongw@andrew.cmu.edu\n\nUniversity of California, Los Angeles\n\nToyota Technological Institute at Chicago\n\nLin F. 
Yang\n\nlinyang@ee.ucla.edu\n\nHongyang Zhang\n\nhongyanz@ttic.edu\n\nPeilin Zhong\n\nColumbia University\n\npz2225@columbia.edu\n\nAbstract\n\nWe provide ef\ufb01cient algorithms for overconstrained linear regression problems\nwith size n\u21e5 d when the loss function is a symmetric norm (a norm invariant under\nsign-\ufb02ips and coordinate-permutations). An important class of symmetric norms\nare Orlicz norms, where for a function G and a vector y 2 Rn, the corresponding\nOrlicz norm kykG is de\ufb01ned as the unique value \u21b5 such thatPn\ni=1 G(|yi|/\u21b5) = 1.\nWhen the loss function is an Orlicz norm, our algorithm produces a (1 + \")-\napproximate solution for an arbitrarily small constant \"> 0 in input-sparsity\ntime, improving over the previously best-known algorithm which produces a\nd \u00b7 polylog n-approximate solution. When the loss function is a general symmetric\nnorm, our algorithm produces a pd \u00b7 polylog n \u00b7 mmc(`)-approximate solution\nin input-sparsity time, where mmc(`) is a quantity related to the symmetric norm\nunder consideration. To the best of our knowledge, this is the \ufb01rst input-sparsity\ntime algorithm with provable guarantees for the general class of symmetric norm\nregression problem. Our results shed light on resolving the universal sketching\nproblem for linear regression, and the techniques might be of independent interest\nto numerical linear algebra problems more broadly.\n\nIntroduction\n\n1\nLinear regression is a fundamental problem in machine learning. For a data matrix A 2 Rn\u21e5d and a\nresponse vector b 2 Rn with n d, the overconstrained linear regression problem can be formulated\nas solving the following optimization problem:\n(1)\n\nmin\n\nx2Rd L(Ax b),\n\nwhere L : Rn ! R is a loss function. Via the technique of linear sketching, we have witnessed many\nremarkable speedups for linear regression for a wide range of loss functions. 
This technique involves designing a sketching matrix $S \in \mathbb{R}^{r \times n}$ and showing that, by solving a linear regression instance on the data matrix $SA$ and the response vector $Sb$, which is usually much smaller in size, one can obtain an approximate solution to the original linear regression instance in (1). Sarlós showed in [29] that by taking $S$ to be a Fast Johnson-Lindenstrauss Transform matrix [1], one can obtain $(1+\varepsilon)$-approximate solutions to the least squares regression problem ($L(y) = \|y\|_2^2$) in $O(nd \log n + \mathrm{poly}(d/\varepsilon))$ time. The running time was later improved to $O(\mathrm{nnz}(A) + \mathrm{poly}(d/\varepsilon))$ [12, 26, 28, 23, 15]. Here $\mathrm{nnz}(A)$ is the number of non-zero entries in the data matrix $A$, which could be much smaller than $nd$ for sparse data matrices. This technique was later generalized to other loss functions. By now, we have $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d/\varepsilon))$ time algorithms for $\ell_p$ norms ($L(y) = \|y\|_p^p$) [18, 26, 35, 16, 32], the quantile loss function [36], M-estimators [14, 13] and the Tukey loss function [11].

Table 1: M-estimators

  HUBER:              $G(x) = x^2/2$ for $|x| \leq c$, and $G(x) = c(|x| - c/2)$ for $|x| > c$
  $\ell_1$-$\ell_2$:  $G(x) = 2(\sqrt{1 + x^2/2} - 1)$
  "FAIR":             $G(x) = c^2(|x|/c - \log(1 + |x|/c))$

Although we have successfully applied the technique of linear sketching to many different loss functions, it would be more desirable to design algorithms that work for a wide range of loss functions at once, instead of designing a new sketching algorithm for every specific loss function. Naturally, this leads to the following problem, which is the linear regression version of the universal sketching problem (see footnote 2) studied in streaming algorithms [10, 9].

*All authors contribute equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
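For intuition, the sketch-and-solve recipe for least squares described above can be simulated in a few lines: draw a random $S$, then solve the small problem $\min_x \|SAx - Sb\|_2$. The following is a minimal illustrative sketch with a CountSketch-style $S$; the sketch size, data, and noise level are our own choices, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, r = 20000, 8, 600
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# CountSketch-style S: each row of A is hashed to one of r buckets with a
# random sign, so applying S costs O(nnz(A)).
bucket = rng.integers(0, r, size=n)
sign = rng.choice([-1.0, 1.0], size=n)
SA = np.zeros((r, d))
Sb = np.zeros(r)
np.add.at(SA, bucket, sign[:, None] * A)   # accumulate signed rows per bucket
np.add.at(Sb, bucket, sign * b)

x_sketch = np.linalg.lstsq(SA, Sb, rcond=None)[0]   # solve the small problem
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]        # exact solution, for reference

res_sketch = np.linalg.norm(A @ x_sketch - b)
res_opt = np.linalg.norm(A @ x_opt - b)
ratio = res_sketch / res_opt   # close to 1 when r is comfortably larger than d
```

Since the sketch has $r \gg d$ rows, the residual of the sketched solution is only slightly worse than the optimum, while the dominant cost drops from solving an $n \times d$ problem to an $r \times d$ one.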
We note that similar problems have also been asked and studied for various algorithmic tasks, including principal component analysis [31], sampling [21], approximate nearest neighbor search [4, 3], discrepancy [17, 8], sparse recovery [27] and mean estimation with statistical queries [19, 22].

Problem 1. Is it possible to design sketching algorithms for linear regression that work for a wide range of loss functions?

Prior to our work, [14, 13] studied this problem for M-estimators, where the loss function takes the form $L(y) = \sum_{i=1}^n G(y_i)$ for some function $G$. See Table 1 for a list of M-estimators. However, much less is known for the case where the loss function $L(\cdot)$ is a norm, except for $\ell_p$ norms. Recently, Andoni et al. [2] tackled Problem 1 for Orlicz norms, which can be seen as a scale-invariant version of M-estimators. For a function $G$ and a vector $y \in \mathbb{R}^n$ with $y \neq 0$, the corresponding Orlicz norm $\|y\|_G$ is defined as the unique value $\alpha$ such that

$$\sum_{i=1}^n G(|y_i|/\alpha) = 1. \qquad (2)$$

When $y = 0$, we define $\|y\|_G$ to be $0$. Note that Orlicz norms include $\ell_p$ norms as special cases, by taking $G(z) = |z|^p$ for some $p \geq 1$. Under certain assumptions on the function $G$, [2] obtains the first input-sparsity time algorithm for solving Orlicz norm regression. More precisely, in $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d \log n))$ time, their algorithm obtains a solution $\widehat{x} \in \mathbb{R}^d$ such that $\|A\widehat{x} - b\|_G \leq d \cdot \mathrm{polylog}\, n \cdot \min_{x \in \mathbb{R}^d} \|Ax - b\|_G$.

There are two natural problems left open by the work of [2]. First, the algorithm in [2] has approximation ratio as large as $d \cdot \mathrm{polylog}\, n$. Although this result is interesting from a theoretical point of view, such a large approximation ratio is prohibitive for machine learning applications in practice. Is it possible to obtain an algorithm that runs in $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d/\varepsilon))$ time, with approximation ratio $1 + \varepsilon$ for arbitrarily small $\varepsilon$, similar to the case of $\ell_p$ norms?
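The defining equation (2) determines $\|y\|_G$ only implicitly, but since $\alpha \mapsto \sum_i G(|y_i|/\alpha)$ is continuous and strictly decreasing for $y \neq 0$, the norm can be computed to any accuracy by bisection. A minimal illustrative sketch (the choices of $G$, tolerance, and test vectors are ours, not from the paper):

```python
import numpy as np

def orlicz_norm(y, G, tol=1e-10):
    """Solve sum_i G(|y_i| / alpha) = 1 for alpha by bisection.

    Valid for any strictly increasing G with G(0) = 0, since the
    left-hand side is then strictly decreasing in alpha."""
    y = np.abs(np.asarray(y, dtype=float))
    if not y.any():
        return 0.0
    f = lambda alpha: G(y / alpha).sum() - 1.0
    lo, hi = tol, 1.0
    while f(hi) > 0:          # grow the bracket until the sum drops below 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check: G(z) = z^p recovers the l_p norm.
y = np.array([3.0, -4.0, 12.0])
l2 = orlicz_norm(y, lambda z: z**2)   # ||y||_2 = 13
l1 = orlicz_norm(y, lambda z: z)      # ||y||_1 = 19
```

The $\ell_p$ special case gives an easy correctness check, since $\sum_i (|y_i|/\alpha)^p = 1$ has the closed-form solution $\alpha = \|y\|_p$.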
Moreover, although Orlicz norms include a wide range of norms, many other important norms, e.g., top-$k$ norms (the sum of absolute values of the leading $k$ coordinates of a vector), max-mix of $\ell_p$ norms (e.g., $\max\{\|x\|_2, c\|x\|_1\}$ for some $c > 0$), and sum-mix of $\ell_p$ norms (e.g., $\|x\|_2 + c\|x\|_1$ for some $c > 0$), are not Orlicz norms. More complicated examples include the $k$-support norm [5] and the box-norm [25], which have found applications in sparse recovery. In light of Problem 1, it is natural to ask whether it is possible to apply the technique of linear sketching to a broader class of norms. In this paper, we obtain affirmative answers to both problems, and make progress towards finally resolving Problem 1.

Notations. We use $\widetilde{O}(f)$ to denote $f \cdot \mathrm{polylog}\, f$. For a matrix $A \in \mathbb{R}^{n \times d}$, we use $A_i \in \mathbb{R}^d$ to denote its $i$-th row, viewed as a column vector. For $n$ real numbers $x_1, x_2, \ldots, x_n$, we define $\mathrm{diag}(x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n \times n}$ to be the diagonal matrix whose $i$-th diagonal entry is $x_i$. For a vector $x \in \mathbb{R}^n$ and $p \geq 1$, we use $\|x\|_p$ to denote its $\ell_p$ norm, and $\|x\|_0$ to denote its $\ell_0$ norm, i.e., the number of non-zero entries in $x$. For two vectors $x, y \in \mathbb{R}^n$, we use $\langle x, y \rangle$ to denote their inner product. For any $n > 0$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For $0 \leq p \leq 1$, we define $\mathrm{Ber}(p)$ to be the Bernoulli distribution with parameter $p$. We use $\mathbb{S}^{n-1}$ to denote the unit $\ell_2$ sphere in $\mathbb{R}^n$, i.e., $\mathbb{S}^{n-1} = \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$. We use $\mathbb{R}_{\geq 0}$ to denote the set of all non-negative real numbers, i.e., $\mathbb{R}_{\geq 0} = \{x \in \mathbb{R} \mid x \geq 0\}$.

(Footnote 2: https://sublinear.info/index.php?title=Open_Problems:30)

1.1 Our Contributions

Algorithm for Orlicz Norms. Our first contribution is a unified algorithm which produces $(1+\varepsilon)$-approximate solutions to the linear regression problem in (1) when the loss function $L(\cdot)$ is an Orlicz norm.
Before introducing our results, we first give our assumptions on the function $G$, which appeared in (2).

Assumption 1. We assume the function $G : \mathbb{R} \to \mathbb{R}_{\geq 0}$ satisfies the following properties:

1. $G$ is a strictly increasing convex function on $[0, \infty)$;
2. $G(0) = 0$, and for all $x \in \mathbb{R}$, $G(x) = G(-x)$;
3. There exists some $C_G > 0$ such that for all $0 < x < y$, $G(y)/G(x) \leq C_G (y/x)^2$.

The first two conditions in Assumption 1 are necessary to make sure the corresponding Orlicz norm $\|\cdot\|_G$ is indeed a norm, and the third condition requires the function $G$ to have at most quadratic growth, which is satisfied by all M-estimators in Table 1 and is also required by prior work [2]. Notice that our assumptions are weaker than those in [2]. In [2], it is further required that $G(x)$ is a linear function when $x > 1$, and that $G$ is twice differentiable on an interval $(0, \delta_G)$ for some $\delta_G > 0$.

Given our assumptions on $G$, our main theorem is summarized as follows.

Theorem 1. For a function $G$ that satisfies Assumption 1, there exists an algorithm that, on any input $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, finds a vector $x^*$ in time $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d/\varepsilon))$, such that with probability at least $0.9$, $\|Ax^* - b\|_G \leq (1+\varepsilon) \min_{x \in \mathbb{R}^d} \|Ax - b\|_G$.

To the best of our knowledge, this is the first input-sparsity time algorithm with a $(1+\varepsilon)$-approximation guarantee that goes beyond $\ell_p$ norms, the quantile loss function, and M-estimators. See Table 2 for a more comprehensive comparison with previous results.

Algorithm for Symmetric Norms. We further study the case when the loss function $L(\cdot)$ is a symmetric norm. Symmetric norms are a more general class of norms, which includes all norms that are invariant under sign-flips and coordinate-permutations. Formally, we define symmetric norms as follows.

Definition 1. A norm $\|\cdot\|_\ell$ is called a symmetric norm if $\|(y_1, y_2, \ldots, y_n)\|_\ell = \|(s_1 y_{\sigma(1)}, s_2 y_{\sigma(2)}, \ldots, s_n y_{\sigma(n)})\|_\ell$ for any permutation $\sigma$ and any assignment of $s_i \in \{-1, 1\}$.

Symmetric norms include $\ell_p$ norms and Orlicz norms as special cases. They also include all examples provided in the introduction, i.e., top-$k$ norms, max-mix of $\ell_p$ norms, sum-mix of $\ell_p$ norms, the $k$-support norm [5] and the box-norm [25], as special cases. Understanding this general set of loss functions can be seen as a preliminary step towards resolving Problem 1. Our main result for symmetric norm regression is summarized in the following theorem.

Theorem 2. Given a symmetric norm $\|\cdot\|_\ell$, there exists an algorithm that, on any input $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, finds a vector $x^*$ in time $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$, such that with probability at least $0.9$, $\|Ax^* - b\|_\ell \leq \sqrt{d} \cdot \mathrm{polylog}\, n \cdot \mathrm{mmc}(\ell) \cdot \min_{x \in \mathbb{R}^d} \|Ax - b\|_\ell$.

In the above theorem, $\mathrm{mmc}(\ell)$ is a characteristic of the symmetric norm $\|\cdot\|_\ell$, which has been proven to be essential in streaming algorithms for symmetric norms [7]. See Definition 7 for the formal definition of $\mathrm{mmc}(\ell)$, and Section 3 for more details. In particular, for $\ell_p$ norms with $p \leq 2$, top-$k$ norms with $k \geq n/\mathrm{polylog}\, n$, max-mix of the $\ell_2$ norm and the $\ell_1$ norm ($\max\{\|x\|_2, c\|x\|_1\}$ for some $c > 0$), sum-mix of the $\ell_2$ norm and the $\ell_1$ norm ($\|x\|_2 + c\|x\|_1$ for some $c > 0$), the $k$-support norm, and the box-norm, $\mathrm{mmc}(\ell)$ can be upper bounded by $\mathrm{polylog}\, n$, which implies that our algorithm has approximation ratio $\sqrt{d} \cdot \mathrm{polylog}\, n$ for all these norms. This clearly demonstrates the generality of our algorithm.

Table 2: Comparison among input-sparsity time linear regression algorithms

  Reference            | Loss Function          | Approximation Ratio
  [18, 26, 35, 16, 32] | $\ell_p$ norms         | $1 + \varepsilon$
  [36]                 | Quantile loss function | $1 + \varepsilon$
  [14, 13]             | M-estimators           | $1 + \varepsilon$
  [2]                  | Orlicz norms           | $d \cdot \mathrm{polylog}\, n$
  Theorem 1            | Orlicz norms           | $1 + \varepsilon$
  Theorem 2            | Symmetric norms        | $\sqrt{d} \cdot \mathrm{polylog}\, n \cdot \mathrm{mmc}(\ell)$

Empirical Evaluation. In Section E of the supplementary material, we test our algorithms on real datasets. Our empirical results quite clearly demonstrate the practicality of our methods.

1.2 Technical Overview

Similar to previous works on using linear sketching to speed up linear regression, our core technique is to provide efficient dimensionality reduction methods for Orlicz norms and general symmetric norms. In this section, we discuss the techniques behind our results.

Row Sampling Algorithm for Orlicz Norms. Compared to prior work on Orlicz norm regression [2], which is based on random projection (see footnote 3), our new algorithm is based on row sampling. For a given matrix $A \in \mathbb{R}^{n \times d}$, our goal is to output a sparse weight vector $w \in \mathbb{R}^n$ with at most $\mathrm{poly}(d \log n/\varepsilon)$ non-zero entries, such that with high probability, for all $x \in \mathbb{R}^d$,

$$(1-\varepsilon)\|Ax - b\|_G \leq \|Ax - b\|_{G,w} \leq (1+\varepsilon)\|Ax - b\|_G. \qquad (3)$$

Here, for a weight vector $w \in \mathbb{R}^n$ and a vector $y \in \mathbb{R}^n$, the weighted Orlicz norm $\|y\|_{G,w}$ is defined as the unique value $\alpha$ such that $\sum_{i=1}^n w_i G(|y_i|/\alpha) = 1$. See Definition 4 for the formal definition of the weighted Orlicz norm.
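The weighted norm $\|\cdot\|_{G,w}$ can be computed the same way as the unweighted one, by solving its defining equation for $\alpha$. A short illustrative sketch with our own choice of $G$; for $G(z) = z^2$ the weighted norm reduces to $(\sum_i w_i y_i^2)^{1/2}$, which gives an easy correctness check, and the seminorm properties stated later in Lemma 3 can be spot-checked numerically:

```python
import numpy as np

def weighted_orlicz_norm(y, w, G, tol=1e-10):
    """Solve sum_i w_i * G(|y_i| / alpha) = 1 for alpha by bisection."""
    y, w = np.abs(np.asarray(y, float)), np.asarray(w, float)
    if (w * y).sum() == 0:
        return 0.0
    f = lambda a: (w * G(y / a)).sum() - 1.0
    lo, hi = tol, 1.0
    while f(hi) > 0:                  # grow bracket until the sum drops below 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(4)
G = lambda z: z**2                    # with G(z) = z^2: weighted l_2 norm
y, z = rng.standard_normal(50), rng.standard_normal(50)
w = rng.random(50)                    # non-negative weights

nrm = lambda v: weighted_orlicz_norm(v, w, G)
closed_form = np.sqrt((w * y**2).sum())
triangle_ok = nrm(y + z) <= nrm(y) + nrm(z) + 1e-6   # subadditivity
homog_ok = abs(nrm(3.0 * y) - 3.0 * nrm(y)) < 1e-6   # absolute homogeneity
```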
To obtain a $(1+\varepsilon)$-approximate solution to Orlicz norm regression, by (3), it suffices to solve

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_{G,w}. \qquad (4)$$

Since the vector $w \in \mathbb{R}^n$ has at most $\mathrm{poly}(d \log n/\varepsilon)$ non-zero entries, and we can ignore all rows of $A$ with zero weights, there are at most $\mathrm{poly}(d \log n/\varepsilon)$ remaining rows of $A$ in the optimization problem in (4). Furthermore, as we show in Lemma 3, $\|\cdot\|_{G,w}$ is a seminorm, which implies that we can solve the optimization problem in (4) in $\mathrm{poly}(d \log n/\varepsilon)$ time, by simply solving a convex program of size $\mathrm{poly}(d \log n/\varepsilon)$. Thus, we focus on how to obtain the weight vector $w \in \mathbb{R}^n$ in the rest of this overview. Furthermore, by taking $\bar{A}$ to be the matrix whose first $d$ columns are $A$ and whose last column is $b$, to satisfy (3) it suffices to find a weight vector $w$ such that for all $x \in \mathbb{R}^{d+1}$,

$$(1-\varepsilon)\|\bar{A}x\|_G \leq \|\bar{A}x\|_{G,w} \leq (1+\varepsilon)\|\bar{A}x\|_G. \qquad (5)$$

Hence, we ignore the response vector $b$ in the remaining part of the discussion.

We obtain the weight vector $w$ via importance sampling. We compute a set of sampling probabilities $\{p_i\}_{i=1}^n$, one for each row of the data matrix $A$, and sample the rows of $A$ according to these probabilities. The $i$-th entry of the weight vector $w$ is then set to be $w_i = 1/p_i$ with probability $p_i$ and $w_i = 0$ with probability $1 - p_i$. However, unlike $\ell_p$ norms, Orlicz norms are not "entry-wise" norms, and it is not even clear that such a sampling process gives an unbiased estimate. Our key insight here is that for a vector $Ax$ with unit Orlicz norm, if for all $x \in \mathbb{R}^d$,

$$(1-\varepsilon)\sum_{i=1}^n G((Ax)_i) \leq \sum_{i=1}^n w_i G((Ax)_i) \leq (1+\varepsilon)\sum_{i=1}^n G((Ax)_i), \qquad (6)$$

then (5) holds, which follows from the convexity of the function $G$. See Lemma 7 and its proof for more details.
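The sampling scheme above is easy to simulate: with $w_i = 1/p_i$ chosen as described, $\sum_i w_i G(y_i)$ is an unbiased estimate of $\sum_i G(y_i)$ for any fixed vector $y$. A small illustrative experiment (the probabilities here are ad-hoc, not the leverage-score-based ones the paper develops):

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):                          # Huber-style function with quadratic growth
    c = 1.0
    z = np.abs(z)
    return np.where(z <= c, z**2 / 2, c * (z - c / 2))

y = rng.standard_normal(5000)
true_sum = G(y).sum()

# Keep entry i with probability p_i; surviving entries get weight 1/p_i.
p = np.clip(np.abs(y) / np.abs(y).max(), 0.05, 1.0)  # ad-hoc probabilities
estimates = []
for _ in range(400):
    keep = rng.random(y.size) < p
    w = np.where(keep, 1.0 / p, 0.0)
    estimates.append((w * G(y)).sum())

mean_est = np.mean(estimates)      # averages out to roughly true_sum
```

Unbiasedness holds for any fixed $y$; the paper's harder step, handled by the net argument, is making (6) hold simultaneously for all $x$.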
Therefore, it remains to develop a way to define and calculate $\{p_i\}_{i=1}^n$ such that the total number of sampled rows is small.

(Footnote 3: Even for $\ell_p$ norms with $p < 2$, embeddings based on random projections will necessarily induce a distortion factor polynomial in $d$, as shown in [32].)

Our method for defining and computing the sampling probabilities $p_i$ is inspired by row sampling algorithms for $\ell_p$ norms [18]. Here, the key is to obtain an upper bound on the contribution of each entry to the summation $\sum_{i=1}^n G((Ax)_i)$. Indeed, suppose for some vector $u \in \mathbb{R}^n$ we have $G((Ax)_i) \leq u_i$ for all $x \in \mathbb{R}^d$ with $\|Ax\|_G = 1$; we can then sample each row of $A$ with probability proportional to $u_i$. Now, by standard concentration inequalities and a net argument, (6) holds with high probability. It remains to upper bound the total number of sampled rows, which is proportional to $\sum_{i=1}^n u_i$.

We use the case of the $\ell_2$ norm, i.e., $G(x) = x^2$, as an example to illustrate our main ideas for choosing the vector $u \in \mathbb{R}^n$. Suppose $U \in \mathbb{R}^{n \times d}$ is an orthonormal basis matrix of the column space of $A$; then the leverage score of the $i$-th row is defined to be the squared $\ell_2$ norm of the $i$-th row of $U$. Indeed, the leverage score gives an upper bound on the contribution of each row to $\|Ux\|_2^2$, since by the Cauchy-Schwarz inequality, for each row $U_i$ of $U$, we have $\langle U_i, x \rangle^2 \leq \|U_i\|_2^2 \|x\|_2^2 = \|U_i\|_2^2 \|Ux\|_2^2$, and thus we can set $u_i = \|U_i\|_2^2$. It is also clear that $\sum_{i=1}^n u_i = d$.

For general Orlicz norms, leverage scores are no longer upper bounds on $G((Ux)_i)$. Inspired by the role of orthonormal bases in the case of the $\ell_2$ norm, we first define well-conditioned bases for general Orlicz norms as follows.

Definition 2. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. We say $U \in \mathbb{R}^{n \times d}$ is a well-conditioned basis with condition number $\kappa_G = \kappa_G(U)$ if for all $x \in \mathbb{R}^d$, $\|x\|_2 \leq \|Ux\|_G \leq \kappa_G \|x\|_2$.

Given this definition, when $\|Ux\|_G = 1$, by the Cauchy-Schwarz inequality and the monotonicity of $G$, we can show that $G((Ux)_i) \leq G(\|U_i\|_2 \|x\|_2) \leq G(\|U_i\|_2 \|Ux\|_G) \leq G(\|U_i\|_2)$. This leads to our definition of Orlicz norm leverage scores.

Definition 3. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. For a given matrix $A \in \mathbb{R}^{n \times d}$ and a well-conditioned basis $U$ of the column space of $A$, the Orlicz norm leverage score of the $i$-th row of $A$ is defined to be $G(\|U_i\|_2)$.

It remains to give an upper bound on the sum of the Orlicz norm leverage scores of all rows. Unlike the $\ell_2$ norm, it is not immediately clear how to use the definition of a well-conditioned basis to obtain such an upper bound for general Orlicz norms. To achieve this goal, we use a novel probabilistic argument. Suppose one takes $x$ to be a vector of i.i.d. Gaussian random variables. Then each entry of $Ux$ has the same distribution as $\|U_i\|_2 \cdot g_i$, where $\{g_i\}_{i=1}^n$ is a set of standard Gaussian random variables. Thus, with constant probability, $\sum_{i=1}^n G((Ux)_i)$ is an upper bound on the sum of the Orlicz norm leverage scores. Furthermore, by the growth condition on the function $G$, we have $\sum_{i=1}^n G((Ux)_i) \leq C_G \|Ux\|_G^2$. Now by Definition 2, $\|Ux\|_G \leq \kappa_G \|x\|_2$, and $\|x\|_2 \leq O(\sqrt{d})$ with constant probability by tail inequalities for Gaussian random variables. This implies an upper bound on the sum of the Orlicz norm leverage scores. See Lemma 4 and its proof for more details.

Our approach for constructing well-conditioned bases is inspired by [30].
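For the $\ell_2$ specialization described above, the scores and their two key properties are easy to verify numerically. An illustrative check (not part of the paper's algorithm): leverage scores sum to exactly $d$, and each row's contribution to $\|Ux\|_2^2$ is bounded by its score.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8
A = rng.standard_normal((n, d))

# Orthonormal basis of the column space via reduced QR.
U, _ = np.linalg.qr(A)
scores = (U**2).sum(axis=1)       # leverage score of row i = ||U_i||_2^2

sum_scores = scores.sum()         # = d, since the columns of U are orthonormal

# Per-row bound: <U_i, x>^2 <= ||U_i||_2^2 * ||Ux||_2^2 (Cauchy-Schwarz,
# with ||x||_2 = ||Ux||_2 because U is orthonormal).
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
contrib = (U @ x)**2              # contribution of each row to ||Ux||_2^2
bound_holds = np.all(contrib <= scores * np.dot(U @ x, U @ x) + 1e-12)
```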
In Lemma 5, we show that given a subspace embedding $\Pi$ which embeds the column space of $A$ equipped with the Orlicz norm $\|\cdot\|_G$ into $\ell_2$ space with distortion $\kappa$, one can construct a well-conditioned basis with condition number $\kappa_G \leq \kappa$. The running time is dominated by calculating $\Pi A$ and performing a QR-decomposition of $\Pi A$. To this end, we can use the oblivious subspace embedding for Orlicz norms in Corollary 12 (see footnote 5) to construct well-conditioned bases. The embedding in Corollary 12 has $O(d)$ rows and $\kappa = \mathrm{poly}(d \log n)$, and calculating $\Pi A$ can be done in $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$ time. Using such an embedding to construct the well-conditioned basis, our row sampling algorithm produces a vector $w$ that satisfies (6) with $\|w\|_0 \leq \mathrm{poly}(d \log n/\varepsilon)$ in time $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$.

We would like to remark that our sampling algorithm still works even if the third condition in Assumption 1 does not hold. In general, suppose the function $G : \mathbb{R} \to \mathbb{R}$ satisfies $G(y)/G(x) \leq C_G (y/x)^p$ for all $0 < x < y$. Then, for the Orlicz norm induced by $G$, given a well-conditioned basis with condition number $\kappa_G$, our sampling algorithm returns a matrix with roughly $O((\sqrt{d}\,\kappa_G)^p \cdot d/\varepsilon^2)$ rows such that Theorem 1 holds. One may use the Löwner–John ellipsoid as the well-conditioned basis (as in [18]), which has condition number $\kappa_G = \sqrt{d}$ for any norm. However, calculating the Löwner–John ellipsoid requires at least $O(nd^5)$ time.

(Footnote 4: See, e.g., [24], for a survey on leverage scores.)

(Footnote 5: Alternatively, we can use the oblivious subspace embedding in [2] for this step. However, as we have discussed, the oblivious subspace embedding in [2] requires stronger assumptions on the function $G : \mathbb{R} \to \mathbb{R}_{\geq 0}$ than those in Assumption 1, which restricts the class of Orlicz norms to which our algorithm can be applied.)
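The Lemma 5 construction is mechanical enough to illustrate in the $\ell_2$ setting, where a rescaled Gaussian sketch serves as the subspace embedding: after QR-factorizing $\Pi A/\kappa$, the singular values of $AR^{-1}$ land in $[1, \kappa]$, i.e., the basis is well-conditioned. An illustrative sketch; the sketch size and the distortion bound $\varepsilon = 0.25$ are our own crude choices, not constants from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3000, 6
A = rng.standard_normal((n, d))

# Gaussian subspace embedding: after rescaling, w.h.p.
#   ||Ax||_2 <= ||Pi A x||_2 <= kappa ||Ax||_2  for all x.
r = 1000
Pi = rng.standard_normal((r, n)) / np.sqrt(r)
eps = 0.25                       # crude distortion bound for this sketch size
Pi = Pi / (1 - eps)              # rescale so the lower bound becomes ||Ax||_2
kappa = (1 + eps) / (1 - eps)

# Lemma 5 construction: QR-factorize Pi*A / kappa, use R as change of basis.
Q, R = np.linalg.qr(Pi @ A / kappa)
U = A @ np.linalg.inv(R)         # candidate well-conditioned basis

# Condition check: ||x||_2 <= ||Ux||_2 <= kappa ||x||_2 for all x,
# i.e. every singular value of U lies in [1, kappa].
s = np.linalg.svd(U, compute_uv=False)
```

Since $\Pi A R^{-1} = \kappa Q$ has all singular values equal to $\kappa$, the embedding guarantee sandwiches the singular values of $AR^{-1}$ between $1$ and $\kappa$.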
Moreover, the method described above fails when $p > 2$, since it requires an oblivious subspace embedding with $\mathrm{poly}(d)$ distortion, and it is known that such an embedding does not exist when $p > 2$ [10]. Since we focus on input-sparsity time algorithms in this paper, we only consider the case $p \leq 2$.

Finally, we would like to compare our sampling algorithm with that in [13]. First, the algorithm in [13] works for M-estimators, while we focus on Orlicz norms. Second, our definitions of Orlicz norm leverage scores and well-conditioned bases, as given in Definitions 2 and 3, are different from those in all previous works and are closely tied to the Orlicz norm under consideration. The algorithm in [13], on the other hand, simply uses $\ell_p$ leverage scores. Under our definition, we can prove that the sum of leverage scores is bounded by $O(C_G d \kappa_G^2)$ (Lemma 4), whose proof requires a novel probabilistic argument. In contrast, the upper bound on the sum of leverage scores in [13] is $O(\sqrt{nd})$ (Lemma 38 in [11]). Thus, the algorithm in [13] runs in an iterative manner, since in each round it can merely reduce the dimension from $n$ to $O(\sqrt{nd})$, while our algorithm is one-shot.

Oblivious Subspace Embeddings for Symmetric Norms. To obtain a faster algorithm for linear regression when the loss function is a general symmetric norm, we show that there exists a distribution over embedding matrices such that if $S$ is a random matrix drawn from that distribution, then for any $n \times d$ matrix $A$, with constant probability, for all $x \in \mathbb{R}^d$, $\|Ax\|_\ell \leq \|SAx\|_2 \leq \mathrm{poly}(d \log n) \cdot \mathrm{mmc}(\ell) \cdot \|Ax\|_\ell$. Moreover, the embedding matrix $S$ is sparse, and calculating $SA$ requires only $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$ time. Another favorable property of $S$ is that it is an oblivious subspace embedding, meaning that the distribution of $S$ does not depend on $A$.
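The construction of $S$ rests on subsampling coordinates at geometrically decreasing rates. The following toy experiment illustrates the underlying estimator in the simplest case, the $\ell_2$ norm of a flat 0/1 vector: keep each coordinate with probability $p$ and rescale by $\sqrt{1/p}$. For $\ell_2$ the rescaled sample recovers the norm up to small relative error; for general symmetric norms the guarantee degrades by $\mathrm{mmc}(\ell) \cdot \mathrm{polylog}\, n$ factors, as quantified below. The sizes and rate here are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, s = 1 << 16, 4096

# A "flat" vector: s non-zero coordinates, each equal to 1.
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = 1.0
true_norm = np.linalg.norm(x)          # = sqrt(s)

# Subsample at rate p = 2^{-i}: keep each coordinate independently.
p = 2.0 ** -4
keep = rng.random(n) < p               # diagonal 0/1 sampling matrix D
estimate = np.sqrt(1.0 / p) * np.linalg.norm(x[keep])

rel_err = abs(estimate - true_norm) / true_norm
```

In expectation the sample retains $ps$ of the non-zeros, so $\sqrt{1/p}\,\|Dx\|_2 \approx \sqrt{s} = \|x\|_2$; concentration makes the relative error small once $ps$ is large.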
To achieve this goal, it suffices to construct a random diagonal matrix $D$ such that for any fixed vector $x \in \mathbb{R}^n$,

$$\Pr[\|Dx\|_2 \geq \Omega(1/\mathrm{poly}(d \log n)) \cdot \|x\|_\ell] \geq 1 - \exp(-\Omega(d \log n)), \qquad (7)$$

and

$$\Pr[\|Dx\|_2 \leq \mathrm{poly}(d \log n) \cdot \mathrm{mmc}(\ell) \cdot \|x\|_\ell] \geq 1 - O(1/d). \qquad (8)$$

Our construction is inspired by the sub-sampling technique in [20], which was used for sketching symmetric norms in data streams [7]. Throughout the discussion, we use $\xi^{(q)} \in \mathbb{R}^n$ to denote a vector with $q$ non-zero entries, each equal to $1/\sqrt{q}$. Let us start with a special case where the vector $x \in \mathbb{R}^n$ has $s$ non-zero entries and each non-zero entry is $1$. It is easy to see that $\|x\|_\ell = \sqrt{s}\,\|\xi^{(s)}\|_\ell$. Now consider a random diagonal matrix $D$ which corresponds to a sampling process, i.e., each diagonal entry is set to $1$ with probability $p$ and $0$ with probability $1 - p$. Our goal is to show that $\sqrt{1/p}\,\|\xi^{(1/p)}\|_\ell \cdot \|Dx\|_2$ is a good estimator of $\|x\|_\ell$. If $p = \Theta(d \log n/s)$, then with probability at least $1 - \exp(-\Omega(d \log n))$, $Dx$ will contain at least one non-zero entry of $x$, in which case (7) is satisfied. However, we do not know $s$ in advance. Thus, we use $t = O(\log n)$ different matrices $D_1, D_2, \ldots, D_t$, where $D_i$ has sampling probability $1/2^i$. Clearly, at least one such $D_j$ can establish (7). For the upper bound part, if $p$ is much smaller than $1/s$, then $Dx$ will most likely contain no non-zero entry of $x$. Otherwise, in expectation $Dx$ will contain $ps$ non-zero entries, in which case our estimate will be roughly $\sqrt{s}\,\|\xi^{(1/p)}\|_\ell$, which can be upper bounded by $O(\log n \cdot \mathrm{mmc}(\ell) \cdot \sqrt{s}\,\|\xi^{(s)}\|_\ell)$. At this point, (8) follows from Markov's inequality. See Section C.5 for the formal argument, and Section 3 for a detailed discussion of $\mathrm{mmc}(\ell)$.

To generalize the above argument to arbitrary vectors, for a vector $x \in \mathbb{R}^n$ we conceptually partition its entries into $\Theta(\log n)$ groups, where the $i$-th group contains entries with magnitude in $[2^i, 2^{i+1})$. By averaging, at least one group of entries contributes at least an $\Omega(1/\log n)$ fraction of the value of $\|x\|_\ell$. To establish (7), we apply the lower bound part of the argument in the previous paragraph to this "contributing" group. To establish (8), we apply the upper bound part of the argument to all groups, which only induces an additional $O(\log n)$ factor in the approximation ratio, by the triangle inequality.

Since our oblivious subspace embedding embeds a given symmetric norm into $\ell_2$ space, in order to obtain an approximate solution to symmetric norm regression, we only need to solve a least squares regression instance of much smaller size. This is another advantage of our subspace embedding, since least squares regression is a well-studied problem in optimization and numerical linear algebra, for which many efficient algorithms are known, both in theory and in practice.

2 Linear Regression for Orlicz Norms

In this section, we introduce our results for Orlicz norm regression. We first give the definition of the weighted Orlicz norm.

Definition 4. For a function $G$ that satisfies Assumption 1 and a weight vector $w \in \mathbb{R}^n$ such that $w_i \geq 0$ for all $i \in [n]$, for a vector $x \in \mathbb{R}^n$, if $\sum_{i=1}^n w_i \cdot |x_i| = 0$, then the weighted Orlicz norm $\|x\|_{G,w}$ is defined to be $0$.
Otherwise, the weighted Orlicz norm $\|x\|_{G,w}$ is defined as the unique value $\alpha > 0$ such that $\sum_{i=1}^n w_i G(|x_i|/\alpha) = 1$.

When $w_i = 1$ for all $i \in [n]$, we have $\|x\|_{G,w} = \|x\|_G$, where $\|x\|_G$ is the (unweighted) Orlicz norm. It is well known that $\|\cdot\|_G$ is a norm. We show in the following lemma that $\|\cdot\|_{G,w}$ is a seminorm.

Lemma 3. For a function $G$ that satisfies Assumption 1 and a weight vector $w \in \mathbb{R}^n$ such that $w_i \geq 0$ for all $i \in [n]$, for all $x, y \in \mathbb{R}^n$, we have (i) $\|x\|_{G,w} \geq 0$, (ii) $\|x + y\|_{G,w} \leq \|x\|_{G,w} + \|y\|_{G,w}$, and (iii) $\|ax\|_{G,w} = |a| \cdot \|x\|_{G,w}$ for all $a \in \mathbb{R}$.

Leverage Scores and Well-Conditioned Bases for Orlicz Norms. The following lemma establishes an upper bound on the sum of the Orlicz norm leverage scores defined in Definition 3.

Lemma 4. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. Let $U \in \mathbb{R}^{n \times d}$ be a well-conditioned basis with condition number $\kappa_G$ as in Definition 2. Then we have $\sum_{i=1}^n G(\|U_i\|_2) \leq O(C_G d \kappa_G^2)$.

Now we show that, given a subspace embedding which embeds the column space of $A$ equipped with the Orlicz norm $\|\cdot\|_G$ into $\ell_2$ space with distortion $\kappa$, one can construct a well-conditioned basis with condition number $\kappa_G \leq \kappa$.

Lemma 5. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. For a given matrix $A \in \mathbb{R}^{n \times d}$ and an embedding matrix $\Pi \in \mathbb{R}^{s \times n}$, suppose that for all $x \in \mathbb{R}^d$, $\|Ax\|_G \leq \|\Pi A x\|_2 \leq \kappa \|Ax\|_G$. Let $Q \cdot R = \frac{1}{\kappa}\Pi A$ be a QR-decomposition of $\frac{1}{\kappa}\Pi A$. Then $AR^{-1}$ is a well-conditioned basis (see Definition 2) with $\kappa_G(AR^{-1}) \leq \kappa$.

The following lemma shows how to estimate Orlicz norm leverage scores given a change of basis matrix $R \in \mathbb{R}^{d \times d}$, in $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$ time.

Lemma 6. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1.
For a given matrix $A \in \mathbb{R}^{n \times d}$ and $R \in \mathbb{R}^{d \times d}$, there exists an algorithm that outputs $\{u_i\}_{i=1}^n$ such that with probability at least $0.99$, $u_i = \Theta(G(\|(AR^{-1})_i\|_2))$ for all $1 \leq i \leq n$. The algorithm runs in $\widetilde{O}(\mathrm{nnz}(A) + \mathrm{poly}(d))$ time.

The Row Sampling Algorithm. Based on the notion of Orlicz norm leverage scores and well-conditioned bases, we design a row sampling algorithm for Orlicz norms.

Lemma 7. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. Let $U \in \mathbb{R}^{n \times d}$ be a well-conditioned basis with condition number $\kappa_G = \kappa_G(U)$ as in Definition 2. For sufficiently small $\varepsilon$ and $\delta$, and a sufficiently large constant $C$, let $\{p_i\}_{i=1}^n$ be a set of sampling probabilities satisfying $p_i \geq \min\{1,\, C(\log(1/\delta) + d\log(1/\varepsilon))\varepsilon^{-2} G(\|U_i\|_2)\}$. Let $w$ be a vector whose $i$-th entry is set to $w_i = 1/p_i$ with probability $p_i$ and $w_i = 0$ with probability $1 - p_i$. Then with probability at least $1 - \delta$, for all $x \in \mathbb{R}^d$, we have $(1-\varepsilon)\|Ux\|_G \leq \|Ux\|_{G,w} \leq (1+\varepsilon)\|Ux\|_G$.

Solving Linear Regression for Orlicz Norms. Now we combine all the ingredients to give an algorithm for Orlicz norm regression. We use $\bar{A} \in \mathbb{R}^{n \times (d+1)}$ to denote the matrix whose first $d$ columns are $A$ and whose last column is $b$. The algorithm is described in Figure 1, and we prove its running time and correctness in Theorem 8. We assume we are given an embedding matrix $\Pi$ such that for all $x \in \mathbb{R}^{d+1}$, $\|\bar{A}x\|_G \leq \|\Pi \bar{A} x\|_2 \leq \kappa\|\bar{A}x\|_G$. The construction of $\Pi$ and the value of $\kappa$ are given in Corollary 12. In Section D.1 of the supplementary material, we use Theorem 8 and Corollary 12 to formally prove Theorem 1.

Figure 1: Algorithm for Orlicz norm regression

1. For the given embedding matrix $\Pi$, calculate $\Pi \bar{A}$ and invoke a QR-decomposition on $\Pi \bar{A}/\kappa$ to obtain $Q \cdot R = \Pi \bar{A}/\kappa$.
2. Invoke Lemma 6 to obtain $\{u_i\}_{i=1}^n$ such that $u_i = \Theta(G(\|(\bar{A}R^{-1})_i\|_2))$.
3. For a sufficiently large constant $C$, let $\{p_i\}_{i=1}^n$ be a set of sampling probabilities with $p_i \geq \min\{1,\, C \cdot d \cdot \varepsilon^{-2} \log(1/\varepsilon) \cdot G(\|(\bar{A}R^{-1})_i\|_2)\}$, and let $w$ be a vector whose $i$-th entry is $w_i = 1/p_i$ with probability $p_i$ and $w_i = 0$ with probability $1 - p_i$.
4. Calculate $x^* = \mathrm{argmin}_{x \in \mathbb{R}^d} \|Ax - b\|_{G,w}$. Return $x^*$.

Theorem 8. Let $\|\cdot\|_G$ be an Orlicz norm induced by a function $G$ which satisfies Assumption 1. Given an embedding matrix $\Pi$ such that for all $x \in \mathbb{R}^{d+1}$, $\|\bar{A}x\|_G \leq \|\Pi \bar{A} x\|_2 \leq \kappa\|\bar{A}x\|_G$, with probability at least $0.9$, the algorithm in Figure 1 outputs $x^* \in \mathbb{R}^d$ in time $\mathrm{poly}(d\kappa/\varepsilon) + T_{QR}(\Pi \bar{A})$, such that $\|Ax^* - b\|_G \leq (1+\varepsilon)\min_{x \in \mathbb{R}^d}\|Ax - b\|_G$. Here, $T_{QR}(\Pi \bar{A})$ is the running time for calculating $\Pi \bar{A}$ and invoking a QR-decomposition on $\Pi \bar{A}$.

3 Linear Regression for Symmetric Norms

In this section, we introduce SymSketch, a subspace embedding for symmetric norms.

Definition of SymSketch. We first formally define SymSketch. Due to space limitations, we give the definitions of Gaussian embeddings, CountSketch embeddings and their compositions in Section C.1.1 of the supplementary material.

Definition 5 (Symmetric Norm Sketch (SymSketch)). Let $t = \Theta(\log n)$. Let $\widetilde{D} \in \mathbb{R}^{n(t+1) \times n}$ be a matrix defined as $\widetilde{D} = [(w_0 D_0)^\top\ (w_1 D_1)^\top\ \cdots\ (w_t D_t)^\top]^\top$, where for each $i \in \{0, 1, \ldots, t\}$, $D_i = \mathrm{diag}(z_{i,1}, z_{i,2}, \ldots, z_{i,n}) \in \mathbb{R}^{n \times n}$ with $z_{i,j} \sim \mathrm{Ber}(1/2^i)$ for each $j \in [n]$. Moreover, $w_i = \|(1, 1, \ldots, 1, 0, \ldots, 0)\|_\ell$ (with $2^i$ ones). Let $\Pi \in \mathbb{R}^{O(d) \times n(t+1)}$ be a composition of a Gaussian embedding and a CountSketch embedding (Definition 12) with $\varepsilon = 0.1$, and let $S = \Pi \widetilde{D}$. We say $S \in \mathbb{R}^{O(d) \times n}$ is a SymSketch.

Modulus of Concentration. Now we give the definition of $\mathrm{mmc}(\ell)$ for a symmetric norm.

Definition 6 ([7]). Let $X$ denote the uniform distribution over $\mathbb{S}^{n-1}$.
The median of a symmetric norm ‖·‖_ℓ is the unique value M_ℓ such that Pr_{x∼X}[‖x‖_ℓ ≥ M_ℓ] ≥ 1/2 and Pr_{x∼X}[‖x‖_ℓ ≤ M_ℓ] ≥ 1/2.

Definition 7 ([7]). For a given symmetric norm ‖·‖_ℓ, we define the modulus of concentration to be mc(ℓ) = max_{x∈S^{n−1}} ‖x‖_ℓ/M_ℓ, and the maximum modulus of concentration to be mmc(ℓ) = max_{k∈[n]} mc(ℓ^{(k)}), where ‖·‖_{ℓ^{(k)}} is the norm on R^k defined by ‖(x_1, x_2, . . . , x_k)‖_{ℓ^{(k)}} = ‖(x_1, x_2, . . . , x_k, 0, . . . , 0)‖_ℓ.

It has been shown in [7] that mmc(ℓ) = Θ(n^{1/2−1/p}) for ℓ_p norms when p > 2, mmc(ℓ) = Θ(1) for ℓ_p norms when p ≤ 2, mmc(ℓ) = Θ̃(√(n/k)) for top-k norms, and mmc(ℓ) = O(log n) for the k-support norm [5] and the box-norm [25]. We show that mmc(ℓ) is upper bounded by O(1) for the max-mix and the sum-mix of the ℓ_2 norm and the ℓ_∞ norm.

Lemma 9. For a real number c > 0, let ‖x‖_{ℓ_a} = ‖x‖_2 + c‖x‖_∞ and ‖x‖_{ℓ_b} = max{‖x‖_2, c‖x‖_∞}. We have mmc(ℓ_a) = O(1) and mmc(ℓ_b) = O(1).

Moreover, we show that for an Orlicz norm ‖·‖_G induced by a function G which satisfies Assumption 1, mmc(ℓ) is upper bounded by O(√C_G · log n).

Lemma 10. For an Orlicz norm ‖·‖_G on R^n induced by a function G which satisfies Assumption 1, mmc(ℓ) is upper bounded by O(√C_G · log n).

Subspace Embedding. The following theorem shows that SymSketch is a subspace embedding.

Theorem 11. Let S ∈ R^{O(d)×n} be a SymSketch as defined in Definition 5. For a given matrix A ∈ R^{n×d}, with probability at least 0.9, for all x ∈ R^d,

Ω(1/(√d · log³ n)) · ‖Ax‖_ℓ ≤ ‖SAx‖_2 ≤ O(mmc(ℓ) · d² · log^{5/2} n) · ‖Ax‖_ℓ.

Furthermore, the running time of computing SA is Õ(nnz(A) + poly(d)).

Combining Theorem 11 with Lemma 10, we obtain the following corollary.

Corollary 12.
Let ‖·‖_G be an Orlicz norm induced by a function G which satisfies Assumption 1. Let S ∈ R^{O(d)×n} be a SymSketch as defined in Definition 5. For a given matrix A ∈ R^{n×d}, with probability at least 0.9, for all x ∈ R^d,

Ω(1/(√d · log³ n)) · ‖Ax‖_G ≤ ‖SAx‖_2 ≤ O(√C_G · d² · log^{7/2} n) · ‖Ax‖_G.

Furthermore, the running time of computing SA is Õ(nnz(A) + poly(d)).

4 Conclusion

In this paper, we give efficient algorithms for solving the overconstrained linear regression problem when the loss function is a symmetric norm. For the special case when the loss function is an Orlicz norm, our algorithm produces a (1 + ε)-approximate solution in Õ(nnz(A) + poly(d/ε)) time. When the loss function is a general symmetric norm, our algorithm produces a √d · polylog n · mmc(ℓ)-approximate solution in Õ(nnz(A) + poly(d)) time.

In light of Problem 1, a few interesting problems remain open. Is it possible to design an algorithm that produces (1 + ε)-approximate solutions to the linear regression problem when the loss function is a general symmetric norm? Furthermore, is it possible to use the technique of linear sketching to speed up the overconstrained linear regression problem when the loss function is a general norm? Answering these questions could lead to a better understanding of Problem 1.

Acknowledgements

P. Zhong is supported in part by NSF grants (CCF-1703925, CCF-1421161, CCF-1714818, CCF-1617955 and CCF-1740833), the Simons Foundation (#491119), a Google Research Award, and a Google Ph.D. fellowship. R. Wang is supported in part by NSF grant IIS-1763562, Office of Naval Research (ONR) grants (N00014-18-1-2562, N00014-18-1-2861), and an Nvidia NVAIL award. Part of this work was done while Z. Song, L. F. Yang, H. Zhang and P. Zhong were interns at IBM Research - Almaden and while Z. Song, R. Wang and H.
Zhang were visiting the Simons Institute for the Theory of Computing. Z. Song and P. Zhong would like to thank Alexandr Andoni, Kenneth L. Clarkson, Yin Tat Lee, Eric Price, Clifford Stein and David P. Woodruff for insightful discussions.

References

[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, pages 557–563, 2006.

[2] A. Andoni, C. Lin, Y. Sheng, P. Zhong, and R. Zhong. Subspace embedding and linear regression with Orlicz norm. In ICML, pages 224–233, 2018.

[3] A. Andoni, A. Naor, A. Nikolov, I. Razenshteyn, and E. Waingarten. Hölder homeomorphisms and approximate nearest neighbors. In FOCS, pages 159–169, 2018.

[4] A. Andoni, H. L. Nguyen, A. Nikolov, I. Razenshteyn, and E. Waingarten. Approximate near neighbors for general symmetric norms. In STOC, pages 902–913, 2017.

[5] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In NIPS, pages 1457–1465, 2012.

[6] H. Auerbach. On the area of convex curves with conjugate diameters. PhD thesis, University of Lwów, 1930.

[7] J. Błasiok, V. Braverman, S. R. Chestnut, R. Krauthgamer, and L. F. Yang. Streaming symmetric norms via measure concentration. In STOC, pages 716–729, 2017.

[8] P. Brändén. Hyperbolic polynomials and the Kadison-Singer problem. arXiv preprint arXiv:1809.03255, 2018.

[9] V. Braverman, S. R. Chestnut, D. P. Woodruff, and L. F. Yang. Streaming space complexity of nearly all functions of one variable on frequency vectors. In PODS, pages 261–276, 2016.

[10] V. Braverman and R. Ostrovsky. Zero-one frequency laws. In STOC, pages 281–290, 2010.

[11] K. L. Clarkson, R. Wang, and D. P. Woodruff. Dimensionality reduction for Tukey regression. In ICML, pages 1262–1271, 2019.

[12] K. L. Clarkson and D. P. Woodruff.
Low rank approximation and regression in input sparsity time. In STOC, pages 81–90, 2013.

[13] K. L. Clarkson and D. P. Woodruff. Input sparsity and hardness for robust subspace approximation. In FOCS, pages 310–329, 2015.

[14] K. L. Clarkson and D. P. Woodruff. Sketching for M-estimators: A unified approach to robust regression. In SODA, pages 921–939, 2015.

[15] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In SODA, pages 278–287, 2016.

[16] M. B. Cohen and R. Peng. ℓ_p row sampling by Lewis weights. In STOC, pages 183–192, 2015.

[17] D. Dadush, A. Nikolov, K. Talwar, and N. Tomczak-Jaegermann. Balancing vectors in any norm. In FOCS, 2018.

[18] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for ℓ_p regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[19] V. Feldman, C. Guzmán, and S. S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In SODA, pages 1265–1277, 2017.

[20] P. Indyk and D. P. Woodruff. Optimal approximations of the frequency moments of data streams. In STOC, pages 202–208, 2005.

[21] Y. T. Lee, Z. Song, and S. S. Vempala. Algorithmic theory of ODEs and sampling from well-conditioned logconcave densities. arXiv preprint arXiv:1812.06243, 2018.

[22] J. Li, A. Nikolov, I. Razenshteyn, and E. Waingarten. On mean estimation for general norms with statistical queries. In COLT, 2019.

[23] M. Li, G. L. Miller, and R. Peng. Iterative row sampling. In FOCS, pages 127–136, 2013.

[24] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[25] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In NIPS, pages 3644–3652, 2014.

[26] X. Meng and M. W. Mahoney.
Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, pages 91–100, 2013.

[27] V. Nakos, Z. Song, and Z. Wang. Robust sparse recovery via M-estimators. Manuscript, 2019.

[28] J. Nelson and H. L. Nguyen. Sparsity lower bounds for dimensionality reducing maps. In STOC, pages 101–110, 2013.

[29] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.

[30] C. Sohler and D. P. Woodruff. Subspace embeddings for the ℓ_1-norm with applications. In STOC, pages 755–764, 2011.

[31] Z. Song, D. P. Woodruff, and P. Zhong. Towards a zero-one law for column subset selection. In NeurIPS, 2019.

[32] R. Wang and D. P. Woodruff. Tight bounds for ℓ_p oblivious subspace embeddings. In SODA, pages 1825–1843, 2019.

[33] P. Wojtaszczyk. Banach Spaces for Analysts, volume 25. Cambridge University Press, 1996.

[34] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

[35] D. P. Woodruff and Q. Zhang. Subspace embeddings and ℓ_p-regression using exponential random variables. In COLT, pages 546–567, 2013.

[36] J. Yang, X. Meng, and M. Mahoney. Quantile regression for large-scale applications. In ICML, pages 881–887, 2013.