{"title": "The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 219, "page_last": 228, "abstract": "We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.", "full_text": "The Unreasonable Effectiveness of Structured\n\nRandom Orthogonal Embeddings\n\nKrzysztof Choromanski \u2217\nGoogle Brain Robotics\nkchoro@google.com\n\nMark Rowland \u2217\n\nUniversity of Cambridge\nmr504@cam.ac.uk\n\nAdrian Weller\n\nUniversity of Cambridge and Alan Turing Institute\n\naw665@cam.ac.uk\n\nAbstract\n\nWe examine a class of embeddings based on structured random matrices with\northogonal rows which can be applied in many machine learning applications\nincluding dimensionality reduction and kernel approximation. For both the Johnson-\nLindenstrauss transform and the angular kernel, we show that we can select matrices\nyielding guaranteed improved performance in accuracy and/or speed compared to\nearlier methods. We introduce matrices with complex entries which give signi\ufb01cant\nfurther accuracy improvement. 
We provide geometric and Markov chain-based\nperspectives to help understand the bene\ufb01ts, and empirical results which suggest\nthat the approach is helpful in a wider range of applications.\n\n1\n\nIntroduction\n\nEmbedding methods play a central role in many machine learning applications by projecting feature\nvectors into a new space (often nonlinearly), allowing the original task to be solved more ef\ufb01ciently.\nThe new space might have more or fewer dimensions depending on the goal. Applications include\nthe Johnson-Lindenstrauss Transform for dimensionality reduction (JLT, Johnson and Lindenstrauss,\n1984) and kernel methods with random feature maps (Rahimi and Recht, 2007). The embedding can\nbe costly hence many fast methods have been developed, see \u00a71.1 for background and related work.\nWe present a general class of random embeddings based on particular structured random matrices\nwith orthogonal rows, which we call random ortho-matrices (ROMs); see \u00a72. We show that ROMs\nmay be used for the applications above, in each case demonstrating improvements over previous\nmethods in statistical accuracy (measured by mean squared error, MSE), in computational ef\ufb01ciency\n(while providing similar accuracy), or both. We highlight the following contributions:\n\u2022 In \u00a73: The Orthogonal Johnson-Lindenstrauss Transform (OJLT) for dimensionality reduction.\nWe prove this has strictly smaller MSE than the previous unstructured JLT mechanisms. Further,\nOJLT is as fast as the fastest previous JLT variants (which are structured).\n\n\u2022 In \u00a74: Estimators for the angular kernel (Sidorov et al., 2014) which guarantee better MSE. 
The\nangular kernel is important for many applications, including natural language processing (Sidorov\net al., 2014), image analysis (J\u00e9gou et al., 2011), speaker representations (Schmidt et al., 2014)\nand tf-idf data sets (Sundaram et al., 2013).\n\n\u2022 In \u00a75: Two perspectives on the effectiveness of ROMs to help build intuitive understanding.\nIn \u00a76 we provide empirical results which support our analysis, and show that ROMs are effective for\na still broader set of applications. Full details and proofs of all results are in the Appendix.\n\n\u2217equal contribution\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f1.1 Background and related work\n\nOur ROMs can have two forms (see \u00a72 for details): (i) a Gort is a random Gaussian matrix con-\nditioned on rows being orthogonal; or (ii) an SD-product matrix is formed by multiplying some\nnumber k of SD blocks, each of which is highly structured, typically leading to fast computation\nof products. Here S is a particular structured matrix, and D is a random diagonal matrix; see \u00a72\nfor full details. Our SD block generalizes an HD block, where H is a Hadamard matrix, which\nreceived previous attention. Earlier approaches to embeddings have explored using various structured\nmatrices, including particular versions of one or other of our two forms, though in different contexts.\n\nFor dimensionality reduction, Ailon and Chazelle (2006) used a single HD block as a way to spread\nout the mass of a vector over all dimensions before applying a sparse Gaussian matrix. Choromanski\nand Sindhwani (2016) also used just one HD block as part of a larger structure. Bojarski et al. (2017)\ndiscussed using k = 3 HD blocks for locality-sensitive hashing methods but gave no concrete results\nfor their application to dimensionality reduction or kernel approximation. 
All these works, and other\nearlier approaches (Hinrichs and Vyb\u00edral, 2011; Vyb\u00edral, 2011; Zhang and Cheng, 2013; Le et al.,\n2013; Choromanska et al., 2016), provided computational bene\ufb01ts by using structured matrices with\nless randomness than unstructured iid Gaussian matrices, but none demonstrated accuracy gains.\n\nYu et al. (2016) were the \ufb01rst to show that Gort-type matrices can yield improved accuracy, but their\ntheoretical result applies only asymptotically for many dimensions, only for the Gaussian kernel and\nfor just one speci\ufb01c orthogonal transformation, which is one instance of the larger class we consider.\nTheir theoretical result does not yield computational bene\ufb01ts. Yu et al. (2016) did explore using a\nnumber k of HD blocks empirically, observing good computational and statistical performance for\nk = 3, but without any theoretical accuracy guarantees. It was left as an open question why matrices\nformed by a small number of HD blocks can outperform non-discrete transforms.\n\nIn contrast, we are able to prove that ROMs yield improved MSE in several settings and for many of\nthem for any number of dimensions. In addition, SD-product matrices can deliver computational\nspeed bene\ufb01ts. We provide initial analysis to understand why k = 3 can outperform the state-of-\nthe-art, why odd k yields better results than even k, and why higher values of k deliver decreasing\nadditional bene\ufb01ts (see \u00a73 and \u00a75).\n\n2 The family of Random Ortho-Matrices (ROMs)\n\nRandom ortho-matrices (ROMs) are taken from two main classes of distributions de\ufb01ned below that\nrequire the rows of sampled matrices to be orthogonal. A central theme of the paper is that this\northogonal structure can yield improved statistical performance. We shall use bold uppercase (e.g.\nM) to denote matrices and bold lowercase (e.g. x) for vectors.\nGaussian orthogonal matrices. 
Let G be a random matrix taking values in R^{m×n} with iid N(0, 1) elements, which we refer to as an unstructured Gaussian matrix. The first ROM distribution we consider yields the random matrix Gort, which is defined as a random R^{n×n} matrix given by first taking the rows of the matrix to be a uniformly random orthonormal basis, and then independently scaling each row, so that the rows marginally have multivariate Gaussian N(0, I) distributions. The random variable Gort can then be extended to non-square matrices by stacking independent copies of the R^{n×n} random matrices, and deleting superfluous rows if necessary. The orthogonality of the rows of this matrix has been observed to yield improved statistical properties for randomized algorithms built from the matrix in a variety of applications.

SD-product matrices. Our second class of distributions is motivated by the desire to obtain similar statistical benefits of orthogonality to Gort, whilst gaining computational efficiency by employing more structured matrices. We call this second class SD-product matrices. These take the more structured form ∏_{i=1}^k S D_i, where S = {s_{i,j}} ∈ R^{n×n} has orthogonal rows with |s_{i,j}| = 1/√n for all i, j ∈ {1, . . . , n}, and the (D_i)_{i=1}^k are independent diagonal matrices described below. By ∏_{i=1}^k S D_i, we mean the matrix product (S D_k) . . . (S D_1). This class includes as particular cases several recently introduced random matrices (e.g. Andoni et al., 2015; Yu et al., 2016), where good empirical performance was observed. We go further to establish strong theoretical guarantees, see §3 and §4.

A prominent example of an S matrix is the normalized Hadamard matrix H, defined recursively by H_1 = (1), and then for i > 1:

H_i = (1/√2) [ H_{i−1}  H_{i−1} ;  H_{i−1}  −H_{i−1} ].
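The recursion for H above is what makes the matrix fast to apply. As a minimal illustration (a NumPy sketch, not the implementation used in the paper), the product Hx can be computed in O(n log n) time with an in-place butterfly, without ever materializing H:

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform: returns H @ x in O(n log n).

    H is the recursively defined Hadamard matrix scaled so that
    |H_ij| = 1/sqrt(n), i.e. H is orthonormal. len(x) must be a power of 2.
    """
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    h = 1
    while h < n:
        # Butterfly stage: combine entries at distance h.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)
```

For small n one can instead materialize H directly from the recursion above (at O(n²) storage cost), which is convenient for testing the transform.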
Importantly, matrix-vector products with H are computable in O(n log n) time via the fast Walsh-Hadamard transform, yielding large computational savings. In addition, H matrices enable a significant space advantage: since the fast Walsh-Hadamard transform can be computed without explicitly storing H, only O(n) space is required to store the diagonal elements of (D_i)_{i=1}^k. Note that these H_n matrices are defined only for n a power of 2, but if needed, one can always adjust data by padding with 0s to enable the use of 'the next larger' H, doubling the number of dimensions in the worst case.

Matrices H are representatives of a much larger family in S which also attains computational savings. These are L2-normalized versions of Kronecker-product matrices of the form A_1 ⊗ . . . ⊗ A_l ∈ R^{n×n} for l ∈ N, where ⊗ stands for a Kronecker product and blocks A_i ∈ R^{d×d} have entries of the same magnitude and pairwise orthogonal rows each. For these matrices, matrix-vector products are computable in O(n(2d − 1) log_d(n)) time (Zhang et al., 2015).

S also includes the Walsh matrices W = {w_{i,j}} ∈ R^{n×n}, where w_{i,j} = (1/√n) (−1)^{i_{N−1} j_0 + . . . + i_0 j_{N−1}}, and i_{N−1} . . . i_0, j_{N−1} . . . j_0 are binary representations of i and j respectively.

For diagonal (D_i)_{i=1}^k, we mainly consider Rademacher entries, leading to the following matrices.

Definition 2.1. The S-Rademacher random matrix with k ∈ N blocks is given below, where the (D^(R)_i)_{i=1}^k are diagonal with iid Rademacher random variables [i.e. Unif({±1})] on the diagonals:

M^(k)_SR = ∏_{i=1}^k S D^(R)_i.   (1)

Having established the two classes of ROMs, we next apply them to dimensionality reduction.

3 The Orthogonal Johnson-Lindenstrauss Transform (OJLT)

Let X ⊂ R^n be a dataset of n-dimensional real vectors.
The goal of dimensionality reduction via random projections is to transform linearly each x ∈ X by a random mapping x ↦ x′ = F(x), where F : R^n → R^m for m < n, such that for any x, y ∈ X the following holds: (x′)ᵀy′ ≈ xᵀy. If we furthermore have E[(x′)ᵀy′] = xᵀy then the dot-product estimator is unbiased. In particular, this dimensionality reduction mechanism should in expectation preserve information about vectors' norms, i.e. we should have: E[‖x′‖₂²] = ‖x‖₂² for any x ∈ X.

The standard JLT mechanism uses the randomized linear map F = (1/√m) G, where G ∈ R^{m×n} is as in §2, requiring mn multiplications to evaluate. Several fast variants (FJLTs) have been proposed by replacing G with random structured matrices, such as sparse or circulant Gaussian matrices (Ailon and Chazelle, 2006; Hinrichs and Vybíral, 2011; Vybíral, 2011; Zhang and Cheng, 2013). The fastest of these variants has O(n log n) time complexity, but at a cost of higher MSE for dot-products.

Our Orthogonal Johnson-Lindenstrauss Transform (OJLT) is obtained by replacing the unstructured random matrix G with a sub-sampled ROM from §2: either Gort, or a sub-sampled version M^{(k),sub}_SR of the S-Rademacher ROM, given by sub-sampling rows from the left-most S matrix in the product. We sub-sample since m < n. We typically assume uniform sub-sampling without replacement. The resulting dot-product estimators for vectors x, y ∈ X are given by:

K̂^base_m(x, y) = (1/m) (Gx)ᵀ(Gy)   [unstructured iid baseline, previous state-of-the-art accuracy],
K̂^ort_m(x, y) = (1/m) (Gort x)ᵀ(Gort y),
K̂^{(k)}_m(x, y) = (1/m) (M^{(k),sub}_SR x)ᵀ(M^{(k),sub}_SR y).   (2)

We contribute the following closed-form expressions, which exactly quantify the mean-squared error (MSE) for these three estimators. Precisely, the MSE of an estimator K̂(x, y) of the inner product ⟨x, y⟩ for x, y ∈ X is defined to be MSE(K̂(x, y)) = E[(K̂(x, y) − ⟨x, y⟩)²]. See the Appendix for detailed proofs of these results and all others in this paper.

Lemma 3.1. The unstructured JLT dot-product estimator K̂^base_m of x, y ∈ R^n using m-dimensional random feature maps is unbiased, with MSE(K̂^base_m(x, y)) = (1/m)((xᵀy)² + ‖x‖₂²‖y‖₂²).

Theorem 3.2. The estimator K̂^ort_m is unbiased and satisfies, for n ≥ 4:

MSE(K̂^ort_m(x, y)) = MSE(K̂^base_m(x, y)) + ((m − 1)/m) [ (n² ‖x‖₂²‖y‖₂²) / (4 I(n − 3) I(n − 4)) ( (1/(n + 2) − 1/n) (I(n − 3) − I(n − 1)) I(n − 4) [cos²(θ) + 1/2] + (1/(n − 2) − 1/n) I(n − 1) (I(n − 4) − I(n − 2)) [cos²(θ) − 1/2] ) − ⟨x, y⟩² ],   (3)

where θ is the angle between x and y, and I(n) = ∫₀^π sin^n(x) dx = √π Γ((n + 1)/2) / Γ(n/2 + 1).

Theorem 3.3 (Key result). The OJLT estimator K̂^{(k)}_m(x, y) with k blocks, using m-dimensional random feature maps and uniform sub-sampling policy without replacement, is unbiased with

MSE(K̂^{(k)}_m(x, y)) = ((n − m)/(m(n − 1))) [ ((xᵀy)² + ‖x‖²‖y‖²) + Σ_{r=1}^{k−1} ((−1)^r 2^r / n^r) (2(xᵀy)² + ‖x‖²‖y‖²) + ((−1)^k 2^k / n^{k−1}) Σ_{i=1}^n x_i² y_i² ].   (4)

Proof (Sketch). For k = 1, the random projection matrix is given by sub-sampling rows from SD₁, and the computation can be carried out directly. For k ≥ 1, the proof proceeds by induction. The random projection matrix in the general case is given by sub-sampling rows of the matrix SD_k · · · SD₁. By writing the MSE as an expectation and using the law of conditional expectations, conditioning on the value of the first k − 1 random matrices D_{k−1}, . . . , D₁, the statement of the theorem for 1 SD block and for k − 1 SD blocks can be neatly combined to yield the result.

To our knowledge, it has not previously been possible to provide theoretical guarantees that SD-product matrices outperform iid matrices. Combining Lemma 3.1 with Theorem 3.3 yields the following important result.

Corollary 3.4 (Theoretical guarantee of improved performance). Estimators K̂^{(k)}_m (sub-sampling without replacement) yield guaranteed lower MSE than K̂^base_m.

It is not yet clear when K̂^ort_m is better or worse than K̂^{(k)}_m; we explore this empirically in §6. Theorem 3.3 shows that there are diminishing MSE benefits to using a large number k of SD blocks. Interestingly, odd k is better than even: it is easy to observe that MSE(K̂^{(2k−1)}_m(x, y)) < MSE(K̂^{(2k)}_m(x, y)) and MSE(K̂^{(2k)}_m(x, y)) > MSE(K̂^{(2k+1)}_m(x, y)). These observations, and those in §5, help to understand why empirically k = 3 was previously observed to work well (Yu et al., 2016).

If we take S to be a normalized Hadamard matrix H, then even though we are using sub-sampling, and hence the full computational benefits of the Walsh-Hadamard transform are not available, K̂^{(k)}_m still achieves improved MSE compared to the base method with less computational effort, as follows.

Lemma 3.5. There exists an algorithm (see Appendix for details) which computes an embedding for a given datapoint x using K̂^{(k)}_m with S set to H and uniform sub-sampling policy in expected time min{O((k − 1)n log(n) + nm − (m − 1)m/2), O(kn log(n))}.

Note that for m = ω(k log(n)) or if k = 1, the time complexity is smaller than the brute-force Θ(nm). The algorithm uses a simple observation that one can reuse calculations conducted for the upper half of the Hadamard matrix while performing computations involving rows from its other half, instead of running these calculations from scratch (details in the Appendix).

An alternative to sampling without replacement is deterministically to choose the first m rows. In our experiments in §6, these two approaches yield the same empirical performance, though we expect that the deterministic method could perform poorly on adversarially chosen data.
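To make the S-Rademacher OJLT concrete, here is a minimal NumPy sketch (illustrative only; `ojlt_features` is our name for the map, explicit O(n²) matrix products stand in for the fast Walsh-Hadamard transform, and the √(n/m) rescaling of the sub-sampled rows is an assumption made here so that the plain dot product of two feature vectors is an unbiased estimate of ⟨x, y⟩):

```python
import numpy as np

def normalized_hadamard(n):
    """Sylvester recursion for the orthonormal Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

def ojlt_features(x, m, k, rng):
    """x -> sqrt(n/m) * (m sub-sampled rows of S D_k ... S D_1) x, with S = H."""
    n = len(x)
    H = normalized_hadamard(n)
    z = np.asarray(x, dtype=float)
    for _ in range(k):
        d = rng.choice([-1.0, 1.0], size=n)      # Rademacher diagonal D_i
        z = H @ (d * z)                          # one S D_i block
    rows = rng.choice(n, size=m, replace=False)  # uniform sub-sampling, no replacement
    return np.sqrt(n / m) * z[rows]

# Sanity check: with m = n the map is orthogonal, so the dot-product
# estimate <x', y'> recovers <x, y> exactly.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(16), rng.standard_normal(16)
xp = ojlt_features(x, 16, 3, np.random.default_rng(1))  # same seed for x and y:
yp = ojlt_features(y, 16, 3, np.random.default_rng(1))  # both use the same matrix
```

With m < n, the dot product ⟨x′, y′⟩ fluctuates around ⟨x, y⟩ with the variance quantified by Theorem 3.3.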
The first m rows approach can be realized in time O(n log(m) + (k − 1)n log(n)) per datapoint.

Theorem 3.3 is a key result in this paper, demonstrating that SD-product matrices yield both statistical and computational improvements compared to the base iid procedure, which is widely used in practice. We next show how to obtain further gains in accuracy.

3.1 Complex variants of the OJLT

We show that the MSE benefits of Theorem 3.3 may be markedly improved by using SD-product matrices with complex entries M^{(k)}_SH. Specifically, we consider the S-Hybrid random matrix below, where D^{(U)}_k is a diagonal matrix with iid Unif(S¹) random variables on the diagonal, independent of (D^{(R)}_i)_{i=1}^{k−1}, and S¹ is the unit circle of C. We use the real part of the Hermitian product between projections as a dot-product estimator; recalling the definitions of §2, we use:

M^{(k)}_SH = S D^{(U)}_k ∏_{i=1}^{k−1} S D^{(R)}_i,    K̂^{H,(k)}_m(x, y) = (1/m) Re[ (M^{(k),sub}_SH x)^† (M^{(k),sub}_SH y) ].   (5)

Remarkably, this complex variant yields exactly half the MSE of the OJLT estimator.

Theorem 3.6. The estimator K̂^{H,(k)}_m(x, y), applying uniform sub-sampling without replacement, is unbiased and satisfies: MSE(K̂^{H,(k)}_m(x, y)) = (1/2) MSE(K̂^{(k)}_m(x, y)).

This large factor of 2 improvement could instead be obtained by doubling m for K̂^{(k)}_m. However, this would require doubling the number of parameters for the transform, whereas the S-Hybrid estimator requires additional storage only for the complex parameters in the matrix D^{(U)}_k.
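The S-Hybrid map can be sketched the same way (again illustrative; the helper names are ours, the √(n/m) rescaling is an assumed normalization, and we draw the final diagonal from Unif({1, −1, i, −i}), which per Theorem 3.7 below gives the same MSE as Unif(S¹)):

```python
import numpy as np

def normalized_hadamard(n):
    """Sylvester recursion for the orthonormal Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

def s_hybrid_features(x, m, k, rng):
    """x -> sqrt(n/m) * (m sub-sampled rows of S D^(U)_k S D^(R)_{k-1} ... S D^(R)_1) x."""
    n = len(x)
    H = normalized_hadamard(n)
    z = np.asarray(x, dtype=complex)
    for _ in range(k - 1):                                # k-1 Rademacher blocks
        z = H @ (rng.choice([-1.0, 1.0], size=n) * z)
    z = H @ (rng.choice([1, -1, 1j, -1j], size=n) * z)    # final complex block
    rows = rng.choice(n, size=m, replace=False)
    return np.sqrt(n / m) * z[rows]

def dot_estimate(xp, yp):
    """Real part of the Hermitian product of the two feature vectors."""
    return np.real(np.vdot(xp, yp))
```

As in the real case, with m = n the matrix is unitary, so the estimator is exact; with m < n it is an unbiased estimate of ⟨x, y⟩ under the assumed rescaling.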
Strikingly, it is straightforward to extend the proof of Theorem 3.6 (see Appendix) to show that rather than taking the complex random variables in M^{(k),sub}_SH to be Unif(S¹), it is possible to take them to be Unif({1, −1, i, −i}) and still obtain exactly the same benefit in MSE.

Theorem 3.7. For the estimator K̂^{H,(k)}_m defined in Equation (5): replacing the random matrix D^{(U)}_k (which has iid Unif(S¹) elements on the diagonal) with instead a random diagonal matrix having iid Unif({1, −1, i, −i}) elements on the diagonal does not affect the MSE of the estimator.

It is natural to wonder if using an SD-product matrix with more complex random variables (for all SD blocks) would improve performance still further. However, interestingly, this appears not to be the case; details are provided in the Appendix §8.7.

3.2 Sub-sampling with replacement

Our results above focus on SD-product matrices where rows have been sub-sampled without replacement. Sometimes (e.g. for parallelization) it can be convenient instead to sub-sample with replacement. As might be expected, this leads to worse MSE, which we can quantify precisely.

Theorem 3.8. For each of the estimators K̂^{(k)}_m and K̂^{H,(k)}_m, if uniform sub-sampling with (rather than without) replacement is used, then the MSE is worsened by a multiplicative constant of (n − 1)/(n − m).

4 Kernel methods with ROMs

ROMs can also be used to construct high-quality random feature maps for non-linear kernel approximation. We analyze here the angular kernel, an important example of a Pointwise Nonlinear Gaussian kernel (PNG), discussed in more detail at the end of this section.

Definition 4.1.
The angular kernel K_ang is defined on R^n by K_ang(x, y) = 1 − 2θ_{x,y}/π, where θ_{x,y} is the angle between x and y.

To employ random feature style approximations to this kernel, we first observe it may be rewritten as

K_ang(x, y) = E[sign(Gx) sign(Gy)],

where G ∈ R^{1×n} is an unstructured isotropic Gaussian vector. This motivates approximations of the form:

K̂^ang_m(x, y) = (1/m) sign(Mx)ᵀ sign(My),   (6)

where M ∈ R^{m×n} is a random matrix, and the sign function is applied coordinate-wise. Such kernel estimation procedures are heavily used in practice (Rahimi and Recht, 2007), as they allow fast approximate linear methods to be used (Joachims, 2006) for inference tasks. If M = G, the unstructured Gaussian matrix, then we obtain the standard random feature estimator. We shall contrast this approach against the use of matrices from the ROMs family.

When constructing random feature maps for kernels, very often m > n. In this case, our structured mechanism can be applied by concatenating some number of independent structured blocks. Our theoretical guarantees will be given just for one block, but can easily be extended to a larger number of blocks since different blocks are independent.

The standard random feature approximation K̂^ang,base_m for approximating the angular kernel is defined by taking M to be G, the unstructured Gaussian matrix, in Equation (6), and satisfies the following.

Lemma 4.2. The estimator K̂^ang,base_m is unbiased and MSE(K̂^ang,base_m(x, y)) = 4θ_{x,y}(π − θ_{x,y})/(mπ²).

The MSE of an estimator K̂^ang(x, y) of the true angular kernel K_ang(x, y) is defined analogously to the MSE of an estimator of the dot product, given in §3. Our main result regarding angular kernels states that if we instead take M = Gort in Equation (6), then we obtain an estimator K̂^ang,ort_m with strictly smaller MSE, as follows.

Theorem 4.3. Estimator K̂^ang,ort_m is unbiased and satisfies: MSE(K̂^ang,ort_m(x, y)) < MSE(K̂^ang,base_m(x, y)).

We also derive a formula for the MSE of an estimator K̂^ang,M_m of the angular kernel which replaces G with an arbitrary random matrix M and uses m random feature maps. The formula is helpful to see how the quality of the estimator depends on the probabilities that the projections of the rows of M are contained in some particular convex regions of the 2-dimensional space L_{x,y} spanned by datapoints x and y. For an illustration of the geometric definitions introduced in this Section, see Figure 1. The formula depends on probabilities involving events A_i = {sign((r^i)ᵀx) ≠ sign((r^i)ᵀy)}, where r^i stands for the ith row of the structured matrix. Notice that A_i = {r^i_proj ∈ C_{x,y}}, where r^i_proj stands for the projection of r^i into L_{x,y}, and C_{x,y} is the union of two cones in L_{x,y}, each of angle θ_{x,y}.

Theorem 4.4. Estimator K̂^ang,M_m satisfies the following, where δ_{i,j} = P[A_i ∩ A_j] − P[A_i]P[A_j]:

MSE(K̂^ang,M_m(x, y)) = (1/m²) [ m − Σ_{i=1}^m (1 − 2P[A_i])² ] + (4/m²) [ Σ_{i=1}^m (P[A_i] − θ_{x,y}/π)² + Σ_{i≠j} δ_{i,j} ].

Note that probabilities P[A_i] and δ_{i,j} depend on the choice of M. It is easy to prove that for unstructured G and Gort we have: P[A_i] = θ_{x,y}/π. Further, from the independence of the rows of G, δ_{i,j} = 0 for i ≠ j. For unstructured G we obtain Lemma 4.2.
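To make Equation (6) and Lemma 4.2 concrete, a small Monte Carlo sketch of the baseline estimator (illustrative only; the function names are ours):

```python
import numpy as np

def angular_kernel(x, y):
    """K_ang(x, y) = 1 - 2*theta/pi, theta the angle between x and y."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return 1.0 - 2.0 * theta / np.pi

def angular_estimate(x, y, M):
    """(1/m) sign(Mx)^T sign(My) for a random projection matrix M of shape (m, n)."""
    return np.mean(np.sign(M @ x) * np.sign(M @ y))

rng = np.random.default_rng(0)
n, m = 8, 20000
x, y = rng.standard_normal(n), rng.standard_normal(n)
G = rng.standard_normal((m, n))   # unstructured Gaussian baseline
est = angular_estimate(x, y, G)   # fluctuates around angular_kernel(x, y)
```

Since sign features depend only on row directions, replacing G by a row-orthogonalized Gaussian matrix gives the Gort variant of Theorem 4.3 without any rescaling concerns.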
Interestingly, we see that to prove Theorem 4.3, it suffices to show δ_{i,j} < 0, which is the approach we take (see Appendix). If we replace G with M^{(k)}_SR, then the expression Δ = P[A_i] − θ_{x,y}/π does not depend on i. Hence, the angular kernel estimator based on Hadamard matrices gives a smaller MSE estimator if and only if Σ_{i≠j} δ_{i,j} + mΔ² < 0. It is not yet clear if this holds in general.

As alluded to at the beginning of this section, the angular kernel may be viewed as a member of a wide family of kernels known as Pointwise Nonlinear Gaussian kernels.

Figure 1: Left part: Left: g1 is orthogonal to L_{x,y}. Middle: g1 ∈ L_{x,y}. Right: g1 is close to orthogonal to L_{x,y}. Right part: Visualization of the Cayley graph explored by the Hadamard-Rademacher process in two dimensions. Nodes are colored red, yellow, light blue, dark blue, for Cayley distances of 0, 1, 2, 3 from the identity matrix respectively. See text in §5.

Definition 4.5. For a given function f, the Pointwise Nonlinear Gaussian kernel (PNG) K^f is defined by K^f(x, y) = E[f(gᵀx) f(gᵀy)], where g is a Gaussian vector with i.i.d. N(0, 1) entries.

Many prominent examples of kernels (Williams, 1998; Cho and Saul, 2009) are PNGs. Wiener's tauberian theorem shows that all stationary kernels may be approximated arbitrarily well by sums of PNGs (Samo and Roberts, 2015). In future work we hope to explore whether ROMs can be used to achieve statistical benefit in estimation tasks associated with a wider range of PNGs.

5 Understanding the effectiveness of orthogonality

Here we build intuitive understanding for the effectiveness of ROMs. We examine geometrically the angular kernel (see §4), then discuss a connection to random walks over orthogonal matrices.

Angular kernel.
As noted above for the Gort-mechanism, smaller MSE than that for unstructured G is implied by the inequality P[A_i ∩ A_j] < P[A_i]P[A_j], which is equivalent to: P[A_j | A_i] < P[A_j]. Now it becomes clear why orthogonality is crucial. Without loss of generality take i = 1, j = 2, and let g1 and g2 be the first two rows of Gort.

Consider first the extreme case (middle of left part of Figure 1), where all vectors are 2-dimensional. Recall the definitions from just after Theorem 4.3. If g1 is in C_{x,y}, then it is much less probable for g2 also to belong to C_{x,y}. In particular, if θ < π/2 then the probability is zero. That implies the inequality. On the other hand, if g1 is perpendicular to L_{x,y}, then conditioning on A_i does not have any effect on the probability that g2 belongs to C_{x,y} (left subfigure of Figure 1). In practice, with high probability the angle φ between g1 and L_{x,y} is close to π/2, but is not exactly π/2. That again implies that, conditioned on the projection g1_p of g1 into L_{x,y} being in C_{x,y}, the more probable directions of g2_p are perpendicular to g1_p (see the ellipsoid-like shape in the right subfigure of Figure 1, which is the projection of the sphere taken from the (n − 1)-dimensional space orthogonal to g1 into L_{x,y}). This makes it less probable for g2_p also to be in C_{x,y}. The effect is subtle since φ ≈ π/2, but this is what provides the superiority of orthogonal transformations over state-of-the-art ones in the angular kernel approximation setting.

Markov chain perspective. We focus on Hadamard-Rademacher random matrices HD_k . . . HD_1, a special case of the SD-product matrices described in Section 2.
Our aim is to provide intuition for how the choice of k affects the quality of the random matrix, following our earlier observations just after Corollary 3.4, which indicated that for SD-product matrices, odd values of k yield greater benefits than even values, and that there are diminishing benefits from higher values of k. We proceed by casting the random matrices into the framework of Markov chains.

Definition 5.1. The Hadamard-Rademacher process in n dimensions is the Markov chain (X_k)_{k=0}^∞ taking values in the orthogonal group O(n), with X_0 = I almost surely, and X_k = HD_k X_{k−1} almost surely, where H is the normalized Hadamard matrix in n dimensions, and (D_k)_{k=1}^∞ are iid diagonal matrices with independent Rademacher random variables on their diagonals.

Constructing an estimator based on Hadamard-Rademacher matrices is equivalent to simulating several time steps from the Hadamard-Rademacher process. The quality of estimators based on Hadamard-Rademacher random matrices comes from a quick mixing property of the corresponding Markov chain. The following demonstrates attractive properties of the chain in low dimensions.

Figure 2: Top row: MSE curves for pointwise approximation of inner product and angular kernels on the g50c dataset, and randomly chosen vectors. Panels: (a) g50c - pointwise evaluation MSE for inner product estimation; (b) random - angular kernel; (c) random - angular kernel with true angle π/4; (d) g50c - inner product estimation MSE for variants of 3-block SD-product matrices. Bottom row: Gram matrix approximation error for a variety of data sets, projection ranks, transforms, and kernels: (e) LETTER - dot-product; (f) USPS - dot-product; (g) LETTER - angular kernel; (h) USPS - angular kernel. Note that the error scaling is dependent on the application.

Proposition 5.2.
The Hadamard-Rademacher process in two dimensions explores a state-space of 16 orthogonal matrices, is ergodic with respect to the uniform distribution on this set, has period 2, the diameter of the Cayley graph of its state space is 3, and the chain is fully mixed after 3 time steps.

This proposition, and the Cayley graph corresponding to the Markov chain's state space (Figure 1 right), illustrate the fast mixing properties of the Hadamard-Rademacher process in low dimensions; this agrees with the observations in §3 that there are diminishing returns associated with using a large number k of HD blocks in an estimator. The observation in Proposition 5.2 that the Markov chain has period 2 indicates that we should expect different behavior for estimators based on odd and even numbers of blocks of HD matrices, which is reflected in the analytic expressions for MSE derived in Theorems 3.3 and 3.6 for the dimensionality reduction setup.

6 Experiments

We present comparisons of estimators introduced in §3 and §4, illustrating our theoretical results, and further demonstrating the empirical success of ROM-based estimators at the level of Gram matrix approximation. We compare estimators based on: unstructured Gaussian matrices G, matrices Gort, S-Rademacher and S-Hybrid matrices with k = 3 and different sub-sampling strategies. Results for k > 3 do not show additional statistical gains empirically. Additional experimental results, including a comparison of estimators using different numbers of SD blocks, are in the Appendix §10. Throughout, we use the normalized Hadamard matrix H for the structured matrix S.

6.1 Pointwise kernel approximation

Complementing the theoretical results of §3 and §4, we provide several salient comparisons of the various methods introduced - see Figure 2 top. Plots presented here (and in the Appendix) compare MSE for the dot-product and angular kernels.
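Looking back at Proposition 5.2 in §5, its state-space claims can be checked by direct enumeration; a short sketch (a breadth-first search over left-multiplication by the four possible HD generators in two dimensions):

```python
import itertools
import numpy as np

# Normalized 2x2 Hadamard matrix and the four HD generators of the chain.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
generators = [H @ np.diag(d) for d in itertools.product([1.0, -1.0], repeat=2)]

def key(M):
    """Hashable representation of a matrix, robust to float round-off."""
    return tuple(np.round(M, 8).ravel())

# BFS from the identity: Cayley distances in the chain's state space.
dist = {key(np.eye(2)): 0}
frontier = [np.eye(2)]
while frontier:
    nxt = []
    for X in frontier:
        for g in generators:
            Y = g @ X          # one step of the Hadamard-Rademacher process
            if key(Y) not in dist:
                dist[key(Y)] = dist[key(X)] + 1
                nxt.append(Y)
    frontier = nxt
```

The enumeration reaches 16 orthogonal matrices (rotations by multiples of π/4 together with the corresponding reflections), with maximal Cayley distance 3, matching Proposition 5.2.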
They show that estimators based on Gort, S-Hybrid and S-Rademacher matrices, subsampled either without replacement or by taking the first m rows, beat the state-of-the-art unstructured G approach on accuracy for all our different datasets in the JLT setup. Interestingly, the latter two approaches also give smaller MSE than Gort-estimators. For angular kernel estimation, where sampling is not relevant, we see that the Gort and S-Rademacher approaches again outperform the ones based on matrices G.

6.2 Gram matrix approximation

Moving beyond the theoretical guarantees established in §3 and §4, we show empirically that the superiority of estimators based on ROMs is maintained at the level of Gram matrix approximation. We compute Gram matrix approximations (with respect to both the standard dot-product and the angular kernel) for a variety of datasets. We use the normalized Frobenius norm error $\|K - \widehat{K}\|_2 / \|K\|_2$ as our metric (as used by Choromanski and Sindhwani, 2016), and plot the mean error based on 1,000 repetitions of each random transform; see Figure 2 (bottom). The Gram matrices are computed on a randomly selected subset of 550 data points from each dataset. As can be seen, the S-Hybrid estimators using the “no-replacement” or “first m rows” sub-sampling strategies outperform even the orthogonal Gaussian ones in the dot-product case. For the angular case, the Gort and S-Rademacher approaches are practically indistinguishable.

7 Conclusion

We defined the family of random ortho-matrices (ROMs). This family contains the SD-product matrices, which include a number of recently proposed structured random matrices. We showed theoretically and empirically that ROMs have strong statistical and computational properties (in several cases outperforming previous state-of-the-art) for algorithms performing dimensionality reduction and random feature approximations of kernels.
We highlight Corollary 3.4, which provides a theoretical guarantee that SD-product matrices yield better accuracy than iid matrices in an important dimensionality reduction application (we believe this is the first result of its kind). Intriguingly, for dimensionality reduction, using just one complex structured matrix yields random features of much better quality. We provided perspectives to help understand the benefits of ROMs, and to help explain the behavior of SD-product matrices for various numbers of blocks. Our empirical findings suggest that our theoretical results might be further strengthened, particularly in the kernel setting.

Acknowledgements

We thank Vikas Sindhwani at Google Brain Robotics and Tamas Sarlos at Google Research for inspiring conversations that led to this work. We thank Matej Balog, Maria Lomeli, Jiri Hron and Dave Janz for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. AW acknowledges support by the Alan Turing Institute under the EPSRC grant EP/N510129/1, and by the Leverhulme Trust via the CFI.

References

N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.

A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, 2015.

M. Bojarski, A. Choromanska, K. Choromanski, F. Fagan, C. Gouy-Pailler, A. Morvan, N. Sakr, T. Sarlos, and J. Atif. Structured adaptive and random spinners for fast machine learning computations. In AISTATS, 2017.

Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, 2009.

A. Choromanska, K. Choromanski, M. Bojarski, T. Jebara, S. Kumar, and Y. LeCun. Binary embeddings with structured hashed projections. In ICML, 2016.

K.
Choromanski and V. Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. In ICML, 2016.

A. Hinrichs and J. Vybíral. Johnson-Lindenstrauss lemma for circulant matrices. Random Structures & Algorithms, 39(3):391–398, 2011.

H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

T. Joachims. Training linear SVMs in linear time. In KDD, 2006.

W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

Q. Le, T. Sarlós, and A. Smola. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

Y.-L. K. Samo and S. Roberts. Generalized spectral kernels. CoRR, abs/1506.02236, 2015.

L. Schmidt, M. Sharifi, and I. Moreno. Large-scale speaker identification. In ICASSP, pages 1650–1654. IEEE, 2014.

G. Sidorov, A. Gelbukh, H. Gómez-Adorno, and D. Pinto. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18(3), 2014.

N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14):1930–1941, 2013.

J. Vybíral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices.
Journal of Functional Analysis, 260(4):1096–1105, 2011.

C. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.

F. Yu, A. Suresh, K. Choromanski, D. Holtmann-Rice, and S. Kumar. Orthogonal random features. In NIPS, pages 1975–1983, 2016.

H. Zhang and L. Cheng. New bounds for circulant Johnson-Lindenstrauss embeddings. CoRR, abs/1308.6339, 2013.

X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang. Fast orthogonal projection based on Kronecker product. In ICCV, 2015.