{"title": "Simple strategies for recovering inner products from coarsely quantized random projections", "book": "Advances in Neural Information Processing Systems", "page_first": 4567, "page_last": 4576, "abstract": "Random projections have been increasingly adopted for a diverse set of tasks in machine learning involving dimensionality reduction. One specific line of research on this topic has investigated the use of quantization subsequent to projection with the aim of additional data compression. Motivated by applications in nearest neighbor search and linear learning, we revisit the problem of recovering inner products (respectively cosine similarities) in such setting. We show that even under coarse scalar quantization with 3 to 5 bits per projection, the loss in accuracy tends to range from ``negligible'' to ``moderate''. One implication is that in most scenarios of practical interest, there is no need for a sophisticated recovery approach like maximum likelihood estimation as considered in previous work on the subject. What we propose herein also yields considerable improvements in terms of accuracy over the Hamming distance-based approach in Li et al. (ICML 2014) which is comparable in terms of simplicity", "full_text": "Simple Strategies for Recovering Inner Products from\n\nCoarsely Quantized Random Projections\n\nPing Li\n\nBaidu Research, and\nRutgers University\n\npingli98@gmail.com\n\nMartin Slawski\n\nDepartment of Statistics\nGeorge Mason University\n\nmslawsk3@gmu.edu\n\nAbstract\n\nRandom projections have been increasingly adopted for a diverse set of tasks in\nmachine learning involving dimensionality reduction. One speci\ufb01c line of research\non this topic has investigated the use of quantization subsequent to projection\nwith the aim of additional data compression. Motivated by applications in nearest\nneighbor search and linear learning, we revisit the problem of recovering inner\nproducts (respectively cosine similarities) in such setting. We show that even under\ncoarse scalar quantization with 3 to 5 bits per projection, the loss in accuracy tends\nto range from \u201cnegligible\u201d to \u201cmoderate\u201d. One implication is that in most scenarios\nof practical interest, there is no need for a sophisticated recovery approach like\nmaximum likelihood estimation as considered in previous work on the subject.\nWhat we propose herein also yields considerable improvements in terms of accuracy\nover the Hamming distance-based approach in Li et al. (ICML 2014) which is\ncomparable in terms of simplicity.\n\n1\n\nIntroduction\n\nThe method of random projections (RPs) for linear dimensionality reduction has become more\nand more popular over the years after the basic theoretical foundation, the celebrated Johnson-\nLindenstrauss (JL) Lemma [12, 20, 33], had been laid out. In a nutshell, it states that it is possible\nto considerably lower the dimension of a set of data points by means of a linear map in such a way\nthat squared Euclidean distances and inner products are roughly preserved in the low-dimensional\nrepresentation. Conveniently, a linear map of this sort can be realized by a variety of random\nmatrices [1, 2, 18]. 
The scope of applications of RPs has expanded dramatically in the course of time, and includes dimension reduction in linear classification and regression [14, 30], similarity search [5, 17], compressed sensing [8], clustering [7, 11], randomized numerical linear algebra and matrix sketching [29], and differential privacy [21], among others.

The idea of achieving further data compression by means of subsequent scalar quantization of the projected data has been considered for a while. Such a setting can be motivated by constraints concerning data storage and communication, by locality-sensitive hashing [13, 27], or by the enhancement of privacy [31]. The extreme case of one-bit quantization can be associated with two seminal works in computer science, the SDP relaxation of the MAXCUT problem [16] and the simhash [10]. One-bit compressed sensing is introduced in [6], and along with its numerous extensions, has meanwhile developed into a subfield within the compressed sensing literature. A series of recent papers discuss quantized RPs with a focus on similarity estimation and search. The papers [25, 32] discuss quantized RPs with a focus on image retrieval based on nearest neighbor search. Independent of the specific application, [25, 32] provide JL-type statements for quantized RPs, and consider the trade-off between the number of projections and the number of bits per projection under a given budget of bits, as it also appears in the compressed sensing literature [24]. The paper [19] studies approximate JL-type results for quantized RPs in detail. The approach to quantized RPs taken in the present paper follows [27, 28], in which the problem of recovering distances and inner products is recast within the framework of classical statistical point estimation theory. The paper [28] discusses maximum likelihood estimation in this context, with an emphasis on the aforementioned trade-off between the number of RPs and the bit depth per projection. In the present paper we focus on the much simpler and computationally much more convenient approach in which the presence of the quantizer is ignored, i.e., quantized data are treated in the same way as full-precision data. We herein quantify the loss of accuracy of this approach relative to the full-precision case, which turns out to be insignificant in many scenarios of practical interest even under coarse quantization with 3 to 5 bits per projection. Moreover, we show that the approach compares favorably to the Hamming distance-based (or equivalently collision-based) scheme in [27], which is of similar simplicity. We argue that both approaches have their merits: the collision-based scheme performs better in preserving local geometry (the distances of nearby points), whereas the one studied in more detail herein yields better preservation globally.

Notation. For a positive integer m, we let [m] = {1, . . . , m}. For l ∈ [m], v_(l) denotes the l-th component of a vector v ∈ R^m; if there is no danger of confusion with another index, the brackets in the subscript are omitted. I(P) denotes the indicator function of expression P.

Supplement: Proofs and additional experimental results can be found in the supplement.

Basic setup. Let X = {x_1, . . . , x_n} ⊂ R^d be a set of input data with squared Euclidean norms λ_i² := ‖x_i‖₂², i ∈ [n]. We think of d as being large.
RPs reduce the dimensionality of the input data by means of a linear map A : R^d → R^k, k ≪ d. We assume throughout the paper that the map A is realized by a random matrix with i.i.d. entries from the standard Gaussian distribution, i.e., A_lj ∼ N(0, 1), l ∈ [k], j ∈ [d]. One standard goal of RPs is to approximately preserve distances in X while lowering the dimension, i.e., ‖Ax_i − Ax_j‖₂²/k ≈ ‖x_i − x_j‖₂² for all (i, j). This is implied by approximate inner product preservation ⟨x_i, x_j⟩ ≈ ⟨Ax_i, Ax_j⟩/k for all (i, j).

For the time being, we assume that it is possible to compute and store the squared norms {λ_i²}_{i=1}^n, and to rescale the input data to unit norm, i.e., one first forms x̃_i ← x_i/λ_i, i ∈ [n], before applying A. In this case, it suffices to recover the (cosine) similarities ρ_ij := ⟨x_i, x_j⟩/(λ_i λ_j) = ⟨x̃_i, x̃_j⟩, i, j ∈ [n], of the input data X from their compressed representation Z = {z_1, . . . , z_n}, z_i := A x̃_i, i ∈ [n].

2 Estimation of cosine similarity based on full-precision RPs

As preparation for later sections, we start by providing background concerning the usual setting without quantization. Let (Z, Z′)_r be random variables having a bivariate Gaussian distribution with zero mean, unit variance, and correlation r ∈ (−1, 1):

    (Z, Z′)_r ∼ N₂( (0, 0), ( (1, r), (r, 1) ) ).    (1)

Let further x, x′ be a generic pair of points from X, and let z := A x̃, z′ := A x̃′ be the counterpart in Z. Then the components {(z_(l), z′_(l))}_{l=1}^k of (z, z′) are distributed i.i.d. as in (1) with r = ρ := ⟨x̃, x̃′⟩. Hence the problem of recovering the cosine similarity of x and x′ can be re-cast as estimating the correlation from an i.i.d. sample of k bivariate Gaussian random variables. To simplify our exposition, we henceforth assume that 0 ≤ ρ < 1, as this can easily be achieved by flipping the sign of one of x or x′. The standard estimator of ρ is what is called the “linear estimator” herein:

    ρ̂_lin = ⟨z, z′⟩/k = (1/k) Σ_{l=1}^k z_(l) z′_(l).    (2)

As pointed out in [26], this estimator can be considerably improved upon by the maximum likelihood estimator (MLE) given (1):

    ρ̂_MLE = argmax_r { −(1/2) log(1 − r²) − (1/(2(1 − r²))) ( ‖z‖₂²/k + ‖z′‖₂²/k − 2r⟨z, z′⟩/k ) }.    (3)

The estimator ρ̂_MLE is not available in closed form, which is potentially a serious concern since it needs to be evaluated for numerous different pairs of data points. However, this can be addressed by tabulation of the two statistics { (‖z‖₂² + ‖z′‖₂²)/k, ⟨z, z′⟩/k } and the corresponding solutions ρ̂_MLE over a sufficiently fine grid. At processing time, computation of ρ̂_MLE can then be reduced to a look-up in a pre-computed table.
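For concreteness, the following sketch (our own illustration, not the authors' code; it assumes NumPy and replaces the pre-computed look-up table by a brute-force grid search) computes ρ̂_lin from (2) and ρ̂_MLE from (3) for a single pair of projected points:

```python
import numpy as np

def project(X, k, rng):
    """Gaussian random projection: rows of X are points in R^d, output rows are in R^k."""
    d = X.shape[1]
    A = rng.standard_normal((k, d))   # A_lj ~ N(0, 1)
    return X @ A.T                    # z_i = A x_i, one row per point

def rho_lin(z, zp):
    """Linear estimator (2): <z, z'> / k."""
    return np.dot(z, zp) / len(z)

def rho_mle(z, zp, grid=np.linspace(0.0, 0.999, 2000)):
    """MLE (3) by grid search over r (a stand-in for the tabulated look-up)."""
    k = len(z)
    s = (np.dot(z, z) + np.dot(zp, zp)) / k    # ||z||^2/k + ||z'||^2/k
    c = np.dot(z, zp) / k                      # <z, z'>/k
    ll = -0.5 * np.log(1 - grid**2) - 0.5 * (s - 2 * grid * c) / (1 - grid**2)
    return grid[np.argmax(ll)]

# toy usage: two unit-norm vectors with cosine similarity close to 0.8
rng = np.random.default_rng(0)
x = rng.standard_normal(10000); x /= np.linalg.norm(x)
w = rng.standard_normal(10000) / np.sqrt(10000)          # roughly unit norm, ~orthogonal to x
y = 0.8 * x + np.sqrt(1 - 0.8**2) * w; y /= np.linalg.norm(y)
Z = project(np.vstack([x, y]), k=1000, rng=rng)
print(rho_lin(Z[0], Z[1]), rho_mle(Z[0], Z[1]))
```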
One obvious issue of ρ̂_lin is that it does not respect the range of the underlying parameter. A natural fix is the use of the “normalized linear estimator”

    ρ̂_norm = ⟨z, z′⟩ / (‖z‖₂ ‖z′‖₂).    (4)

When comparing different estimators of ρ in terms of statistical accuracy, we evaluate the mean squared error (MSE), possibly asymptotically as the number of RPs k → ∞. Specifically, we consider

    MSE_ρ(ρ̂) = E_ρ[(ρ − ρ̂)²] = Bias²_ρ(ρ̂) + Var_ρ(ρ̂),    Bias_ρ(ρ̂) := E_ρ[ρ̂] − ρ,    (5)

where ρ̂ is some estimator, and the subscript ρ indicates that expectations are taken with respect to a sample (z, z′) following the bivariate normal distribution in (1) with r = ρ.

It turns out that ρ̂_norm and ρ̂_MLE can have dramatically lower (asymptotic) MSEs than ρ̂_lin for large values of ρ, i.e., for points of high cosine similarity. It can be shown that (cf. [4], p. 132, and [26])

    Bias_ρ(ρ̂_lin) = 0,            Var_ρ(ρ̂_lin) = (1 + ρ²)/k,    (6)
    Bias²_ρ(ρ̂_norm) = O(1/k²),    Var_ρ(ρ̂_norm) = (1 − ρ²)²/k + O(1/k²),    (7)
    Bias²_ρ(ρ̂_MLE) = O(1/k²),     Var_ρ(ρ̂_MLE) = (1 − ρ²)²/((1 + ρ²) k) + O(1/k²).    (8)

While for ρ = 0 the (asymptotic) MSEs are the same, we note that the leading terms of the MSEs of ρ̂_norm and ρ̂_MLE decay at rate Θ((1 − ρ)²) as ρ → 1, whereas the MSE of ρ̂_lin grows with ρ. The following table provides the asymptotic MSE ratios of ρ̂_lin and ρ̂_norm for selected values of ρ.

    ρ                                 0.5   0.6   0.7   0.8    0.9   0.95   0.99
    MSE_ρ(ρ̂_lin) / MSE_ρ(ρ̂_norm)    2.2   3.3   5.7   12.6   50    200    5000

In conclusion, if it is possible to pre-compute and store the norms of the data prior to dimensionality reduction, a simple form of normalization can yield important benefits with regard to the recovery of inner products and distances for pairs of points having high cosine similarity. The MLE can provide a further refinement, but the improvement over ρ̂_norm can be at most by a factor of 2.
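As a quick sanity check of the table above (a small illustration we add here, assuming NumPy): the ratio of the leading MSE terms in (6) and (7) is (1 + ρ²)/(1 − ρ²)², which reproduces the tabulated values directly.

```python
import numpy as np

rho = np.array([0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
ratio = (1 + rho**2) / (1 - rho**2) ** 2   # leading term of MSE(rho_lin) / MSE(rho_norm)
for r, q in zip(rho, ratio):
    print(f"rho = {r:4.2f}  ->  MSE ratio ~ {q:7.1f}")
# prints approximately 2.2, 3.3, 5.7, 12.6, 50, 200, 5000
```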
3 Estimation of cosine similarity based on quantized RPs

The following section contains our main results. After introducing preliminaries regarding quantization, we review previous approaches to the problem, before analyzing estimators following a different paradigm. We conclude with a comparison and some recommendations about what to use in practice.

Quantization. After obtaining the projected data Z, the next step is scalar quantization. Let t = (t_1, . . . , t_{K−1}) with 0 = t_0 < t_1 < . . . < t_{K−1} < t_K = +∞ be a set of thresholds inducing a partitioning of the positive real line into K intervals {[t_{s−1}, t_s), s ∈ [K]}, and let M = {μ_1, . . . , μ_K} be a set of codes with μ_s representing interval [t_{s−1}, t_s), s ∈ [K]. Given t and M, the scalar quantizer (or quantization map) is defined by

    Q : R → M± := −M ∪ M,    z ↦ Q(z) = sign(z) Σ_{s=1}^K μ_s I(|z| ∈ [t_{s−1}, t_s)).    (9)

The projected and quantized data result as Q = {q_i}_{i=1}^n ⊂ (M±)^k, q_i = ( Q(z_{i(l)}) )_{l=1}^k, where z_{i(l)} denotes the l-th component of z_i ∈ Z, l ∈ [k], i ∈ [n]. The bit depth b of the quantizer is given by b := 1 + log₂(K). For simplicity, we only consider the case where b is an integer. The case b = 1 is well-studied [10, 27] and is hence disregarded in our analysis to keep our exposition compact.

Bin-based vs. code-based approaches. Let q = Q(z) and q′ = Q(z′) be the points resulting from quantization of the generic pair z, z′ in the previous section. In this paper, we distinguish between two basic paradigms for estimating the cosine similarity of the underlying pair x, x′ from q, q′. The first paradigm, which we refer to as bin-based estimation, does not make use of the specific values of the codes M±, but only of the intervals (“bins”) associated with each code. This is opposite to the second paradigm, referred to as code-based estimation, which only makes use of the values of the codes. As we elaborate below, an advantage of the bin-based approach is that working with intervals reflects the process of quantization more faithfully and hence can be statistically more accurate; on the other hand, a code-based approach tends to be more convenient from the point of view of computation. In this paper, we make a case for the code-based approach by showing that the loss in statistical accuracy can be fairly minor in several regimes of practical interest.

Lloyd-Max (LM) quantizer. With b respectively K being fixed, one needs to choose the thresholds t and the codes M of the quantizer (the second is crucial only for a code-based approach). In our setting, with z_{i(l)} ∼ N(0, 1), i ∈ [n], l ∈ [k], which is inherited from the distribution of the entries of A, a standard choice is LM quantization [15], which minimizes the squared distortion error:

    (t⋆, μ⋆) = argmin_{t,μ} E_{g∼N(0,1)}[ {g − Q(g; t, μ)}² ].    (10)

Problem (10) can be solved by an iterative scheme that alternates between optimization of t for fixed μ and vice versa. That scheme can be shown to deliver the global optimum [22]. In the absence of any prior information about the cosine similarities that we would like to recover, (10) appears as a reasonable default whose use for bin-based estimation has been justified in [28]. In the limit of cosine similarity ρ → 1, it may seem more plausible to use (10) with g replaced by its square, and taking the root of the resulting optimal thresholds and codes. However, it turns out that empirically this yields reduced performance more often than improvements, hence we stick to (10) in the sequel.
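The alternating scheme for (10) is straightforward to implement. The sketch below is our own minimal version (assuming NumPy/SciPy; the authors do not provide code here): for a standard Gaussian, the optimal code for a bin is its conditional mean, and the optimal threshold between two adjacent codes is their midpoint.

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(K, n_iter=200):
    """Lloyd-Max quantizer for |g|, g ~ N(0,1): K codes on the positive half-line.

    Returns thresholds t_1 < ... < t_{K-1} (with t_0 = 0, t_K = inf implied) and
    codes mu_1, ..., mu_K; the full quantizer (9) is Q(z) = sign(z) * mu_s for |z| in [t_{s-1}, t_s).
    """
    # initialize thresholds at quantiles of |g|
    t = norm.ppf(0.5 + 0.5 * np.arange(1, K) / K)
    for _ in range(n_iter):
        edges = np.concatenate(([0.0], t, [np.inf]))
        lo, hi = edges[:-1], edges[1:]
        # optimal codes: conditional means E[|g| | |g| in [lo, hi)]
        mu = (norm.pdf(lo) - norm.pdf(hi)) / (norm.cdf(hi) - norm.cdf(lo))
        # optimal thresholds: midpoints of adjacent codes
        t = 0.5 * (mu[:-1] + mu[1:])
    return t, mu

def quantize(Z, t, mu):
    """Apply the scalar quantizer (9) componentwise."""
    idx = np.searchsorted(t, np.abs(Z), side="right")   # bin index in 0..K-1
    return np.sign(Z) * mu[idx]

t, mu = lloyd_max_gaussian(K=8)   # b = 1 + log2(8) = 4 bits per projection
```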
3.1 Bin-based approaches

MLE. Given a pair q = (q_(l))_{l=1}^k and q′ = (q′_(l))_{l=1}^k of projected and quantized points, maximum likelihood estimation of the underlying cosine similarity ρ is studied in depth in [28]. The associated likelihood function L(r) is based on bivariate normal probabilities of the form P_r(Z ∈ [t_{s−1}, t_s), Z′ ∈ [t_{u−1}, t_u)) and P_{−r}(Z ∈ [t_{s−1}, t_s), Z′ ∈ [t_{u−1}, t_u)), with (Z, Z′)_r as in (1). It is shown in [28] that the MLE with b ≥ 2 can be more efficient at the bit level than common single-bit quantization [10, 16]; the optimal choice of b increases with ρ. While statistically optimal in the given setting, the MLE remains computationally cumbersome even when using the approximation in [28], because it requires cross-tabulation of the empirical frequencies corresponding to the bivariate normal probabilities above. This makes the use of the MLE unattractive particularly in situations in which it is not feasible to materialize all O(n²) pairwise similarities estimable from (q_i, q_j)_{i<j}, so that they would need to be re-evaluated frequently.

Collision-based estimator. The collision-based estimator proposed in [27] is, like the MLE, a bin-based approach. The similarity ρ is estimated as ρ̂_col = θ⁻¹( Σ_{l=1}^k I(q_(l) = q′_(l))/k ), where the map θ : [0, 1] → [0, 1] is defined by r ↦ θ(r) = P_r(Q(Z) = Q(Z′)), shown to be monotonically increasing in [27]. Compared to the MLE, ρ̂_col uses less information – it only counts “collisions”, i.e., events {q_(l) = q′_(l)}. The loss in statistical efficiency is moderate for b = 2, in particular for ρ close to 1. However, as b increases that loss becomes more and more substantial; cf. Figure 1. On the positive side, ρ̂_col is convenient to compute given that the evaluation of the function θ⁻¹ can be approximated by employing a look-up table after tabulating θ on a fine grid.

Figure 1: (L): Asymptotic MSEs [27] of ρ̂_col (to be divided by k) for 2 ≤ b ≤ 4. (M, R): Asymptotic relative efficiencies MSE_ρ(ρ̂_col)/MSE_ρ(ρ̂_MLE) for b ∈ {2, 4}, where ρ̂_MLE is the MLE in [28].

Figure 2: (L): Bias²_ρ(ρ̂_lin) and the bound of Theorem 1. (M): uniform upper bounds on Bias²_ρ(ρ̂_lin) obtained from Theorem 1 by setting ρ = 1, given in the following table. (R): Var_ρ(ρ̂_lin) (to be divided by k).

    b    bound on Bias²
    2    1.2 · 10⁻¹
    3    7.2 · 10⁻³
    4    4.5 · 10⁻⁴
    5    2.8 · 10⁻⁵
    6    1.8 · 10⁻⁶
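As a concrete illustration of the collision-based estimator from §3.1 (our sketch, not the authors' implementation; it assumes NumPy and reuses the hypothetical quantize/lloyd_max_gaussian helpers from the sketch after (10)), θ can be tabulated on a grid by Monte Carlo and inverted by interpolation, mirroring the look-up table idea described above:

```python
import numpy as np

def tabulate_theta(t, grid, n_mc=200_000, seed=0):
    """Monte Carlo tabulation of theta(r) = P_r(Q(Z) = Q(Z')) over a grid of r values.

    Only the thresholds t matter: two components collide iff they share the sign
    and fall into the same bin of the partition induced by t.
    """
    rng = np.random.default_rng(seed)
    g1 = rng.standard_normal(n_mc)
    g2 = rng.standard_normal(n_mc)

    def bins(v):
        return np.sign(v) * (np.searchsorted(t, np.abs(v), side="right") + 1)

    theta = np.empty(len(grid))
    for i, r in enumerate(grid):
        z, zp = g1, r * g1 + np.sqrt(1 - r**2) * g2     # correlation r
        theta[i] = np.mean(bins(z) == bins(zp))
    return theta

def rho_col(q, qp, grid, theta):
    """Collision-based estimator: invert theta at the observed collision rate."""
    rate = np.mean(q == qp)
    return np.interp(rate, theta, grid)   # theta is increasing in r

grid = np.linspace(0.0, 0.999, 200)
# theta = tabulate_theta(t, grid)                     # t from the Lloyd-Max sketch (K = 2 for b = 2)
# rho_hat = rho_col(quantize(z, t, mu), quantize(zp, t, mu), grid, theta)
```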
3.2 Code-based approaches

In the code-based approach, we simply ignore the fact that the quantized data actually represent intervals and treat them precisely in the same way as full-precision data. Recovery of cosine similarity is performed by means of the estimators in §2 with z, z′ replaced by q, q′. Perhaps surprisingly, it turns out that depending on ρ the loss of information incurred by this rather crude approach can be small already for bit depths between b = 3 and b = 5. That loss increases with ρ, with a fundamental gap compared to bin-based approaches and to the full precision case in the limit ρ → 1.

Linear estimator. We first consider ρ̂_lin = ⟨q, q′⟩/k. We note that ρ̂_lin = ρ̂_{lin,b} depends on b; b = ∞ corresponds to the estimator ρ̂_lin = ρ̂_{lin,∞} in §2 denoted by the same symbol. A crucial difference between the code-based and the bin-based approaches discussed above is that the latter have vanishing asymptotic squared bias of the order O(k⁻²) for any b [27, 28]. This is not the case for code-based approaches, whose bias needs to be analyzed carefully. The exact bias of ρ̂_lin in dependence of ρ and b can be evaluated numerically. Numerical evaluations of bias and variance of estimators discussed in the present section only rely on the computation of coefficients θ_{α,β} defined by

    θ_{α,β} := E_ρ[ Q(Z)^α Q(Z′)^β ] = Σ_{s,u=1}^K Σ_{σ,σ′∈{−1,1}} σ^α (σ′)^β μ_s^α μ_u^β P_ρ( Z ∈ σ(t_{s−1}, t_s), Z′ ∈ σ′(t_{u−1}, t_u) ),    (11)

where α, β are non-negative integers and (Z, Z′) are bivariate normal (1) with r = ρ. Specifically, we have E_ρ[ρ̂_lin] = θ_{1,1} and Var_ρ(ρ̂_lin) = (θ_{2,2} − θ_{1,1}²)/k. In addition to exact numerical evaluation, we provide a bound on the bias of ρ̂_lin which quantifies explicitly the rate of decay in dependence of b.

Theorem 1. We have Bias²_ρ(ρ̂_lin) ≤ 4ρ²D_b², where D_b = (3^{3/2} · 2π / 12) · 2^{−2b} ≈ 2.72 · 2^{−2b}.

As shown in Figure 2 (L), the bound on the squared bias in Theorem 1 constitutes a reasonable proxy of the exact squared bias. The rate of decay is O(2^{−4b}). Moreover, it can be verified numerically that the variance in the full precision case upper bounds the variance for finite b, i.e., Var_ρ(ρ̂_{lin,b}) ≤ Var_ρ(ρ̂_{lin,∞}), ρ ∈ [0, 1). Combining bias and variance, we may conclude that depending on k, the MSE of ρ̂_lin based on coarsely quantized data does not tend to be far from what is achieved with full precision data. The following two examples illustrate this point.

(i) Suppose k = 100 and b = 3. With full precision, we have MSE_ρ(ρ̂_{lin,∞}) = (1 + ρ²)/k ∈ [.01, .02]. From Figure 2 (M) and the observation that Var_ρ(ρ̂_{lin,3}) ≤ Var_ρ(ρ̂_{lin,∞}), we find that the MSE can go up by at most 7.2 · 10⁻³, i.e., it can at most double relative to the full precision case.

(ii) Suppose k = 1000 and b = 4. With the same reasoning as in (i), the MSE under quantization can increase at most by a factor of 1.45 as compared to full precision data.

Figure 3 shows that these numbers still tend to be conservative. In general, the difference of the MSEs for b = ∞ on the one hand and b ∈ {3, 4, 5} on the other hand gets more pronounced for large values of the similarity ρ and large values of k. This is attributed to the (squared) bias of ρ̂_lin. In particular, it does not pay off to choose k significantly larger than the order dictated by the squared bias (i.e., beyond the point where the variance term of order 1/k drops below the squared bias).
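The coefficients θ_{α,β} in (11) are easy to approximate numerically; the following sketch (ours, using simple Monte Carlo rather than exact bivariate-normal integration, and assuming the hypothetical quantize helper from the Lloyd-Max sketch) evaluates E_ρ[ρ̂_lin] = θ_{1,1} and Var_ρ(ρ̂_lin) = (θ_{2,2} − θ_{1,1}²)/k.

```python
import numpy as np

def theta_ab(alpha, beta, rho, t, mu, n_mc=500_000, seed=0):
    """Monte Carlo approximation of theta_{alpha,beta} = E[Q(Z)^alpha Q(Z')^beta]."""
    rng = np.random.default_rng(seed)
    g1 = rng.standard_normal(n_mc)
    g2 = rng.standard_normal(n_mc)
    z = g1
    zp = rho * g1 + np.sqrt(1 - rho**2) * g2        # (Z, Z') with correlation rho
    return np.mean(quantize(z, t, mu) ** alpha * quantize(zp, t, mu) ** beta)

def lin_bias_var(rho, t, mu, k):
    """Bias and variance of the code-based linear estimator rho_lin,b."""
    th11 = theta_ab(1, 1, rho, t, mu)
    th22 = theta_ab(2, 2, rho, t, mu)
    return th11 - rho, (th22 - th11**2) / k

# e.g., lin_bias_var(0.9, t, mu, k=1000) with (t, mu) from the b = 4 Lloyd-Max quantizer
```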
In\n\nparticular, it does not pay off to choose k signi\ufb01cantly larger than the order of the squared bias.\n\n5\n\n00.20.40.60.81-11-10-9-8-7-6-5-4-3-2-1log10(squared Bias)b = 2b = 3b = 4b = 5b = 600.20.40.60.810.811.21.41.61.82variance\fFigure 3: MSEs of(cid:98)\u03c1lin for various k and b \u2208 {3, 4, 5} (dotted). The solid (red) lines indicate the\ncorresponding MSEs for(cid:98)\u03c1lin in the full-precision case (b = \u221e).\nform (cid:98)\u03c1norm = (cid:104)z, z(cid:48)(cid:105) /((cid:107)z(cid:107)2 (cid:107)z(cid:48)(cid:107)2) can yield substantial bene\ufb01ts. Interestingly, it turns out that\nthe counterpart(cid:98)\u03c1norm = (cid:104)q, q(cid:48)(cid:105) /((cid:107)q(cid:107)2 (cid:107)q(cid:48)(cid:107)2) for quantized data is even more valuable as it helps\nreducing the bias of(cid:98)\u03c1lin = (cid:104)q, q(cid:48)(cid:105) /k. This effect can be seen easily in the limit \u03c1 \u2192 1 in which case\nBias\u03c1((cid:98)\u03c1norm) \u2192 0 by construction. In general, bias and variance can be evaluated as follows.\n\nNormalized estimator. In the full precision case we have seen that simple normalization of the\n\nProposition 1. In terms of the coef\ufb01cients \u03b8\u03b1,\u03b2 de\ufb01ned in (11), as k \u2192 \u221e, we have\n\n| Bias\u03c1[(cid:98)\u03c1norm]| =(cid:12)(cid:12) \u03b81,1\n(cid:16) \u03b82,2\nVar((cid:98)\u03c1norm) = 1\n\n\u03b82,0\n\n\u03b82\n2,0\n\nk\n\n\u2212 \u03c1(cid:12)(cid:12) + O(k\u22121)\n\n\u2212 2\u03b81,1\u03b83,1\n\n\u03b83\n2,0\n\n+\n\n(cid:17)\n\n\u03b82\n1,1(\u03b84,0+\u03b82,2)\n\n2\u03b84\n\n2,0\n\n+ O(k\u22122).\n\nFigure 4 (L,M) graphs the above two expressions. In particular, the plots highlight the reduction\n\nin bias compared to (cid:98)\u03c1lin and the fact that the variance is decreasing in \u03c1 as for b = \u221e. While\n\nProposition 1 is asymptotic, we verify a tight agreement in simulations for reasonably small k\n(cf. supplement).\n\nFigure 4: (L): Asymptotic Bias2\n\n\u03c1((cid:98)\u03c1lin). (M): Var\u03c1((cid:98)\u03c1norm) (asymptotic, to be\ndivided by k). (R): MSEs of(cid:98)\u03c1lin,4 vs. the MSEs of(cid:98)\u03c1coll,2 using twice the number of RPs (comparison\n\n\u03c1((cid:98)\u03c1norm) relative to Bias2\n\nat the bit level). The stars indicate the values of \u03c1 at which the MSEs of the two estimators are equal.\n\n3.3 Coding-based estimation vs. Collision-based estimation\n\napproaches (for \ufb01xed k) intersect are indicated by stars. As k decreases from 104 to 102, these values\n\nBoth schemes are comparable in terms of simplicity, but at the level of statistical performance none\nof the two dominates the other. The collision-based approach behaves favorably in a high similarity\n\nregime as shows a comparison of MSE\u03c1((cid:98)\u03c1col) (b = 2) and MSE\u03c1((cid:98)\u03c1norm) (b = 4) at the bit level\n(Figure 4 (R)): since(cid:98)\u03c1col uses only two bits for each of the k RPs, while(cid:98)\u03c1norm uses twice as many\nbits, we have doubled the number of RPs for(cid:98)\u03c1col. The values of \u03c1 for which the curves of the two\nincrease from about \u03c1 = 0.55 to \u03c1 = 0.95. In conclusion,(cid:98)\u03c1col is preferable in applications in which\nFigure 1 (L) shows that as b is raised,(cid:98)\u03c1col requires \u03c1 to be increasingly closer to one to achieve lower\n\nhigh similarities prevail, e.g., in duplicate detection. 
An interesting phenomenon occurs in the limit ρ → 1. It turns out that the rate of decay of Var_ρ(ρ̂_norm) is considerably slower than the rate of decay of Var_ρ(ρ̂_col).

Theorem 2. For any finite b, we have

    Var_ρ(ρ̂_norm) = Θ((1 − ρ)^{1/2}),    Var_ρ(ρ̂_col) = Θ((1 − ρ)^{3/2})    as ρ → 1.

The rate Θ((1 − ρ)^{3/2}) is the same as for the MLE [28], which is slower than the rate Θ((1 − ρ)²) in the full precision case (cf. §2). We conjecture that the rate Θ((1 − ρ)^{1/2}) is intrinsic to code-based estimation, as this rate is also obtained when computing the full precision MLE (3) with quantized data (i.e., z, z′ get replaced by q, q′).

3.4 Quantization of norms

Let us recall that according to our basic setup in §1, we have assumed so far that it is possible to compute the norms λ_i = ‖x_i‖₂, i ∈ [n], of the original data prior to projection and quantization, and store them in full precision to approximately recover inner products and squared distances via

    ⟨x_i, x_j⟩ ≈ λ_i λ_j ρ̂_ij,    ‖x_i − x_j‖₂² ≈ λ_i² + λ_j² − 2 λ_i λ_j ρ̂_ij,

where ρ̂_ij is an estimate of the cosine similarity of x_i and x_j. Depending on the setting, it may be required to quantize the {λ_i}_{i=1}^n as well. It turns out that the MSE for estimating distances can be tightly bounded in terms of the MSE for estimating cosine similarities and max_{1≤i≤n} |λ̂_i − λ_i|, where {λ̂_i}_{i=1}^n denote the quantized versions of {λ_i}_{i=1}^n; the precise bound is stated in the supplement.
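A small sketch of how stored (possibly quantized) norms are combined with an estimated cosine similarity as above; this is our illustration, assuming NumPy, and the uniform norm quantizer is one possible choice we add here rather than the scheme analyzed in the supplement.

```python
import numpy as np

def recover_inner_product(lam_i, lam_j, rho_hat):
    """<x_i, x_j>  ~  lambda_i * lambda_j * rho_hat."""
    return lam_i * lam_j * rho_hat

def recover_sq_distance(lam_i, lam_j, rho_hat):
    """||x_i - x_j||^2  ~  lambda_i^2 + lambda_j^2 - 2 * lambda_i * lambda_j * rho_hat."""
    return lam_i**2 + lam_j**2 - 2.0 * lam_i * lam_j * rho_hat

def quantize_norms(lams, n_bits=8):
    """Uniform quantization of the norms to 2^n_bits levels (illustrative choice only)."""
    lams = np.asarray(lams, dtype=float)
    lo, hi = lams.min(), lams.max()
    if hi == lo:
        return lams.copy()
    levels = 2 ** n_bits - 1
    return lo + np.round((lams - lo) / (hi - lo) * levels) / levels * (hi - lo)
```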
4 Empirical results: linear classification using quantized RPs

One traditional application of RPs is dimension reduction in linear regression or classification with high-dimensional predictors [14, 30]. The results of §3.2 suggest that as long as the number of RPs k is no more than a few thousand, subsequent scalar quantization to four bits is not expected to have much of a negative effect relative to using full precision data. In this section, we verify this hypothesis for four high-dimensional data sets from the UCI repository: arcene (d = 10⁴), Dexter (d = 2 · 10⁴), farm (d = 5.5 · 10⁴) and PEMS (d = 1.4 · 10⁵).

Setup. All data points are scaled to unit Euclidean norm before dimension reduction and scalar quantization based on the Lloyd-Max quantizer (10). The number of RPs k is varied according to {2⁶, 2⁷, . . . , 2¹²}. For each of these values of k, we consider 20 independent realizations of the random projection matrix A. Given projected and quantized data {q_1, . . . , q_n}, we estimate the underlying cosine similarities ρ_ij as ρ̂_ij = ρ̂(q_i, q_j), i, j ∈ [n], where ρ̂(q_i, q_j) is a placeholder for either the collision-based estimator ρ̂_coll based on b = 2 bits or the normalized estimator ρ̂_norm for b ∈ {1, 2, 4, ∞}, using data {q_{i(l)}, q_{j(l)}}_{l=1}^k; one-bit quantization (b = 1) is here included as a reference. The {ρ̂_ij}_{1≤i,j≤n} are then used as a kernel matrix fed into LIBSVM [9] to train a binary classifier. Prediction on test sets is performed accordingly. LIBSVM is run with 30 different values of its tuning parameter C ranging from 10⁻³ to 10⁴.

Results. A subset of the results is depicted in Figure 5, which is composed of three columns (one for each type of plot) and four rows (one for each data set). All results are averages over 20 independent sets of random projections. The plots in the left column show the minimum test errors over all 30 choices of the tuning parameter C under consideration in dependence of the number of RPs k. The plots in the middle column show the test errors in dependence of C for a selected value of k (the full set of plots can be found in the supplement). The plots in the right column provide a comparison of the minimum (w.r.t. C) test errors of ρ̂_{coll,2} and ρ̂_{norm,4} at the bit level, i.e., with k doubled for ρ̂_{coll,2}. In all plots, classification performance improves as b increases. What is more notable though is that the gap between b = 4 and b = ∞ is indeed minor, as anticipated. Regarding ρ̂_{coll,2} and ρ̂_{norm,4}, the latter consistently achieves better performance.

Figure 5: Results of the classification experiments (data sets: arcene, Dexter, farm, PEMS). Each row corresponds to one data set. (L): Accuracy on the test set (optimized over C) in dependence of the number of RPs k (log₂ scale). (M): Accuracy on the test set for a selected value of k in dependence of log₁₀(C). (R): Comparison of the test accuracies when using the estimators ρ̂_{norm,4} respectively ρ̂_{coll,2} with twice the number of RPs.
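The experimental pipeline can be mimicked with standard tools. The following sketch is our approximation of the setup, not the authors' code: it uses scikit-learn's SVC with a precomputed kernel (which wraps LIBSVM) in place of the LIBSVM command-line tools, and the hypothetical project, lloyd_max_gaussian and quantize helpers from the earlier sketches; the estimated similarity matrix is used directly as a (possibly indefinite) kernel, as described above.

```python
import numpy as np
from sklearn.svm import SVC

def similarity_kernel(Q):
    """Pairwise normalized code-based estimates rho_norm(q_i, q_j) as a kernel matrix."""
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    return (Q @ Q.T) / (norms * norms.T)

def run_experiment(X_train, y_train, X_test, y_test, k=512, K=8, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = np.vstack([X_train, X_test])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # scale points to unit norm
    Z = project(X, k, rng)                             # Gaussian RPs
    t, mu = lloyd_max_gaussian(K)                      # b = 1 + log2(K) bits per projection
    Q = quantize(Z, t, mu)
    G = similarity_kernel(Q)
    n_tr = len(X_train)
    clf = SVC(C=C, kernel="precomputed").fit(G[:n_tr, :n_tr], y_train)
    return clf.score(G[n_tr:, :n_tr], y_test)           # test accuracy
```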
5 Conclusion

In this paper, we have presented theoretical and empirical evidence that it is possible to achieve additional data compression in the use of random projections by means of coarse scalar quantization. The loss of information incurred at this step tends to be mild even with the naive approach in which quantized data are treated in the same way as their full precision counterparts. An exception only arises for cosine similarities close to 1 (Theorem 2). We have also shown that the simple form of normalization employed in the construction of the estimator ρ̂_norm can be extremely beneficial, even more so for coarsely quantized data because of a crucial bias reduction.

Regarding future work, it is worthwhile to consider the extension to the case in which the random projections are not Gaussian but arise from one of the various structured Johnson-Lindenstrauss transforms, e.g., those in [2, 3, 23]. A second direction of interest is to analyze the optimal trade-off between the number of RPs k and the bit depth b in dependence of the similarity ρ; in the present work, the choice of b has been driven by the goal of roughly matching the full precision case.

Acknowledgments

The work was partially supported by NSF-Bigdata-1419210 and NSF-III-1360971. Ping Li also thanks Michael Mitzenmacher for helpful discussions.

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66:671–687, 2003.

[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Symposium on Theory of Computing (STOC), pages 557–563, 2006.

[3] N. Ailon and E. Liberty. Almost optimal unrestricted fast Johnson–Lindenstrauss transform. ACM Transactions on Algorithms, 9:21, 2013.

[4] T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.

[5] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Conference on Knowledge Discovery and Data Mining (KDD), pages 245–250, 2001.

[6] P. Boufounos and R. Baraniuk. 1-bit compressive sensing. In Information Science and Systems, 2008.

[7] C. Boutsidis, A. Zouzias, and P. Drineas. Random Projections for k-means Clustering. In Advances in Neural Information Processing Systems (NIPS), pages 298–306, 2010.

[8] E. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52:5406–5425, 2006.

[9] C-C. Chang and C-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[10] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC), pages 380–388, 2002.

[11] S. Dasgupta. Learning mixtures of Gaussians. In Symposium on Foundations of Computer Science (FOCS), pages 634–644, 1999.

[12] S. Dasgupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22:60–65, 2003.

[13] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Symposium on Computational Geometry (SCG), pages 253–262, 2004.

[14] D. Fradkin and D. Madigan. Experiments with random projections for machine learning.
In Conference on Knowledge Discovery and Data Mining (KDD), pages 517–522, 2003.

[15] A. Gersho and R. Gray. Vector Quantization and Signal Compression. Springer, 1991.

[16] M. Goemans and D. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. Journal of the ACM, 42:1115–1145, 1995.

[17] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing (STOC), pages 604–613, 1998.

[18] J. Matousek. On variants of the Johnson-Lindenstrauss lemma. Random Structures and Algorithms, 33:142–156, 2008.

[19] L. Jacques. A Quantized Johnson-Lindenstrauss Lemma: The Finding of Buffon's needle. IEEE Transactions on Information Theory, 61:5012–5027, 2015.

[20] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, pages 189–206, 1984.

[21] K. Kenthapadi, A. Korolova, I. Mironov, and N. Mishra. Privacy via the Johnson-Lindenstrauss Transform. Journal of Privacy and Confidentiality, 5, 2013.

[22] J. Kieffer. Uniqueness of locally optimal quantizer for log-concave density and convex error weighting function. IEEE Transactions on Information Theory, 29:42–47, 1983.

[23] F. Krahmer and R. Ward. New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property. SIAM Journal on Mathematical Analysis, 43:1269–1281, 2011.

[24] J. Laska and R. Baraniuk. Regime change: Bit-depth versus measurement-rate in compressive sensing. IEEE Transactions on Signal Processing, 60:3496–3505, 2012.

[25] M. Li, S. Rane, and P. Boufounos. Quantized embeddings of scale-invariant image features for mobile augmented reality. In International Workshop on Multimedia Signal Processing (MMSP), pages 1–6, 2012.

[26] P. Li, T. Hastie, and K. Church. Improving Random Projections Using Marginal Information. In Annual Conference on Learning Theory (COLT), pages 635–649, 2006.

[27] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for Random Projections. In Proceedings of the International Conference on Machine Learning (ICML), pages 676–678, 2014.

[28] P. Li, M. Mitzenmacher, and M. Slawski. Quantized Random Projections and Non-Linear Estimation of Cosine Similarity. In Advances in Neural Information Processing Systems (NIPS), pages 2756–2764, 2016.

[29] M. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3:123–224, 2011.

[30] O. Maillard and R. Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems (NIPS), pages 1213–1221, 2009.

[31] S. Rane and P. Boufounos. Privacy-preserving nearest neighbor methods: Comparing signals without revealing them. IEEE Signal Processing Magazine, 30:18–28, 2013.

[32] S. Rane, P. Boufounos, and A. Vetro. Quantized embeddings: An efficient and universal nearest neighbor method for cloud-based image retrieval. In SPIE Optical Engineering and Applications, pages 885609–885609. International Society for Optics and Photonics, 2013.

[33] S. Vempala. The Random Projection Method.
American Mathematical Society, 2005.", "award": [], "sourceid": 2384, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": "Rutgers University"}, {"given_name": "Martin", "family_name": "Slawski", "institution": null}]}