{"title": "A Greedy Approach for Budgeted Maximum Inner Product Search", "book": "Advances in Neural Information Processing Systems", "page_first": 5453, "page_last": 5462, "abstract": "Maximum Inner Product Search (MIPS) is an important task in many machine learning applications such as the prediction phase of low-rank matrix factorization models and deep learning models. Recently, there has been substantial research on how to perform MIPS in sub-linear time, but most of the existing work does not have the flexibility to control the trade-off between search efficiency and search quality. In this paper, we study the important problem of MIPS with a computational budget. By carefully studying the problem structure of MIPS, we develop a novel Greedy-MIPS algorithm, which can handle budgeted MIPS by design. While simple and intuitive, Greedy-MIPS yields surprisingly superior performance compared to state-of-the-art approaches. As a specific example, on a candidate set containing half a million vectors of dimension 200, Greedy-MIPS runs 200x faster than the naive approach while yielding search results with the top-5 precision greater than 75%.", "full_text": "A Greedy Approach for Budgeted Maximum Inner Product Search\n\nHsiang-Fu Yu*\nAmazon Inc.\nrofuyu@cs.utexas.edu\n\nCho-Jui Hsieh\nUniversity of California, Davis\nchohsieh@ucdavis.edu\n\nQi Lei\nThe University of Texas at Austin\nleiqi@ices.utexas.edu\n\nInderjit S. Dhillon\nThe University of Texas at Austin\ninderjit@cs.utexas.edu\n\nAbstract\n\nMaximum Inner Product Search (MIPS) is an important task in many machine\nlearning applications such as the prediction phase of low-rank matrix factorization\nmodels and deep learning models. Recently, there has been substantial research\non how to perform MIPS in sub-linear time, but most of the existing work does\nnot have the flexibility to control the trade-off between search efficiency and\nsearch quality. 
In this paper, we study the important problem of MIPS with a\ncomputational budget. By carefully studying the problem structure of MIPS, we\ndevelop a novel Greedy-MIPS algorithm, which can handle budgeted MIPS by\ndesign. While simple and intuitive, Greedy-MIPS yields surprisingly superior\nperformance compared to state-of-the-art approaches. As a specific example, on a\ncandidate set containing half a million vectors of dimension 200, Greedy-MIPS\nruns 200x faster than the naive approach while yielding search results with the\ntop-5 precision greater than 75%.\n\n1 Introduction\nIn this paper, we study the computational issue in the prediction phase for many embedding based\nmodels such as matrix factorization and deep learning models in recommender systems, which can be\nmathematically formulated as a Maximum Inner Product Search (MIPS) problem. Specifically, given\na large collection of n candidate vectors $\\mathcal{H} = \\{h_j \\in \\mathbb{R}^k : j = 1, \\ldots, n\\}$ and a query vector $w \\in \\mathbb{R}^k$,\nMIPS aims to identify a subset of candidates that have the largest inner product values with w. We\nalso denote $H = [h_1, \\ldots, h_j, \\ldots, h_n]^\\top$ as the candidate matrix. A naive linear search procedure\nto solve MIPS for a given query w requires O(nk) operations to compute n inner products and\nO(n log n) operations to obtain the sorted ordering of the n candidates.\nRecently, MIPS has drawn a lot of attention in the machine learning community due to its wide\napplicability, such as the prediction phase of embedding based recommender systems [6, 7, 10].\nIn such an embedding based recommender system, each user i is associated with a vector $w_i$ of\ndimension k, while each item j is associated with a vector $h_j$ of dimension k. The interaction (such\nas preference) between a user and an item is modeled by $w_i^\\top h_j$. It is clear that identifying top-ranked\nitems in such a system for a user is exactly a MIPS problem. 
Because both the number of users\n(the number of queries) and the number of items (size of vector pool in MIPS) can easily grow to\nmillions, a naive linear search is extremely expensive; for example, to compute the preference for all\nm users over n items with latent embeddings of dimension k in a recommender system requires at\nleast O(mnk) operations. When both m and n are large, the prediction procedure is extremely time\nconsuming; it is even slower than the training procedure used to obtain the m + n embeddings, which\ncosts only $O(|\\Omega| k)$ operations per iteration, where $|\\Omega|$ is the number of observations and is much smaller\nthan mn. Taking the yahoo-music dataset as an example, m = 1M, n = 0.6M, $|\\Omega|$ = 250M,\nand $mn = 600\\text{B} \\gg 250\\text{M} = |\\Omega|$. As a result, the development of efficient algorithms for MIPS is\nneeded in large-scale recommender systems. In addition, MIPS can be found in many other machine\nlearning applications, such as the prediction for a multi-class or multi-label classifier [16, 17], an\nobject detector, a structured SVM predictor, or as a black-box routine to improve the efficiency of\nlearning and inference algorithms [11]. The prediction phase of a neural network could also benefit\nfrom a faster MIPS algorithm: the last layer of an NN is often a dense fully-connected layer, so finding\nthe label with maximum score becomes a MIPS problem with dense vectors [6].\nThere is a recent line of research on accelerating MIPS for large n, such as [2, 3, 9, 12\u201314]. 
However,\nmost of them do not have the flexibility to control the trade-off between search efficiency and\nsearch quality in the prediction phase. In this paper, we consider the budgeted MIPS problem,\nwhich is a generalized version of the standard MIPS with a computation budget: how to generate\na set of top-ranked candidates under a given budget on the number of inner products one can\nperform. By carefully studying the problem structure of MIPS, we develop a novel Greedy-MIPS\nalgorithm, which handles budgeted MIPS by design. While simple and intuitive, Greedy-MIPS yields\nsurprisingly superior performance compared to existing approaches.\nOur Contributions:\n\u2022 We develop Greedy-MIPS, which is a novel algorithm without any nearest neighbor search\nreduction that is essential in many state-of-the-art approaches [2, 12, 14].\n\u2022 We establish a sublinear time theoretical guarantee for Greedy-MIPS under certain assumptions.\n\u2022 Greedy-MIPS is orders of magnitude faster than many state-of-the-art MIPS approaches to\nobtain a desired search performance. As a specific example, on the yahoo-music data set with\nn = 624,961 and k = 200, Greedy-MIPS runs 200x faster than the naive approach and yields\nsearch results with the top-5 precision more than 75%, while the search performance of other\nstate-of-the-art approaches under a similar speedup drops to less than 3% precision.\n\u2022 Greedy-MIPS supports MIPS with a budget, which brings the ability to control the trade-off\nbetween computation efficiency and search quality in the prediction phase.\n\n*Work done while at the University of Texas at Austin.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Existing Approaches for Fast MIPS\nBecause of its wide applicability, several algorithms have been proposed for efficient MIPS. Most of\nthe existing approaches reduce the MIPS problem to the nearest neighbor search problem\n(NNS), where the goal is to identify the nearest candidates of the given query, and apply an existing\nefficient NNS algorithm to solve the reduced problem. [2] is the first MIPS work which adopts\nsuch a MIPS-to-NNS reduction. 
Variants of the MIPS-to-NNS reduction are also proposed in [14, 15].\nExperimental results in [2] show the superiority of the NNS reduction over the traditional\nbranch-and-bound search approaches for MIPS [9, 13]. After the reduction, there are many choices to solve\nthe transformed NNS problem, such as the locality sensitive hashing scheme (LSH-MIPS) considered in\n[12, 14, 15], PCA-tree based approaches (PCA-MIPS) in [2], or K-Means approaches in [1].\nFast MIPS approaches with sampling schemes have become popular recently. Various sampling\nschemes have been proposed to handle MIPS problems with different constraints. The idea of the\nsampling-based MIPS approach was first proposed in [5] as an approach to perform approximate\nmatrix-matrix multiplications. Its applicability to MIPS problems was studied only recently [3].\nThe idea behind a sampling-based approach, called Sample-MIPS, is to design an efficient\nsampling procedure such that the j-th candidate is selected with probability $p(j) \\propto h_j^\\top w$. In\nparticular, Sample-MIPS is an efficient scheme to sample $(j, t) \\in [n] \\times [k]$ with the probability\n$p(j, t) \\propto h_{jt} w_t$. Each time a pair (j, t) is sampled, we increase the count for the j-th item by\none. By the end of the sampling process, the spectrum of the counts forms an estimation of the n inner\nproduct values. Due to the nature of the sampling approach, it can only handle the situation where all\nthe candidate vectors and query vectors are nonnegative.\nDiamond-MSIPS, a diamond sampling scheme proposed in [3], is an extension of Sample-MIPS\nto handle the maximum squared inner product search problem (MSIPS), where the goal is to identify\ncandidate vectors with the largest values of $(h_j^\\top w)^2$. However, the solutions to MSIPS can be very\ndifferent from the solutions to MIPS in general. For example, if all the inner product values are\nnegative, the ordering for MSIPS is exactly the reverse of the ordering induced by MIPS. 
Here we can see\nthat the applicability of both Sample-MIPS and Diamond-MSIPS to MIPS is very limited.\n\n3 Budgeted MIPS\nThe core idea behind the fast approximate MIPS approaches is to trade search quality for\nshorter query latency: the shorter the search latency, the lower the search quality. In most existing fast\nMIPS approaches, the trade-off depends on approach-specific parameters such as the depth of the\nPCA tree in PCA-MIPS or the number of hash functions in LSH-MIPS. Such specific parameters\nare usually required to construct approach-specific data structures before any query is given, which\nmeans that the trade-off is somewhat fixed for all the queries. Thus, the computation cost for a\ngiven query is fixed. However, in many real-world scenarios, each query might have a different\ncomputational budget, which raises the question: Can we design a MIPS approach supporting the\ndynamic adjustment of the trade-off in the query phase?\n3.1 Essential Components for Fast MIPS\nBefore any query request:\n\u2022 Query-Independent Data Structure Construction: A pre-processing procedure is performed on the\nentire candidate set to construct an approach-specific data structure D to store information about\n$\\mathcal{H}$: the LSH hash tables, space partition trees (e.g., KD-tree or PCA-tree), or cluster centroids.\n\nFor each query request:\n\u2022 Query-dependent Pre-processing: In some approaches, a query-dependent pre-processing is needed.\nFor example, a vector augmentation is required in all MIPS-to-NNS approaches. In addition, [2]\nalso requires another normalization. $T_P$ is used to denote the time complexity of this stage.\n\u2022 Candidate Screening: In this stage, based on the pre-constructed data structure D, an efficient\nprocedure is performed to filter candidates such that only a subset of candidates $C(w) \\subset \\mathcal{H}$ is\nselected. 
In a naive linear approach, no screening procedure is performed, so C(w) simply contains\nall the n candidates. For a tree-based structure, C(w) contains all the candidates stored in the leaf\nnode of the query vector. In a sampling-based MIPS approach, an efficient sampling scheme is\ndesigned to generate highly likely candidates to form C(w). $T_S$ denotes the computational cost\nof the screening stage.\n\u2022 Candidate Ranking: An exact ranking is performed on the selected candidates in C(w)\nobtained from the screening stage. This involves the computation of |C(w)| inner products\nand the sorting procedure among these |C(w)| values. The overall time complexity is\n$T_R = O(|C(w)| k + |C(w)| \\log |C(w)|)$.\n\nThe per-query computational cost is\n\n$$T_Q = T_P + T_S + T_R. \\qquad (1)$$\n\nIt is clear that the candidate screening stage is the key component for a fast MIPS approach. In\nterms of the search quality, the performance highly depends on whether the screening procedure can\nidentify highly likely candidates. Regarding the query latency, the efficiency highly depends on the\nsize of C(w) and how quickly C(w) can be generated. The major difference among various MIPS approaches\nis the choice of the data structure D and the screening procedure.\n3.2 Budgeted MIPS: Problem Definition\nBudgeted MIPS is an extension of the standard approximate MIPS problem with a computational\nbudget: how to generate top-ranked candidates under a given budget on the number of inner products\none can perform. Note that the cost for the candidate ranking ($T_R$) is inevitable in the per-query\ncost (1). 
A viable approach for budgeted MIPS must include a screening procedure which satisfies\nthe following requirements:\n\u2022 the flexibility to control the size of C(w) in the candidate screening stage such that $|C(w)| \\le B$,\nwhere B is a given budget, and\n\u2022 an efficient screening procedure to obtain C(w) in O(Bk) time such that $T_Q = O(Bk + B \\log B)$.\nAs mentioned earlier, most recently proposed MIPS-to-NNS algorithms apply various\nsearch space partition data structures or techniques (e.g., LSH, KD-tree, or PCA-tree) designed for\nNNS to index the candidates $\\mathcal{H}$ in the query-independent pre-processing stage. As the construction\nof D is query independent, both the search performance and the computation cost are somewhat\nfixed when the construction is done. For example, the performance of PCA-MIPS depends on\nthe depth of the PCA-tree. Given a query vector w, there is no control over the size of C(w) in the\ncandidate generating phase. LSH-based approaches have a similar issue. Although there might be some\nad-hoc treatments to adjust C(w), it is not clear how to generalize PCA-MIPS and LSH-MIPS in a\nprincipled way to handle the situation with a computational budget: how to reduce the size of C(w)\nunder a limited budget and how to improve the performance when a larger budget is given.\n\nUnlike other NNS-based algorithms, the design of Sample-MIPS naturally enables it to support\nbudgeted MIPS for a nonnegative candidate matrix H and a nonnegative query w. The larger the\nnumber of samples, the lower the variance of the estimated frequency spectrum. Clearly,\nSample-MIPS has the flexibility to control the size of C(w), and thus is a viable approach for the budgeted\nMIPS problem. 
However, Sample-MIPS works only in the situation with nonnegative H and w.\nDiamond-MSIPS has a similar issue.\n4 Greedy-MIPS\nWe carefully study the structure of MIPS and develop a simple but novel algorithm called\nGreedy-MIPS, which handles budgeted MIPS by design. Unlike the recent MIPS-to-NNS approaches,\nGreedy-MIPS is an approach without any reduction to an NNS problem. Moreover, Greedy-MIPS is\na viable approach for the budgeted MIPS problem without the non-negativity limitation inherent in\nthe sampling approaches.\nThe key component for a fast MIPS approach is the algorithm used in the candidate screening phase.\nIn budgeted MIPS, for any given budget B and query w, an ideal procedure for the candidate\nscreening phase costs O(Bk) time to generate C(w) which contains the B items with the largest B\ninner product values over the n candidates in $\\mathcal{H}$. The requirement on the time complexity O(Bk)\nimplies that the procedure is independent of $n = |\\mathcal{H}|$, the number of candidates in $\\mathcal{H}$. One might\nwonder whether such an ideal procedure exists or not. In fact, designing such an ideal procedure with\nthe requirement to generate the largest B items in O(Bk) time is even more challenging than the\noriginal budgeted MIPS problem.\n\nDefinition 1. The rank of an item x among a set of items $X = \\{x_1, \\ldots, x_{|X|}\\}$ is defined as\n\n$$\\mathrm{rank}(x \\mid X) := \\sum_{j=1}^{|X|} I[x_j \\ge x], \\qquad (2)$$\n\nwhere $I[\\cdot]$ is the indicator function. A ranking induced by X is a function $\\pi(\\cdot) : X \\to \\{1, \\ldots, |X|\\}$\nsuch that $\\pi(x_j) = \\mathrm{rank}(x_j \\mid X)$ for all $x_j \\in X$.\nOne way to store a ranking $\\pi(\\cdot)$ induced by X is by a sorted index array s[r] of size |X| such that\n\n$$\\pi(x_{s[1]}) \\le \\pi(x_{s[2]}) \\le \\cdots \\le \\pi(x_{s[|X|]}).$$\n\nWe can see that s[r] stores the index to the item x with $\\pi(x) = r$.\nTo design an efficient candidate screening procedure, we study the operations required for MIPS: In\nthe simple linear MIPS approach, nk multiplication operations are required to obtain the n inner product\nvalues $\\{h_1^\\top w, \\ldots, h_n^\\top w\\}$. We define an implicit matrix $Z \\in \\mathbb{R}^{n \\times k}$ as $Z = H\\,\\mathrm{diag}(w)$, where\n$\\mathrm{diag}(w) \\in \\mathbb{R}^{k \\times k}$ is a matrix with w as its diagonal. The (j, t) entry of Z denotes the multiplication\noperation $z_{jt} = h_{jt} w_t$, and $z_j = \\mathrm{diag}(w) h_j$ denotes the j-th row of Z. In Figure 1, we use $Z^\\top$ to\ndemonstrate the implicit matrix. Note that Z is query dependent, i.e., the values of Z depend on the\nquery vector w, and the n inner product values can be obtained by taking the column-wise summation of\n$Z^\\top$. In particular, for each j we have $h_j^\\top w = \\sum_{t=1}^{k} z_{jt}$, $j = 1, \\ldots, n$. Thus, the ranking induced\nby the n inner product values can be characterized by the marginal ranking $\\pi(j \\mid w)$ defined on the\nimplicit matrix Z as follows:\n\n$$\\pi(j \\mid w) := \\mathrm{rank}\\Big(\\sum_{t=1}^{k} z_{jt} \\;\\Big|\\; \\Big\\{\\sum_{t=1}^{k} z_{1t}, \\cdots, \\sum_{t=1}^{k} z_{nt}\\Big\\}\\Big) = \\mathrm{rank}\\big(h_j^\\top w \\mid \\{h_1^\\top w, \\ldots, h_n^\\top w\\}\\big). \\qquad (3)$$\n\nAs mentioned earlier, it is hard to design an ideal candidate screening procedure generating C(w)\nbased on the marginal ranking. Because the main goal for the candidate screening phase is to\nquickly identify candidates which are highly likely to be top-ranked items, it suffices to have\nan efficient procedure generating C(w) by an approximation ranking. Here we propose a greedy\nheuristic ranking:\n\n$$\\bar{\\pi}(j \\mid w) := \\mathrm{rank}\\Big(\\max_{t=1}^{k} z_{jt} \\;\\Big|\\; \\Big\\{\\max_{t=1}^{k} z_{1t}, \\cdots, \\max_{t=1}^{k} z_{nt}\\Big\\}\\Big), \\qquad (4)$$\n\nwhich is obtained by replacing the summation terms in (3) by max operators. The intuition behind\nthis heuristic is that the largest element of $z_j$ multiplied by k is an upper bound of $h_j^\\top w$:\n\n$$h_j^\\top w = \\sum_{t=1}^{k} z_{jt} \\le k \\max\\{z_{jt} : t = 1, \\ldots, k\\}. \\qquad (5)$$\n\nThus, $\\bar{\\pi}(j \\mid w)$, which is induced by such an upper bound of $h_j^\\top w$, could be a reasonable approximation\nranking for the marginal ranking $\\pi(j \\mid w)$.\n\nNext we design an efficient procedure which generates C(w) according to the ranking $\\bar{\\pi}(j \\mid w)$\ndefined in (4). First, based on the relative orderings of $\\{z_{jt}\\}$, we consider the joint ranking and\nthe conditional ranking defined as follows:\n\u2022 Joint ranking: $\\pi(j, t \\mid w)$ is the exact ranking over the nk entries of Z.\n\n$$\\pi(j, t \\mid w) := \\mathrm{rank}(z_{jt} \\mid \\{z_{11}, \\ldots, z_{nk}\\}).$$\n\n\u2022 Conditional ranking: $\\pi_t(j \\mid w)$ is the exact ranking over the n entries of the t-th row of $Z^\\top$.\n\n$$\\pi_t(j \\mid w) := \\mathrm{rank}(z_{jt} \\mid \\{z_{1t}, \\ldots, z_{nt}\\}).$$\n\nFigure 1: nk multiplications in a naive linear MIPS approach, shown on $Z^\\top = \\mathrm{diag}(w) H^\\top$ with $z_{jt} = h_{jt} w_t$ for all j, t. $\\pi(j, t \\mid w)$: joint ranking. $\\pi_t(j \\mid w)$: conditional ranking. $\\pi(j \\mid w)$: marginal ranking.\n\nSee Figure 1 for an illustration of both rankings. Similar to the marginal ranking, both joint and\nconditional rankings are query dependent.\nObserve that, in (4), for each j, only a single maximum entry of Z, $\\max_{t=1}^{k} z_{jt}$, is considered to\nobtain the ranking $\\bar{\\pi}(j \\mid w)$. 
To generate C(w) based on $\\bar{\\pi}(j \\mid w)$, we can iterate (j, t) entries of Z in\na greedy sequence such that $(j_1, t_1)$ is visited before $(j_2, t_2)$ if $z_{j_1 t_1} > z_{j_2 t_2}$, which is exactly the\nsequence corresponding to the joint ranking $\\pi(j, t \\mid w)$. Each time an entry (j, t) is visited, we can\ninclude the index j into C(w) if $j \\notin C(w)$. In Theorem 1, we show that the sequence to include a\nnewly observed j into C(w) is exactly the sequence induced by the ranking $\\bar{\\pi}(j \\mid w)$ defined in (4).\nTheorem 1. For all $j_1$ and $j_2$ such that $\\bar{\\pi}(j_1 \\mid w) < \\bar{\\pi}(j_2 \\mid w)$, $j_1$ will be included into C(w) before\n$j_2$ if we iterate (j, t) pairs following the sequence induced by the joint ranking $\\pi(j, t \\mid w)$. A proof\ncan be found in Section D.1.\nAt first glance, generating (j, t) in the sequence according to the joint ranking $\\pi(j, t \\mid w)$ might require\nthe access to all the nk entries of Z and cost O(nk) time. In fact, based on Property 1 of conditional\nrankings, we can design an efficient variant of the k-way merge algorithm [8] to generate (j, t) pairs\nin the desired sequence iteratively.\nProperty 1. Given a fixed candidate matrix H, for any possible w with $w_t \\neq 0$, the conditional\nranking $\\pi_t(j \\mid w)$ is either $\\pi_{t+}(j)$ or $\\pi_{t-}(j)$, where $\\pi_{t+}(j) = \\mathrm{rank}(h_{jt} \\mid \\{h_{1t}, \\ldots, h_{nt}\\})$, and\n$\\pi_{t-}(j) = \\mathrm{rank}(-h_{jt} \\mid \\{-h_{1t}, \\ldots, -h_{nt}\\})$. In particular,\n\n$$\\pi_t(j \\mid w) = \\begin{cases} \\pi_{t+}(j) & \\text{if } w_t > 0, \\\\\\\\ \\pi_{t-}(j) & \\text{if } w_t < 0. \\end{cases} \\qquad (6)$$\n\nProperty 1 enables us to characterize a query dependent conditional ranking $\\pi_t(j \\mid w)$ by two query\nindependent rankings $\\pi_{t+}(j)$ and $\\pi_{t-}(j)$. Thus, for each t, we can construct and store a sorted index\narray $s_t[r]$, $r = 1, \\ldots, n$ such that\n\n$$\\pi_{t+}(s_t[1]) \\le \\pi_{t+}(s_t[2]) \\le \\cdots \\le \\pi_{t+}(s_t[n]), \\qquad \\pi_{t-}(s_t[1]) \\ge \\pi_{t-}(s_t[2]) \\ge \\cdots \\ge \\pi_{t-}(s_t[n]). \\qquad (7)$$\n\nThus, in the phase of query-independent data structure construction of Greedy-MIPS, we compute\nand store k query-independent rankings $\\pi_{t+}(\\cdot)$ by k sorted index arrays of length n: $s_t[r]$, $r = 1, \\ldots, n$,\n$t = 1, \\ldots, k$. The entire construction costs O(kn log n) time and O(kn) space.\nNext we describe the details of the proposed Greedy-MIPS algorithm for a given query w and a\nbudget B. Greedy-MIPS utilizes the idea of the k-way merge algorithm to visit (j, t) entries of Z\naccording to the joint ranking $\\pi(j, t \\mid w)$. Designed to merge k sorted sublists into a single sorted\nlist, the k-way merge algorithm uses 1) k pointers, one for each sorted sublist, and 2) a binary tree\nstructure (either a heap or a selection tree) containing the elements pointed by these k pointers to\nobtain the next element to be appended into the sorted list [8].\n4.1 Query-dependent Pre-processing\nWe divide nk entries of (j, t) into k groups. The t-th group contains n entries: $\\{(j, t) : j = 1, \\ldots, n\\}$.\nHere we need an iterator playing a similar role as the pointer which can iterate index $j \\in \\{1, \\ldots, n\\}$\nin the sorted sequence induced by the conditional ranking $\\pi_t(\\cdot \\mid w)$. Utilizing Property 1, the t-th\npre-computed sorted array $s_t[r]$, $r = 1, \\ldots, n$ can be used to construct such an iterator, called\nCondIter, which supports current() to access the currently pointed index j and getNext() to\nadvance the iterator. In Algorithm 1, we describe a pseudo code for CondIter, which utilizes the\nfacts (6) and (7) such that both the construction and the index access cost O(1) space and O(1) time.\nFor each t, we use iters[t] to denote the CondIter for the t-th conditional ranking $\\pi_t(j \\mid w)$.\n\nAlgorithm 1 CondIter: an iterator over $j \\in \\{1, \\ldots, n\\}$ based on the conditional ranking $\\pi_t(j \\mid w)$.\nThis code assumes that the k sorted index arrays $s_t[r]$, $r = 1, \\ldots, n$, $t = 1, \\ldots, k$ are available.\nclass CondIter:\n    def constructor(dim_idx, query_val):\n        t, w, ptr \u2190 dim_idx, query_val, 1\n    def current():\n        return $s_t[\\mathrm{ptr}]$ if w > 0, otherwise $s_t[n - \\mathrm{ptr} + 1]$\n    def hasNext(): return (ptr < n)\n    def getNext():\n        ptr \u2190 ptr + 1 and return current()\n\nAlgorithm 2 Query-dependent pre-processing procedure in Greedy-MIPS.\n\u2022 Input: query $w \\in \\mathbb{R}^k$\n\u2022 For $t = 1, \\ldots, k$\n  - iters[t] \u2190 CondIter(t, $w_t$)\n  - z \u2190 $h_{jt} w_t$, where j = iters[t].current()\n  - Q.push((z, t))\n\u2022 Output:\n  - iters[t], $t \\le k$: iterators for $\\pi_t(\\cdot \\mid w)$.\n  - Q: a max-heap of $\\{(z, t) \\mid z = \\max_{j=1}^{n} z_{jt},\\ \\forall t \\le k\\}$.\n\nRegarding the binary tree structure used in Greedy-MIPS, we consider a max-heap Q of (z, t)\npairs. $z \\in \\mathbb{R}$ is the compared key used to maintain the heap property of Q, and $t \\in \\{1, \\ldots, k\\}$\nis an integer to denote the index to an entry group. Each $(z, t) \\in Q$ denotes the (j, t) entry of Z\nwhere j = iters[t].current() and $z = z_{jt} = h_{jt} w_t$. Note that there are at most k elements in the\nmax-heap at any time. Thus, we can implement Q by a binary heap such that 1) Q.top() returns the\nmaximum pair (z, t) in O(1) time; 2) Q.pop() deletes the maximum pair of Q in O(log k) time; and\n3) Q.push((z, t)) inserts a new pair in O(log k) time. Note that the entire Greedy-MIPS can also be\nimplemented using a selection tree among the k entries pointed by the k iterators. See Section B in\nthe supplementary material for more details.\nIn the query-dependent pre-processing phase, we need to construct iters[t], t = 1, . . . 
, k,\none for each conditional ranking $\\pi_t(j \\mid w)$, and a max-heap Q which is initialized to contain\n$\\{(z, t) \\mid z = \\max_{j=1}^{n} z_{jt},\\ t \\le k\\}$. A detailed procedure is described in Algorithm 2, which costs\nO(k log k) time and O(k) space.\n4.2 Candidate Screening\nThe core idea of Greedy-MIPS is to iteratively traverse (j, t) entries of Z in a greedy sequence and\ncollect newly observed indices j into C(w) until |C(w)| = B. In particular, if $r = \\pi(j, t \\mid w)$, then\nthe (j, t) entry is visited at the r-th iteration. Similar to the k-way merge algorithm, we describe a detailed\nprocedure in Algorithm 3, which utilizes the CondIter in Algorithm 1 to perform the screening.\nRecall both requirements of a viable candidate screening procedure for budgeted MIPS: 1) the\nflexibility to control the size $|C(w)| \\le B$; and 2) an efficient procedure that runs in O(Bk) time. First, it is\nclear that Algorithm 3 has the flexibility to control the size of C(w) by the exiting condition of the\nouter while-loop. Next, to analyze the overall time complexity of Algorithm 3, we need to know the\nnumber of the $z_{jt}$ entries the algorithm iterates before |C(w)| = B. Theorem 2 gives an upper bound\non this number of iterations.\nTheorem 2. There are at least B distinct indices j in the first Bk entries (j, t) in terms of the joint\nranking $\\pi(j, t \\mid w)$ for any w; that is,\n\n$$|\\{j \\mid \\forall (j, t) \\text{ such that } \\pi(j, t \\mid w) \\le Bk\\}| \\ge B. \\qquad (8)$$\n\nA detailed proof can be found in Section D of the supplementary material. Note that there are\nsome O(log k) time operations within both the outer and inner while loops, such as Q.push((z, t))\nand Q.pop(). As the goal of the screening procedure is to identify j indices only, we can skip the\nQ.push(($z_{jt}$, t)) for an entry (j, t) whose index j has already been included in C(w). As a result, we can\nguarantee that Q.pop() is executed at most B + k \u2212 1 times when |C(w)| = B. 
The extra k \u2212 1 times\noccur in the situation that\n\niters[1].current() = \u00b7\u00b7\u00b7 = iters[k].current()\n\nat the beginning of the entire screening procedure.\n\nAlgorithm 3 An improved candidate screening procedure in Greedy-MIPS. The time complexity is O(Bk).\n\u2022 Input:\n  - H, w, and the computational budget B\n  - Q and iters[t]: output of Algorithm 2\n  - C(w): an empty list\n  - visited[j] = 0, $\\forall j \\le n$: a zero-initialized array.\n\u2022 While |C(w)| < B:\n  - (z, t) \u2190 Q.pop()    \u00b7\u00b7\u00b7 O(log k)\n  - j \u2190 iters[t].current()\n  - If visited[j] = 0:\n    * append j into C(w) and visited[j] \u2190 1\n  - While iters[t].hasNext():\n    * j \u2190 iters[t].getNext()\n    * if visited[j] = 0:\n      + z \u2190 $h_{jt} w_t$ and Q.push((z, t))    \u00b7\u00b7\u00b7 O(log k)\n      + break\n\u2022 visited[j] \u2190 0, $\\forall j \\in C(w)$    \u00b7\u00b7\u00b7 O(B)\n\u2022 Output: $C(w) = \\{j \\mid \\bar{\\pi}(j \\mid w) \\le B\\}$\n\nTo check whether an index j is in the current C(w) in O(1) time, we use an auxiliary\nzero-initialized array of length n: visited[j], $j = 1, \\ldots, n$, to denote\nwhether an index j has been included in C(w) or not. As C(w) contains at most B\nindices, only B elements of this auxiliary array will be modified during the screening\nprocedure. Furthermore, the auxiliary array can be reset to zero using O(B) time\nat the end of Algorithm 3, so this auxiliary array can be utilized again for a different\nquery vector w. Notice that Algorithm 3 still iterates Bk entries of Z but\nat most B + k \u2212 1 entries will be pushed into or popped from the max-heap Q. Thus, the\noverall time complexity of Algorithm 3 is O(Bk + (B + k) log k) = O(Bk), which\nmakes Greedy-MIPS a viable budgeted MIPS approach.\n4.3 Connection to Sampling Approaches\nSample-MIPS, as mentioned earlier, is essentially a sampling-with-replacement scheme to\ndraw entries of Z such that (j, t) is sampled with the probability proportional to $z_{jt}$. 
Thus, Sample-MIPS\ncan be thought of as a traversal of (j, t) entries in a stratified random sequence determined\nby a distribution of the values of $\\{z_{jt}\\}$, while the core idea of Greedy-MIPS is to iterate (j, t) entries\nof Z in a greedy sequence induced by the ordering of $\\{z_{jt}\\}$.\nNext, we discuss the differences of Greedy-MIPS from Sample-MIPS and Diamond-MSIPS.\nSample-MIPS can only be applied to the situation where both H and w are nonnegative because of the\nnature of the sampling scheme. In contrast, Greedy-MIPS can work on any MIPS problem, as only the\nordering of $\\{z_{jt}\\}$ matters in Greedy-MIPS. Instead of $h_j^\\top w$, Diamond-MSIPS is designed for the\nMSIPS problem, which is to identify candidates with the largest $(h_j^\\top w)^2$ or $|h_j^\\top w|$ values. In fact, for\nnonnegative MIPS problems, the diamond sampling is equivalent to Sample-MIPS. Moreover, for\nMSIPS problems with negative entries, when the number of samples is set to be the budget B,\u00b2\nDiamond-MSIPS is equivalent to applying Sample-MIPS to sample (j, t) entries with the probability\n$p(j, t) \\propto |z_{jt}|$. Thus, the applicability of the existing sampling-based approaches remains limited for\ngeneral MIPS problems.\n4.4 Theoretical Guarantee\nGreedy-MIPS is an algorithm based on a greedy heuristic ranking (4). Similar to the analysis of\nQuicksort, we study the average complexity of Greedy-MIPS by assuming a distribution of the input\ndataset. For simplicity, our analysis is performed on a stochastic implicit matrix Z instead of w. Each\nentry in Z is assumed to follow a uniform distribution uniform(a, b). We establish Theorem 3 to\nprove that the number of entries (j, t) iterated by Greedy-MIPS to include the index to the largest\ncandidate is sublinear in $n = |\\mathcal{H}|$ with a high probability when n is large enough.\nTheorem 3. Assume that all the entries $z_{jt}$ are drawn from a uniform distribution uniform(a, b).\nLet $j^*$ be the index to the largest candidate (i.e., $\\pi(j^* \\mid Z) = 1$). 
With high probability, we have\n$\\bar{\\pi}(j^* \\mid Z) \\le O(k \\log(n)\\, n^{1/k})$. A detailed proof can be found in the supplementary material.\nNotice that establishing theoretical guarantees for approximate MIPS is challenging even for randomized\nalgorithms. For example, the analysis for Diamond-MSIPS in [3] requires nonnegative assumptions and\nonly works on MSIPS (max-squared-inner-product search) problems instead of MIPS problems.\n\n\u00b2 This setting is used in the experiments in [3].\n\n5 Experimental Results\nIn this section, we perform extensive empirical comparisons to compare Greedy-MIPS with other\nstate-of-the-art fast MIPS approaches on both real-world and synthetic datasets: We use netflix and\nyahoo-music as our real-world recommender system datasets. There are 17,770 and 624,961 items\nin netflix and yahoo-music, respectively. In particular, we obtain the user embeddings $\\{w_i\\} \\in \\mathbb{R}^k$\nand item embeddings $h_j \\in \\mathbb{R}^k$ by the standard low-rank matrix factorization [4] with $k \\in \\{50, 200\\}$.\nWe also generate synthetic datasets with various $n = 2^{\\{17,18,19,20\\}}$ and $k = 2^{\\{2,5,7,10\\}}$. For each\nsynthetic dataset, both the candidate vectors $h_j$ and the query vector w are drawn from the normal distribution.\n\nFigure 2: MIPS comparison on netflix and yahoo-music.\n\nFigure 3: MIPS comparison on synthetic datasets with $n \\in 2^{\\{17,18,19,20\\}}$ and k = 128. The datasets\nused to generate results are created with each entry drawn from a normal distribution.\n\nFigure 4: MIPS comparison on synthetic datasets with $n = 2^{18}$ and $k \\in 2^{\\{2,5,7,10\\}}$. The datasets\nused to generate results are created with each entry drawn from a normal distribution.\n\n5.1 Experimental Settings\nTo have fair comparisons, all the compared approaches are implemented in C++.\n\u2022 Greedy-MIPS: our proposed approach in Section 4.\n\u2022 PCA-MIPS: the approach proposed in [2]. We vary the depth of the PCA tree to control the trade-off.\n\u2022 LSH-MIPS: the approach proposed in [12, 14]. 
We use the nearest neighbor transform function\nproposed in [2, 12] and use the random projection scheme as the LSH function as suggested in\n[12]. We also implement the standard amplification procedure with an OR-construction of b hyper\nLSH hash functions. Each hyper LSH function is a result of an AND-construction of a random\nprojections. We vary values (a, b) to control the trade-off.\n\u2022 Diamond-MSIPS: the sampling scheme proposed in [3] for the maximum squared inner product\nsearch. As it shows better performance than LSH-MIPS in [3] on MIPS problems, we\nalso include Diamond-MSIPS in our comparison.\n\u2022 Naive-MIPS: the baseline approach which applies a linear search to identify the exact top-K\ncandidates.\nEvaluation Criteria. For each dataset, the actual top-20 items for each query are regarded as the\nground truth. We report the average performance on 2,000 randomly selected query vectors. To\nevaluate the search quality, we use the precision on the top-P prediction (prec@P), obtained by\nselecting top-P items from C(w) returned by the candidate screening procedure. Results with P = 5\nare shown in the paper, while more results with various P are in the supplementary material. To\nevaluate the search efficiency, we report the relative speedups over the Naive-MIPS approach:\n\nspeedup = (prediction time required by Naive-MIPS) / (prediction time by a compared approach).\n\nRemarks on Budgeted MIPS versus Non-Budgeted MIPS. As mentioned in Section 3, PCA-MIPS\nand LSH-MIPS cannot handle MIPS with a budget. Both the search computation cost and\nthe search quality are fixed when the corresponding data structure is constructed. As a result, to\nunderstand the trade-off between search efficiency and search quality for these two approaches, we\ncan only try various values for their parameters (such as the depth for the PCA tree and the amplification\nparameters (a, b) for LSH). 
For each combination of parameters, we need to re-run the entire\nquery-independent pre-processing procedure to construct a new data structure.\nRemarks on data structure construction. Note that the time complexity of the construction\nfor Greedy-MIPS is O(kn log n), which is on par with O(kn) for Diamond-MSIPS, and faster\nthan O(knab) for LSH-MIPS and O(k^2 n) for PCA-MIPS. As an example, the construction for\nGreedy-MIPS only takes around 10 seconds on yahoo-music with n = 624,961 and k = 200.\n5.2 Experimental Results\nResults on Real-World Data Sets. Comparison results for netflix and yahoo-music are shown\nin Figure 2. The first, second, and third columns present the results with k = 50 and k = 200,\nrespectively. It is clearly observed that given a fixed speedup, Greedy-MIPS yields predictions with\nmuch higher search quality. In particular, on the yahoo-music data set with k = 200, Greedy-MIPS\nruns 200x faster than Naive-MIPS and yields search results with p@5 = 70%, while none of PCA-MIPS,\nLSH-MIPS, and Diamond-MSIPS can achieve p@5 > 10% while maintaining a similar\n200x speedup.\nResults on Synthetic Data Sets. We also perform comparisons on synthetic datasets. The comparison\nwith various n \u2208 2^{17,18,19,20} is shown in Figure 3, while the comparison with various\nk \u2208 2^{2,5,7,10} is shown in Figure 4. We observe that the performance gap between Greedy-MIPS\nand other approaches remains when n increases, while the gap becomes smaller when k increases.\nHowever, Greedy-MIPS still outperforms other approaches significantly.\n6 Conclusions and Future Work\nIn this paper, we develop a novel Greedy-MIPS algorithm, which has the flexibility to handle\nbudgeted MIPS, and yields surprisingly superior performance compared to state-of-the-art approaches.\nThe current implementation focuses on MIPS with dense vectors, while in the future we plan to\nimplement our algorithm also for high dimensional sparse vectors. 
We also establish a theoretical\nguarantee for Greedy-MIPS based on the assumption that data are generated from a random\ndistribution. How to relax the assumption or how to design a nondeterministic pre-processing step for\nGreedy-MIPS to satisfy the assumption are interesting future directions of this work.\nAcknowledgements\nThis research was supported by NSF grants CCF-1320746, IIS-1546452 and CCF-1564000. CJH\nwas supported by NSF grant RI-1719097.\nReferences\n[1] Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, and Yoshua Bengio. Clustering\nis efficient for approximate maximum inner product search, 2016. arXiv preprint\narXiv:1507.05910.\n\n[2] Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein,\nNir Nice, and Ulrich Paquet. Speeding up the xbox recommender system using a euclidean\ntransformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender\nSystems, pages 257\u2013264, 2014.\n\n[3] Grey Ballard, Seshadhri Comandur, Tamara Kolda, and Ali Pinar. Diamond sampling for\napproximate maximum all-pairs dot-product (MAD) search. In Proceedings of the IEEE\nInternational Conference on Data Mining, 2015.\n\n[4] Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. A learning-rate schedule\nfor stochastic gradient methods to matrix factorization. In Proceedings of the Pacific-Asia\nConference on Knowledge Discovery and Data Mining (PAKDD), 2015.\n\n[5] Edith Cohen and David D. Lewis. Approximating matrix multiplication for pattern recognition\ntasks. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms,\npages 682\u2013691, 1997.\n\n[6] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations.\nIn Proceedings of the 10th ACM Conference on Recommender Systems, 2016.\n\n[7] Gideon Dror, Noam Koenigstein, Yehuda Koren, and Markus Weimer. The Yahoo! music\ndataset and KDD-Cup\u201911. 
In JMLR Workshop and Conference Proceedings: Proceedings of\nKDD Cup 2011 Competition, volume 18, pages 3\u201318, 2012.\n\n[8] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching.\nAddison-Wesley, 2nd edition, 1998.\n\n[9] Noam Koenigstein, Parikshit Ram, and Yuval Shavitt. Efficient retrieval of recommendations in\na matrix factorization framework. In Proceedings of the 21st ACM International Conference on\nInformation and Knowledge Management, CIKM \u201912, pages 535\u2013544, 2012.\n\n[10] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for\nrecommender systems. IEEE Computer, 42:30\u201337, 2009.\n\n[11] Stephen Mussmann and Stefano Ermon. Learning and inference via maximum inner product\nsearch. In Proceedings of the 33rd International Conference on\nMachine Learning, 2016.\n\n[12] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product\nsearch. In Proceedings of the International Conference on Machine Learning, pages 1926\u20131934,\n2015.\n\n[13] Parikshit Ram and Alexander G. Gray. Maximum inner-product search using cone trees. In\nProceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and\nData Mining, pages 931\u2013939, 2012.\n\n[14] Anshumali Shrivastava and Ping Li. Asymmetric lsh (ALSH) for sublinear time maximum inner\nproduct search (MIPS). In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q.\nWeinberger, editors, Advances in Neural Information Processing Systems 27, pages 2321\u20132329,\n2014.\n\n[15] Anshumali Shrivastava and Ping Li. Improved asymmetric locality sensitive hashing (ALSH)\nfor maximum inner product search (MIPS). In Proceedings of the Thirty-First Conference on\nUncertainty in Artificial Intelligence (UAI), pages 812\u2013821, 2015.\n\n[16] Jason Weston, Samy Bengio, and Nicolas Usunier. 
Large scale image annotation: learning to\nrank with joint word-image embeddings. Mach. Learn., 81(1):21\u201335, October 2010.\n\n[17] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label\nlearning with missing labels. In Proceedings of the International Conference on Machine\nLearning, pages 593\u2013601, 2014.\n", "award": [], "sourceid": 2813, "authors": [{"given_name": "Hsiang-Fu", "family_name": "Yu", "institution": "U Texas"}, {"given_name": "Cho-Jui", "family_name": "Hsieh", "institution": "UC Davis"}, {"given_name": "Qi", "family_name": "Lei", "institution": "Institute for Computational Engineering and Sciences, University of Texas at Austin"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas at Austin"}]}