{"title": "Efficient Thompson Sampling for Online \ufffcMatrix-Factorization Recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 1297, "page_last": 1305, "abstract": "Matrix factorization (MF) collaborative filtering is an effective and widely used method in recommendation systems. However, the problem of finding an optimal trade-off between exploration and exploitation (otherwise known as the bandit problem), a crucial problem in collaborative filtering from cold-start, has not been previously addressed.In this paper, we present a novel algorithm for online MF recommendation that automatically combines finding the most relevantitems with exploring new or less-recommended items.Our approach, called Particle Thompson Sampling for Matrix-Factorization, is based on the general Thompson sampling framework, but augmented with a novel efficient online Bayesian probabilistic matrix factorization method based on the Rao-Blackwellized particle filter.Extensive experiments in collaborative filtering using several real-world datasets demonstrate that our proposed algorithm significantly outperforms the current state-of-the-arts.", "full_text": "Ef\ufb01cient Thompson Sampling for Online\nMatrix-Factorization Recommendation\n\nJaya Kawale, Hung Bui, Branislav Kveton\n\nAdobe Research\n\nSan Jose, CA\n\n{kawale, hubui, kveton}@adobe.com\n\nLong Tran Thanh\n\nUniversity of Southampton\n\nSouthampton, UK\n\nltt08r@ecs.soton.ac.uk\n\nSanjay Chawla\n\nQatar Computing Research Institute, Qatar\n\nUniversity of Sydney, Australia\n\nsanjay.chawla@sydney.edu.au\n\nAbstract\n\nMatrix factorization (MF) collaborative \ufb01ltering is an effective and widely used\nmethod in recommendation systems. However, the problem of \ufb01nding an optimal\ntrade-off between exploration and exploitation (otherwise known as the bandit\nproblem), a crucial problem in collaborative \ufb01ltering from cold-start, has not been\npreviously addressed. 
In this paper, we present a novel algorithm for online MF recommendation that automatically combines finding the most relevant items with exploring new or less-recommended items. Our approach, called Particle Thompson sampling for MF (PTS), is based on the general Thompson sampling framework, but augmented with a novel, efficient online Bayesian probabilistic matrix factorization method based on the Rao-Blackwellized particle filter. Extensive experiments in collaborative filtering using several real-world datasets demonstrate that PTS significantly outperforms the current state of the art.

1 Introduction

Matrix factorization (MF) techniques have emerged as a powerful tool to perform collaborative filtering in large datasets [1]. These algorithms decompose a partially observed matrix R ∈ R^{N×M} into a product of two smaller matrices, U ∈ R^{N×K} and V ∈ R^{M×K}, such that R ≈ UV^T. A variety of MF-based methods have been proposed in the literature and have been successfully applied to various domains. Despite their promise, one of the challenges faced by these methods is making recommendations when a new user or item arrives in the system, also known as the cold-start problem. Another challenge is recommending items in an online setting and quickly adapting to user feedback, as required by many real-world applications including online advertising, serving personalized content, link prediction, and product recommendation.

In this paper, we address these two challenges in the problem of online low-rank matrix completion by combining matrix completion with bandit algorithms. This setting was introduced in previous work [2], but ours is the first satisfactory solution to this problem. In a bandit setting, we can model the problem as a repeated game where the environment chooses row i of R and the learning agent chooses column j.
The value R_ij is then revealed, and the goal of the learning agent is to minimize the cumulative regret with respect to the optimal solution, the highest entry in each row of R. The key design principle in a bandit setting is to balance exploration and exploitation, which solves the cold-start problem naturally. For example, in online advertising, exploration implies presenting new ads, about which little is known, and observing the subsequent feedback, while exploitation entails serving ads which are known to attract a high click-through rate.

While many solutions have been proposed for bandit problems, in the last five years or so there has been renewed interest in the use of Thompson sampling (TS), which was originally proposed in 1933 [3, 4]. In addition to having competitive empirical performance, TS is attractive due to its conceptual simplicity. An agent has to choose an action a (column) from a set of available actions so as to maximize the reward r, but it does not know with certainty which action is optimal. Following TS, the agent selects a with the probability that a is the best action. Let θ denote the unknown parameter governing the reward structure, and O_{1:t} the history of observations currently available to the agent. The agent chooses a* = a with probability

  ∫ I[ E[r | a, θ] = max_{a'} E[r | a', θ] ] P(θ | O_{1:t}) dθ,

which can be implemented by simply sampling θ from the posterior P(θ | O_{1:t}) and letting a* = argmax_{a'} E[r | a', θ]. However, for many realistic scenarios (including matrix completion), sampling from P(θ | O_{1:t}) is not computationally efficient, and thus recourse to approximate methods is required to make TS practical.

We propose a computationally efficient algorithm for solving our problem, which we call Particle Thompson sampling for matrix factorization (PTS).
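Before turning to PTS, the generic posterior-sampling step of TS just described can be illustrated with a toy Gaussian bandit. Everything below (three arms, independent N(0, 1) priors, unit observation noise) is our own illustrative choice, not part of the paper's algorithm:

```python
import numpy as np

def thompson_step(mu, prec, rng):
    """Draw one posterior sample per arm and act greedily on the sample."""
    theta = rng.normal(mu, 1.0 / np.sqrt(prec))
    return int(np.argmax(theta))

def update(mu, prec, a, r, noise_prec=1.0):
    """Conjugate Gaussian update of arm a's posterior after observing reward r."""
    new_prec = prec[a] + noise_prec
    mu[a] = (prec[a] * mu[a] + noise_prec * r) / new_prec
    prec[a] = new_prec

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # unknown to the agent
mu, prec = np.zeros(3), np.ones(3)       # independent N(0, 1) priors per arm
picks = np.zeros(3, dtype=int)
for t in range(2000):
    a = thompson_step(mu, prec, rng)
    r = rng.normal(true_means[a], 1.0)
    update(mu, prec, a, r)
    picks[a] += 1
```

Because actions are chosen by sampling from the posterior rather than by maximizing a point estimate, under-explored arms retain a non-zero chance of being tried, which is exactly the exploration/exploitation balance discussed above.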
PTS is a combination of particle filtering for online Bayesian parameter estimation and TS in the non-conjugate case, when the posterior does not have a closed form. Particle filtering uses a set of weighted samples (particles) to estimate the posterior density. In order to overcome the problem of the huge parameter space, we utilize Rao-Blackwellization and design a suitable Monte Carlo kernel to obtain a computationally and statistically efficient way to update the set of particles as new data arrives in an online fashion. Unlike the prior work [2], which approximates the posterior of the latent item features by a single point estimate, our approach maintains a much better approximation of the posterior over the latent features via a diverse set of particles. Our results on five different real datasets show a substantial improvement in cumulative regret vis-a-vis other online methods.

2 Probabilistic Matrix Factorization

We first review the probabilistic matrix factorization approach to the low-rank matrix completion problem. In matrix completion, a portion R^o of the N × M matrix R = (r_ij) is observed, and the goal is to infer the unobserved entries of R. In probabilistic matrix factorization (PMF) [5], R is assumed to be a noisy perturbation of a rank-K matrix R̄ = UV^T, where U_{N×K} and V_{M×K} are termed the user and item latent features (K is typically small). The full generative model of PMF is

  U_i  i.i.d. ∼ N(0, σ_u² I_K)
  V_j  i.i.d. ∼ N(0, σ_v² I_K)                (1)
  r_ij | U, V  i.i.d. ∼ N(U_i^T V_j, σ²)

Figure 1: Graphical model of the probabilistic matrix factorization model.

where the variances (σ², σ_u², σ_v²) are the parameters of the model. We also consider a full Bayesian treatment where the variances σ_u² and σ_v² are drawn from an inverse Gamma prior (while σ² is held fixed), i.e., λ_U = σ_u^{−2} ∼ Γ(α, β) and λ_V = σ_v^{−2} ∼ Γ(α, β) (this is a special case of Bayesian PMF [6] where we only consider isotropic Gaussians)¹. Given this generative model, from the observed ratings R^o we would like to estimate the parameters U and V, which will allow us to "complete" the matrix R. PMF is a MAP point estimate which finds U, V maximizing Pr(U, V | R^o, σ, σ_U, σ_V) via (stochastic) gradient ascent (alternating least squares can also be used [1]). Bayesian PMF [6] attempts to approximate the full posterior Pr(U, V | R^o, σ, α, β). The joint posterior of U and V is intractable; however, the structure of the graphical model (Fig. 1) can be exploited to derive an efficient Gibbs sampler.

We now provide the expressions for the conditional probabilities of interest. Suppose that V and σ_U are known. Then the vectors U_i are independent for each user i. Let rts(i) = {j | r_ij ∈ R^o} be the set of items rated by user i, and observe that the ratings {r^o_ij | j ∈ rts(i)} are generated i.i.d. from U_i following a simple conditional linear Gaussian model.

¹[6] considers the full covariance structure, but they also noted that isotropic Gaussians are effective enough.
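As a quick sanity check, the generative model in Eq. (1) can be simulated directly. The sizes, rank, and variances below are arbitrary illustrative choices:

```python
import numpy as np

def sample_pmf(N=50, M=40, K=2, sigma_u=1.0, sigma_v=1.0, sigma=0.5, seed=0):
    """Draw (U, V, R) from the PMF generative model of Eq. (1)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma_u, size=(N, K))          # user latent features
    V = rng.normal(0.0, sigma_v, size=(M, K))          # item latent features
    R = U @ V.T + rng.normal(0.0, sigma, size=(N, M))  # noisy ratings
    return U, V, R

U, V, R = sample_pmf()
```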
Thus, the posterior of U_i has the closed form

  Pr(U_i | V, R^o, σ, σ_U) = Pr(U_i | V_{rts(i)}, R^o_{i,rts(i)}, σ_U, σ) = N(U_i | μ^u_i, (Λ^u_i)^{−1})                (2)

where

  μ^u_i = (Λ^u_i)^{−1} ζ^u_i;   Λ^u_i = (1/σ²) Σ_{j∈rts(i)} V_j V_j^T + (1/σ_u²) I_K;   ζ^u_i = (1/σ²) Σ_{j∈rts(i)} r^o_ij V_j.                (3)

The conditional posterior of V, Pr(V | U, R^o, σ_V, σ), is similarly factorized into Π_{j=1}^M N(V_j | μ^v_j, (Λ^v_j)^{−1}), where the mean and precision are similarly defined. The posterior of the precision λ_U = σ_U^{−2} given U (and similarly for λ_V) is obtained from the conjugacy of the Gamma prior and the isotropic Gaussian:

  Pr(λ_U | U, α, β) = Γ(λ_U | NK/2 + α, (1/2)‖U‖²_F + β).                (4)

Although not required for Bayesian PMF, we give the likelihood expression

  Pr(R_ij = r | V, R^o, σ_U, σ) = N(r | V_j^T μ^u_i, σ² + V_j^T (Λ^u_i)^{−1} V_j).                (5)

The advantage of the Bayesian approach is that the uncertainty of the estimates of U and V is available, which is crucial for exploration in a bandit setting. However, the bandit setting requires maintaining online estimates of the posterior as the ratings arrive over time, which makes it rather awkward for MCMC. In this paper, we instead employ a sequential Monte Carlo (SMC) method for online Bayesian inference [7, 8].
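The Gaussian posterior of Eqs. (2)-(3) and the predictive distribution of Eq. (5) amount to a small ridge-regression-style computation per user; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def posterior_user(V_rated, r_rated, sigma=0.5, sigma_u=1.0):
    """Posterior N(mu, Lambda^{-1}) of one user's features, per Eqs. (2)-(3)."""
    K = V_rated.shape[1]
    Lam = (V_rated.T @ V_rated) / sigma**2 + np.eye(K) / sigma_u**2  # precision
    zeta = (V_rated.T @ r_rated) / sigma**2
    mu = np.linalg.solve(Lam, zeta)
    return mu, Lam

def predictive(v_j, mu, Lam, sigma=0.5):
    """Mean and variance of a new rating on item j, per Eq. (5)."""
    return v_j @ mu, sigma**2 + v_j @ np.linalg.solve(Lam, v_j)

rng = np.random.default_rng(1)
V = rng.normal(size=(8, 2))                                 # 8 rated items, rank 2
r = V @ np.array([0.5, -1.0]) + rng.normal(0.0, 0.5, size=8)
mu, Lam = posterior_user(V, r)
mean, var = predictive(V[0], mu, Lam)
```

Note how the predictive variance in Eq. (5) inflates the noise variance σ² by the posterior uncertainty of U_i, which is what later drives exploration.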
Similar to the Gibbs sampler [6], we exploit the above closed-form updates to design an efficient Rao-Blackwellized particle filter [9] for maintaining the posterior over time.

3 Matrix-Factorization Recommendation Bandit

In a typical deployed recommendation system, users and observed ratings (also called rewards) arrive over time, and the task of the system is to recommend an item for each user so as to maximize the accumulated expected reward. The bandit setting arises from the fact that the system needs to learn over time which items have the best ratings (for a given user) to recommend, and at the same time sufficiently explore all the items.

We formulate the matrix factorization bandit as follows. We assume that ratings are generated following Eq. (1) with fixed but unknown latent features (U*, V*). At time t, the environment chooses user i_t and the system (learning agent) needs to recommend an item j_t. The user then rates the recommended item with rating r_{i_t,j_t} ∼ N(U*_{i_t}^T V*_{j_t}, σ²), and the agent receives this rating as a reward. We abbreviate this as r^o_t = r_{i_t,j_t}. The system recommends item j_t using a policy that takes into account the history of the observed ratings prior to time t, r^o_{1:t−1}, where r^o_{1:t} = {(i_k, j_k, r^o_k)}_{k=1}^t. The highest expected reward the system can earn at time t is max_j U*_{i_t}^T V*_j, and this is achieved if the optimal item j*(i_t) = argmax_j U*_{i_t}^T V*_j is recommended. Since (U*, V*) are unknown, the optimal item j*(i) is also not known a priori. The quality of the recommendation system is measured by its expected cumulative regret:

  CR = E[ Σ_{t=1}^n (r_{i_t,j*(i_t)} − r^o_t) ] = E[ Σ_{t=1}^n (max_j U*_{i_t}^T V*_j − r^o_t) ]                (6)

where the expectation is taken with respect to the choice of the user at time t and also the randomness in the choice of the recommended items by the algorithm.

3.1 Particle Thompson Sampling for Matrix Factorization Bandit

While it is difficult to optimize the cumulative regret directly, TS has been shown to work well in practice for contextual linear bandits [3]. To use TS for the matrix factorization bandit, the main difficulty is to incrementally update the posterior of the latent features (U, V) which control the reward structure. In this subsection, we describe an efficient Rao-Blackwellized particle filter (RBPF) designed to exploit the specific structure of the probabilistic matrix factorization model. Let θ = (σ, α, β) be the control parameters and let the posterior at time t be p_t = Pr(U, V, σ_U, σ_V | r^o_{1:t}, θ).
The standard particle filter would sample all of the parameters (U, V, σ_U, σ_V). Unfortunately, in our experiments, degeneracy is highly problematic for such a vanilla particle filter (PF), even when σ_U and σ_V are assumed known (see Fig. 4(b)). Our RBPF algorithm maintains the posterior distribution p_t as follows. Each particle conceptually represents a point mass at (V, σ_U); U and σ_V are integrated out analytically whenever possible². Thus, p_t(V, σ_U) is approximated by p̂_t = (1/D) Σ_{d=1}^D δ_{(V^(d), σ_U^(d))}, where D is the number of particles.

Crucially, since the particle filter needs to estimate a set of non-time-varying parameters, an effective and efficient MCMC kernel move K_t(V', σ'_U; V, σ_U), stationary w.r.t. p_t, is essential. Our design of the move kernel K_t is based on two observations. First, we can make use of U and σ_V as auxiliary variables, effectively sampling U, σ_V | V, σ_U ∼ p_t(U, σ_V | V, σ_U) and then V', σ'_U | U, σ_V ∼ p_t(V', σ'_U | U, σ_V). However, this move would be highly inefficient due to the number of variables that need to be sampled at each update. Our second observation is the key to an efficient implementation. Note that the latent features of all users except the current user i_t, denoted U_{−i_t}, are independent of the current observed rating r^o_t: p_t(U_{−i_t} | V, σ_U) = p_{t−1}(U_{−i_t} | V, σ_U); therefore, at time t we only have to resample U_{i_t}, as there is no need to resample U_{−i_t}. Furthermore, it suffices to resample the latent feature of the current item, V_{j_t}. This leads to an efficient implementation of the RBPF in which each particle in fact stores³ U, V, σ_U, σ_V, where (U, σ_V) are auxiliary variables, and for the kernel move K_t we sample U_{i_t} | V, σ_U, then V'_{j_t} | U, σ_V, and σ'_U | U, α, β.

The PTS algorithm is given in Algo. 1.

Algorithm 1 Particle Thompson Sampling for Matrix Factorization (PTS)
Global control params: σ, σ_U, σ_V; for the Bayesian version (PTS-B): σ, α, β
 1: p̂_0 ← InitializeParticles()    ▷ p̂ has the structure (w, particles), where particles[d] = (U^(d), V^(d), σ_U^(d), σ_V^(d))
 2: R^o ← ∅
 3: for t = 1, 2, . . . do
 4:   i ← current user
 5:   Sample d ∼ p̂_{t−1}.w
 6:   Ṽ ← p̂_{t−1}.V^(d)
 7:   [If PTS-B] σ̃_U ← p̂_{t−1}.σ_U^(d)
 8:   Sample Ũ_i ∼ Pr(U_i | Ṽ, σ̃_U, σ, r^o_{1:t−1})    ▷ sample a new U_i due to Rao-Blackwellization
 9:   ĵ ← argmax_j Ũ_i^T Ṽ_j
10:   Recommend ĵ for user i and observe rating r
11:   r^o_t ← (i, ĵ, r)
12:   p̂_t ← UpdatePosterior(p̂_{t−1}, r^o_{1:t})
13: end for
14: procedure UpdatePosterior(p̂, r^o_{1:t})
15:   (i, j, r) ← r^o_t
16:   ∀d: Λ^u_i(d) ← Λ^u_i(V^(d), r^o_{1:t−1}); ζ^u_i(d) ← ζ^u_i(V^(d), r^o_{1:t−1})    ▷ see Eq. (3)
17:   ∀d: w_d ∝ Pr(R_ij = r | V^(d), σ_U^(d), σ, r^o_{1:t−1}), with Σ_d w_d = 1    ▷ reweighting; see Eq. (5)
18:   ∀d: sample i' ∼ p̂.w; p̂'.particles[d] ← p̂.particles[i']; p̂'.w_d ← 1/D    ▷ resampling
19:   for all d do    ▷ move
20:     Λ^u_i(d) ← Λ^u_i(d) + (1/σ²) V_j V_j^T
21:     ζ^u_i(d) ← ζ^u_i(d) + r V_j
22:     p̂'.U_i^(d) ∼ Pr(U_i | p̂'.V^(d), p̂'.σ_U^(d), σ, r^o_{1:t})    ▷ see Eq. (2)
23:     [If PTS-B] Update the norm of p̂'.U^(d)
24:     Λ^v_j(d) ← Λ^v_j(p̂'.U^(d), r^o_{1:t}); ζ^v_j(d) ← ζ^v_j(p̂'.U^(d), r^o_{1:t})
25:     p̂'.V_j^(d) ∼ Pr(V_j | p̂'.U^(d), p̂'.σ_V^(d), σ, r^o_{1:t})
26:     [If PTS-B] p̂'.σ_U^(d) ∼ Pr(σ_U | p̂'.U^(d), α, β)    ▷ see Eq. (4)
27:   end for
28:   return p̂'
29: end procedure
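As a rough illustration of the overall loop, here is a deliberately stripped-down particle Thompson sampling simulation. It is not Algorithm 1: it is rank-1, keeps a full (u, v) sample in every particle instead of Rao-Blackwellizing, assumes the noise level is known, and replaces the Gibbs-style move kernel with a crude jitter. All constants are illustrative.

```python
import numpy as np

def pts_demo(R_true, T=3000, D=30, sigma=0.5, seed=0):
    """Toy particle Thompson sampling on a known rank-1 rating matrix."""
    rng = np.random.default_rng(seed)
    N, M = R_true.shape
    u = rng.normal(0.0, 1.0, size=(D, N))         # per-particle user factors
    v = rng.normal(0.0, 1.0, size=(D, M))         # per-particle item factors
    regret = 0.0
    for t in range(T):
        i = rng.integers(N)                        # environment picks a user
        d = rng.integers(D)                        # Thompson step: pick a particle
        j = int(np.argmax(u[d, i] * v[d]))         # act greedily on that sample
        r = R_true[i, j] + rng.normal(0.0, sigma)  # observe a noisy rating
        regret += R_true[i].max() - R_true[i, j]
        w = np.exp(-0.5 * ((r - u[:, i] * v[:, j]) / sigma) ** 2)  # reweight
        idx = rng.choice(D, size=D, p=w / w.sum())                 # resample
        u = u[idx] + rng.normal(0.0, 0.01, size=(D, N))            # jitter "move"
        v = v[idx] + rng.normal(0.0, 0.01, size=(D, M))
    return regret / T

R_true = np.outer([1.0, 0.5, 0.8], [0.2, 1.0, 0.6, 0.4])
avg_regret = pts_demo(R_true)
```

In the full algorithm, the jittered resample-move step is replaced by the Rao-Blackwellized update of Algorithm 1, which resamples only the affected U_{i_t} and V_{j_t} from their exact conditional Gaussians.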
At each time t, the complexity is O(((N̂ + M̂)K² + K³)D), where N̂ and M̂ are the maximum number of users who have rated the same item and the maximum number of items rated by the same user, respectively. The dependency on K³ arises from having to invert the precision matrix, but this is not a concern since the rank K is typically small. Line 24 can be replaced by an incremental update with caching: after line 22, we can incrementally update Λ^v_j and ζ^v_j for all items j previously rated by the current user i. This reduces the complexity to O((M̂K² + K³)D), a potentially significant improvement in a real recommendation system, where each user tends to rate only a small number of items.

²When there are fewer users than items, a similar strategy can be derived to integrate out U and σ_V instead.
³This is not inconsistent with our previous statement that conceptually a particle represents only a point-mass distribution δ_{(V,σ_U)}.

4 Analysis

We believe that the regret of PTS can be bounded. However, the existing work on TS and bandits does not provide sufficient tools for a proper analysis of our algorithm. In particular, while existing techniques can provide O(log T) (or O(√T) gap-independent) regret bounds for our problem, these bounds are typically linear in the number of entries of the observation matrix R (or at least linear in the number of users), which is typically very large compared to T. Thus, an ideal regret bound in our setting is one that has a sub-linear dependency (or no dependency at all) on the number of users.
A key obstacle to achieving this is that, while the conditional posteriors of U and V are Gaussian, neither their marginal nor their joint posteriors belong to well-behaved classes (e.g., conjugate posteriors, or distributions with closed forms). Thus, novel tools that can handle generic posteriors are needed for an efficient analysis. Moreover, in the general setting, the correlation between R^o and the latent features U and V is non-linear (see, e.g., [10, 11, 12] for more details). As existing techniques are typically designed for efficiently learning linear regressions, they are not suitable for our problem. Nevertheless, we show how to bound the regret of TS in the very specific case of n × m rank-1 matrices, and we leave the generalization of these results for future work.

In particular, we analyze the regret of PTS in the setting of Gopalan et al. [13]. We model our problem as follows. The parameter space is Θ_u × Θ_v, where Θ_u = {d, 2d, . . . , 1}^{N×1} and Θ_v = {d, 2d, . . . , 1}^{M×1} are discretizations of the parameter spaces of the rank-1 factors u and v for some integer 1/d. For the sake of theoretical analysis, we assume that PTS can sample from the full posterior. We also assume that r_{i,j} ∼ N(u*_i v*_j, σ²) for some u* ∈ Θ_u and v* ∈ Θ_v. Note that in this setting, the highest-rated item in expectation is the same for all users. We denote this item by j* = argmax_{1≤j≤M} v*_j and assume that it is uniquely optimal, u*_i v*_{j*} > u*_i v*_j for any j ≠ j*. We leverage these properties in our analysis. The random variable X_t at time t is a pair of a random rating matrix R_t = {r_{i,j}}_{i=1,j=1}^{N,M} and a random row 1 ≤ i_t ≤ N. The action A_t at time t is a column 1 ≤ j_t ≤ M. The observation is Y_t = (i_t, r_{i_t,j_t}). We bound the regret of PTS as follows.

Theorem 1. For any δ ∈ (0, 1) and ε ∈ (0, 1), there exists T* such that PTS on Θ_u × Θ_v recommends items j ≠ j* in T ≥ T* steps at most (2M ((1+ε)/(1−ε)) (σ²/d⁴) log T + B) times with probability of at least 1 − δ, where B is a constant independent of T.

Proof. By Theorem 1 of Gopalan et al. [13], the number of recommendations j ≠ j* is bounded by C(log T) + B, where B is a constant independent of T. Now we bound C(log T) by counting the number of times that PTS selects models that cannot be distinguished from (u*, v*) after observing Y_t under the optimal action j*. Let:

  Θ_j = { (u, v) ∈ Θ_u × Θ_v : ∀i: u_i v_{j*} = u*_i v*_{j*}, v_j ≥ max_{k≠j} v_k }

be the set of such models where action j is optimal. Suppose that our algorithm chooses model (u, v) ∈ Θ_j. Then the KL divergence between the distributions of ratings r_{i,j} under models (u, v) and (u*, v*) is bounded from below as:

  D_KL(u_i v_j ‖ u*_i v*_j) = (u_i v_j − u*_i v*_j)² / (2σ²) ≥ d⁴ / (2σ²)

for any i. The last inequality follows from the fact that u_i v_j ≥ u_i v_{j*} = u*_i v*_{j*} > u*_i v*_j, because j* is uniquely optimal in (u*, v*), and from |u_i v_j − u*_i v*_j| ≥ d², because the granularity of our discretization is d. Let i_1, . . . , i_n be any n row indices. Then the KL divergence between the distributions of the ratings in positions (i_1, j), . . . , (i_n, j) under models (u, v) and (u*, v*) is Σ_{t=1}^n D_KL(u_{i_t} v_j ‖ u*_{i_t} v*_j) ≥ n d⁴/(2σ²). By Theorem 1 of Gopalan et al. [13], the models (u, v) ∈ Θ_j are unlikely to be chosen by PTS in T steps when Σ_{t=1}^n D_KL(u_{i_t} v_j ‖ u*_{i_t} v*_j) ≥ log T. This happens after at most n ≥ 2 ((1+ε)/(1−ε)) (σ²/d⁴) log T selections of (u, v) ∈ Θ_j. Now we apply the same argument to all Θ_j, M − 1 in total, and sum up the corresponding regrets.

Remarks: Note that Theorem 1 implies an O(2M ((1+ε)/(1−ε)) (σ²/d⁴) log T) regret bound that holds with high probability. Here, d² plays the role of a gap Δ, the smallest possible difference between the expected ratings of an item j ≠ j* in any row i. In this sense, our result is O((1/Δ²) log T) and is of a similar magnitude to the results in Gopalan et al. [13]. While we restrict u*, v* ∈ (0, 1]^{K×1} in the proof, this does not affect the algorithm. In fact, the proof only focuses on high-probability events where the samples from the posterior are concentrated around the true parameters and are thus within (0, 1]^{K×1} as well. Extending our proof to the general setting is not trivial. In particular, moving from discretized parameters to continuous space introduces the above-mentioned ill-behaved posteriors, while increasing the value of K violates the fact that the best item is the same for all users, which allowed us to eliminate N from the regret bound.

5 Experiments and Results

The goal of our experimental evaluation is twofold: (i) to evaluate the PTS algorithm for making online recommendations against various baseline algorithms on several real-world datasets, and (ii) to understand the qualitative performance and intuition of PTS.

5.1 Dataset description

We use a synthetic dataset and five real-world datasets to evaluate our approach.
The synthetic dataset is generated as follows. First, we generate the user and item latent features (U and V) of rank K by drawing their entries from Gaussian distributions N(0, σ_u²) and N(0, σ_v²), respectively. The true rating matrix is then R* = UV^T. We generate the observed rating matrix R from R* by adding Gaussian noise N(0, σ²) to the true ratings. The five real-world datasets are Movielens 100k, Movielens 1M, Yahoo Music⁴, Book crossing⁵, and EachMovie, as shown in Table 1.

              Movielens 100k   Movielens 1M   Yahoo Music   Book crossing   EachMovie
  # users           943             6040          15400           6841         36656
  # items          1682             3900           1000           5644          1621
  # ratings        100k               1M        311,704            90k         2.58M

Table 1: Characteristics of the datasets used in our study

5.2 Baseline measures

There are no current approaches available that simultaneously learn both the user and item factors by sampling from the posterior in a bandit setting. From the currently available algorithms, we choose two kinds of baseline methods: one that sequentially updates the posterior of the user features only, while fixing the item features to a point estimate (ICF), and another that updates the MAP estimates of the user and item features via stochastic gradient descent (SGD-Eps). A key challenge for online algorithms is unbiased offline evaluation. One problem in the offline setting is the partial information available about user feedback, i.e., we only have information about the items that the user rated. In our experiments, we restrict the recommendation space of all the algorithms to the items that the user rated in the entire dataset, which makes it possible to empirically measure regret at every interaction.
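Restricting the candidate set to items the user actually rated is what makes per-interaction regret measurable offline; one step of this protocol can be sketched as follows (names are ours):

```python
import numpy as np

def offline_regret_step(scores, rated_items, true_ratings):
    """Recommend only among items this user rated, so regret is observable.

    scores: model scores for all M items; rated_items: indices the user rated
    in the full dataset; true_ratings: the user's ratings, aligned with rated_items.
    """
    cand = np.asarray(rated_items)
    pick = int(np.argmax(scores[cand]))     # best-scoring *rated* item
    regret = true_ratings.max() - true_ratings[pick]
    return int(cand[pick]), float(regret)

scores = np.array([0.1, 0.9, 0.3, 0.7])
item, regret = offline_regret_step(scores, [0, 2, 3], np.array([4.0, 2.0, 5.0]))
```

Here the model's top-scoring item overall (index 1) is never eligible because the user never rated it; the recommendation is made among indices {0, 2, 3} only.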
The baseline measures are as follows:
1) Random: at each iteration, we recommend a random movie to the user.
2) Most Popular: at each iteration, we recommend the most popular movie, restricted to the movies rated by the user in the dataset. Note that this is an unrealistically optimistic baseline for an online algorithm, as it is not possible to know the global popularity of the items beforehand.
3) ICF: the ICF algorithm [2] proceeds by first estimating the user and item latent factors (U and V) over an initial training period and thereafter, for every interaction, updating only the user features (U) while keeping the item features (V) fixed. We run two scenarios for the ICF algorithm, in which we use 20% (icf-20) and 50% (icf-50) of the data as the training period, respectively. During this training period, we randomly recommend a movie to the user when computing the regret. We use the PMF implementation of [5] for estimating U and V.
4) SGD-Eps: we learn the latent factors using an online variant of the PMF algorithm [5]. We use stochastic gradient descent to update the latent factors with a mini-batch size of 50. In order to make a recommendation, we use the ε-greedy strategy: we recommend the item with the highest U_i V_j^T with probability ε and make a random recommendation otherwise (ε is set to 0.95 in our experiments).

⁴http://webscope.sandbox.yahoo.com/
⁵http://www.bookcrossing.com

5.3 Results on Synthetic Dataset

We generated the synthetic dataset as described earlier and ran the PTS algorithm with 100 particles for recommendations. We simulate the setting of Section 3 and assume that at time t a random user i_t arrives and the system recommends an item j_t. The user rates the recommended item r_{i_t,j_t}, and we evaluate the performance of the model by computing the expected cumulative regret defined in Eq. (6). Fig.
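The selection rule of the SGD-Eps baseline can be sketched as below; note the convention from the description above that ε = 0.95 is the probability of the greedy (exploit) choice, so exploration happens only 5% of the time:

```python
import numpy as np

def eps_greedy(scores, eps=0.95, rng=None):
    """Exploit (argmax of scores) with probability eps, else pick uniformly.

    Per the baseline description, eps is the probability of the greedy choice.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(np.argmax(scores))
    return int(rng.integers(len(scores)))

rng = np.random.default_rng(0)
choices = [eps_greedy(np.array([0.1, 0.9, 0.3]), rng=rng) for _ in range(1000)]
```

Unlike Thompson sampling, this rule explores uniformly at random, without regard to the model's uncertainty about each item.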
2 shows the cumulative regret of the algorithm on the synthetic data, averaged over 100 runs, for different sizes of the matrix and numbers of latent features K. The cumulative regret increases sub-linearly with the number of interactions, which gives us confidence that our approach works well on the synthetic dataset.

Figure 2: Cumulative regret on different sizes of the synthetic data and K, averaged over 100 runs. Panels: (a) N, M = 10, K = 1; (b) N, M = 20, K = 1; (c) N, M = 30, K = 1; (d) N, M = 10, K = 2; (e) N, M = 10, K = 3.

5.4 Results on Real Datasets

Figure 3: Comparison with baseline methods on five datasets: (a) Movielens 100k; (b) Movielens 1M; (c) Yahoo Music; (d) Book Crossing; (e) EachMovie.

Next, we evaluate our algorithms on the five real datasets and compare them to the various baseline algorithms. We subtract the mean rating from the data to center it at zero. To simulate an extreme cold-start scenario, we start from an empty set of users and ratings. We then iterate over the datasets and assume that a random user i_t arrives at time t; the system recommends an item j_t constrained to the items rated by this user in the dataset. We use K = 2 for all the algorithms and 30 particles for our approach. For PTS, we set σ² = 0.5, σ_u² = 1, and σ_v² = 1. For PTS-B (the Bayesian version; see Algo. 1 for details), we set σ² = 0.5 and the initial shape parameters of the Gamma distribution to α = 2 and β = 0.5. For both ICF-20 and ICF-50, we set σ² = 0.5 and σ_u² = 1. Fig. 3 shows the cumulative regret of all the algorithms on the five datasets⁶. Our approach performs significantly better than the baseline algorithms on this diverse set of datasets. PTS-B, with no parameter tuning, performs slightly better than PTS and achieves the best regret.
It is important to note that both PTS and PTS-B perform comparably to, or even better than, the "most popular" baseline, despite not knowing the global popularity in advance. Note that ICF is very sensitive to the length of the initial training period, and it is not clear how to set this a priori.

⁶ICF-20 fails to run on the Bookcrossing dataset, as the 20% data is too sparse for the PMF implementation.

Figure 4: (a) MSE on the Movielens 1M dataset; the red line is the MSE of the PMF algorithm. (b) Performance of the RBPF (blue line) compared to a vanilla PF (red line) on a synthetic dataset with N, M = 10. (c) Movie feature vectors for a movie with 384 ratings; the red dot is the feature vector from the ICF-20 algorithm (using 73 ratings).
PTS-20 is the feature vector at 20% of the data (green dots) and PTS-100 at 100% (blue dots).

We also evaluate the performance of our model in an offline setting as follows: We divide the datasets into training and test sets and iterate over the training triplets (i_t, j_t, r_t), pretending that j_t is the movie recommended by our approach, and update the latent factors according to the RBPF. We compute the recovered matrix R̂ as the average prediction UV^T over the particles at each time step and compute the mean squared error (MSE) on the test set at each iteration. Unlike batch methods such as PMF, which take multiple passes over the data, our method is designed to have bounded update complexity at each iteration. We ran the algorithm using 80% of the data for training and the rest for testing, and computed the MSE by averaging the results over 5 runs. Fig. 4(a) shows the average MSE on the Movielens 1M dataset. Our MSE (0.7925) is comparable to the PMF MSE (0.7718), shown by the red line. This demonstrates that the RBPF performs reasonably well for matrix factorization. In addition, Fig. 4(b) shows that, on the synthetic dataset, the vanilla PF suffers from degeneracy, as seen by its high variance. To understand why fixing the latent item features V, as done in ICF, does not work, we perform the following experiment: We run the ICF algorithm on the Movielens 100k dataset, using 20% of the data for training. At this point the ICF algorithm fixes the item features V and only updates the user features U. Next, we run our algorithm and obtain the latent features. We examine the features of one selected movie from the particles at two points in time: when the ICF algorithm fixes them at 20% of the data, and at the end, as shown in Fig. 4(c).
It shows that the movie features have evolved to a different location, and hence fixing them early is not a good idea.
6 Related Work
Probabilistic matrix completion in a bandit setting was introduced in the prior work of Zhao et al. [2]. The ICF algorithm in [2] approximates the posterior of the latent item features by a single point estimate. Several other bandit algorithms for recommendations have been proposed. Valko et al. [14] proposed a bandit algorithm for content-based recommendations. In this approach, the features of the items are extracted from a similarity graph over the items, which is known in advance. The preferences of each user for the features are learned independently by regressing the ratings of the items on their features. The key difference in our approach is that we also learn the features of the items; in other words, we learn both the user and item factors, U and V, while [14] learn only U. Kocak et al. [15] combine the spectral bandit algorithm of [14] with TS. Gentile et al. [16] propose a bandit algorithm for recommendations that clusters users in an online fashion based on the similarity of their preferences. The preferences are learned by regressing the ratings of the items on their features. The features of the items are the input of the learning algorithm, and they only learn U. Maillard et al. [17] study a bandit problem where the arms are partitioned into unknown clusters; our work is more general.
7 Conclusion
We have proposed an efficient method for carrying out matrix factorization (R ≈ UV^T) in a bandit setting. The key novelty of our approach is the combined use of Rao-Blackwellized particle filtering and Thompson sampling (PTS) in matrix-factorization recommendation. This allows us to simultaneously update the posterior probabilities of U and V in an online manner while minimizing the cumulative regret.
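The Thompson-sampling step this combination enables can be illustrated in a few lines: draw one particle (a posterior sample of the factors) and act greedily with respect to it. This is a simplified sketch of the idea only; the weighted particle set and the RBPF update applied after the rating is observed are not shown.

```python
import numpy as np

def pts_recommend(particles, weights, i, candidates, rng):
    """One Thompson-sampling step: draw a particle (a posterior sample
    of U and V) with probability proportional to its weight, then
    recommend the candidate item maximizing the sampled score
    u_i^T v_j for user i."""
    k = rng.choice(len(particles), p=weights)
    U, V = particles[k]
    return max(candidates, key=lambda j: U[i] @ V[j])
```

Sampling a full particle, rather than acting on a point estimate of V as ICF does, is what lets the algorithm keep exploring plausible item factors.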
The state of the art, until now, was either to use point estimates of both U and V, or to use a point estimate of one of the factors (e.g., U) and update the posterior of the other (V). PTS results in substantially better performance on a wide variety of real-world datasets.

References

[1] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[2] Xiaoxue Zhao, Weinan Zhang, and Jun Wang. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1411–1420. ACM, 2013.

[3] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127–135, 2013.

[5] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, 2007.

[6] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887, 2008.

[7] Nicolas Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.

[8] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[9] Arnaud Doucet, Nando de Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks.
In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183. Morgan Kaufmann Publishers Inc., 2000.

[10] A. Gelman and X. L. Meng. A note on bivariate distributions that are conditionally normal. The American Statistician, 45:125–126, 1991.

[11] B. C. Arnold, E. Castillo, J. M. Sarabia, and L. Gonzalez-Vega. Multiple modes in densities with normal conditionals. Statistics & Probability Letters, 49:355–363, 2000.

[12] B. C. Arnold, E. Castillo, and J. M. Sarabia. Conditionally specified distributions: An introduction. Statistical Science, 16(3):249–274, 2001.

[13] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, pages 100–108, 2014.

[14] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[15] Tomáš Kocák, Michal Valko, Rémi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[16] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. arXiv preprint arXiv:1401.8257, 2014.

[17] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits.
In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 136–144, 2014.