{"title": "Automatic Feature Induction for Stagewise Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 314, "page_last": 322, "abstract": "Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms.", "full_text": "Automatic Feature Induction\n\nfor Stagewise Collaborative Filtering\n\nJoonseok Leea, Mingxuan Suna, Seungyeon Kima, Guy Lebanona, b\n\na College of Computing, Georgia Institute of Technology, Atlanta, GA 30332\n\n{jlee716, msun3, seungyeon.kim}@gatech.edu, lebanon@cc.gatech.edu\n\nb Google Research, Mountain View, CA 94043\n\nAbstract\n\nRecent approaches to collaborative \ufb01ltering have concentrated on estimating an\nalgebraic or statistical model, and using the model for predicting missing ratings.\nIn this paper we observe that different models have relative advantages in differ-\nent regions of the input space. This motivates our approach of using stagewise\nlinear combinations of collaborative \ufb01ltering algorithms, with non-constant com-\nbination coef\ufb01cients based on kernel smoothing. The resulting stagewise model\nis computationally scalable and outperforms a wide selection of state-of-the-art\ncollaborative \ufb01ltering algorithms.\n\n1\n\nIntroduction\n\nRecent approaches to collaborative \ufb01ltering (CF) have concentrated on estimating an algebraic or\nstatistical model, and using the model for predicting the missing rating of user u on item i. 
We denote CF methods as f(u, i), and the family of potential CF methods as F.\nEnsemble methods, which combine multiple models from F into a \u201cmeta-model\u201d, have been a significant research direction in classification and regression. Linear combinations of K models,\n\nF^(K)(x) = \u2211_{k=1}^{K} \u03b1_k f_k(x),    (1)\n\nwhere \u03b1_1, . . . , \u03b1_K \u2208 R and f_1, . . . , f_K \u2208 F, such as boosting or stagewise linear regression and stagewise logistic regression, enjoy a significant performance boost over the single top-performing model. This is not surprising, since (1) includes as a degenerate case each of the models f \u2208 F by itself. Stagewise models are greedy incremental models of the form\n\n(\u03b1_k, f_k) = arg min_{\u03b1_k \u2208 R, f_k \u2208 F} Risk(F^(k\u22121) + \u03b1_k f_k),    k = 1, . . . , K,    (2)\n\nwhere the parameters of F^(K) are estimated one by one, without modifying previously selected parameters. Stagewise models have two important benefits: (a) a significant resistance to overfitting, and (b) computational scalability to large data and high K.\nIt is somewhat surprising that ensemble methods have had relatively little success in the collaborative filtering literature. Generally speaking, ensemble or combination methods have shown only a minor improvement over the top-performing CF methods. The cases where ensemble methods did show an improvement (for example, the Netflix prize winner [10] and runner-up) relied heavily on manual feature engineering, manual parameter setting, and other tinkering.\nThis paper follows up on an experimental discovery: different recommendation systems perform better than others for some users and items but not for others. 
In other words, the relative strengths of two distinct CF models f_1(u, i), f_2(u, i) \u2208 F depend on the user u and the item i whose rating is being predicted.\n\n\fFigure 1: Test set loss (mean absolute error) of two simple algorithms (user average and item average) on items with different numbers of ratings.\n\nOne example of two such systems appears in Figure 1, which graphs the test-set loss of two recommendation rules (user average and item average) as a function of the number of available ratings for the recommended item i. The two recommendation rules outperform each other, depending on whether the item in question has few or many ratings in the training data. We conclude from this graph and other comprehensive experiments [14] that algorithms that are inferior in some circumstances may be superior in other circumstances.\nThe inescapable conclusion is that the weights \u03b1_k in the combination should be functions of u and i rather than constants:\n\nF^(K)(u, i) = \u2211_{k=1}^{K} \u03b1_k(u, i) f_k(u, i),    (3)\n\nwhere \u03b1_k(u, i) \u2208 R and f_k \u2208 F for k = 1, . . . , K. In this paper we explore the use of such models for collaborative filtering, where the weight functions \u03b1_k(u, i) are learned from data. A major part of our contribution is a feature induction strategy to identify feature functions expressing useful locality information. Our experimental study shows that the proposed method outperforms a wide variety of state-of-the-art and traditional methods, and also outperforms other CF ensemble methods.\n\n2 Related Work\n\nMany memory-based CF methods predict the rating of items based on the similarity of the test user and the training users [21, 3, 6]. Similarity measures include Pearson correlation [21] and vector cosine similarity [3, 6]. 
Other memory-based CF methods include item-based CF [25] and a non-parametric probabilistic model based on ranking preference similarities [28].\nModel-based CF includes user and item clustering [3, 29, 32], Bayesian networks [3], dependency networks [5], and probabilistic latent variable models [19, 17, 33]. Slope-one [16] achieves fast and reasonably accurate prediction. The state-of-the-art methods, including the Netflix competition winner, are based on matrix factorization. The factorized matrix can be used to fill out the unobserved entries of the user-rating matrix in a way similar to latent factor analysis [20, 12, 9, 13, 24, 23, 11].\nSome recent work suggests that combining different CF models may improve prediction accuracy. Specifically, a memory-based method linearly combined with a latent factor method [1, 8] retains the advantages of both models. Ensembles of maximum margin matrix factorizations were explored to improve the result of a single MMMF model in [4]. A mixture of experts model is proposed in [27] to linearly combine the prediction results of more than two models. In many cases, there is significant manual intervention, such as setting the combination weights manually.\nFeature-weighted linear stacking [26] is the ensemble method most closely related to our approach. The primary difference is the manual selection of features in [26], as opposed to the automatic induction of local features in our paper, which leads to a significant improvement in prediction quality. 
Model combination based on locality has been proposed in other machine learning topics, such as classification [31, 18] or sensitivity estimation [2].\n\n\f3 Combination of CF Methods with Non-Constant Weights\n\nRecalling the linear combination (3) from Section 1, we define non-constant combination weights \u03b1_k(u, i) that are functions of the user and item being predicted. We propose the following algebraic form:\n\n\u03b1_k(u, i) = \u03b2_k h_k(u, i),    \u03b2_k \u2208 R,    h_k \u2208 H,    (4)\n\nwhere \u03b2_k is a parameter and h_k is a function selected from a family H of candidate feature functions. The combination (3) with non-constant weights (4) enables some CF methods f_k to be emphasized for some user-item combinations through an appropriate selection of the \u03b2_k parameters. We assume that H contains the constant function, capturing the constant-weight combination within our model. Substituting (4) into (3) we get\n\nF^(K)(u, i) = \u2211_{k=1}^{K} \u03b2_k h_k(u, i) f_k(u, i),    \u03b2_k \u2208 R,    h_k \u2208 H,    f_k \u2208 F.    (5)\n\nNote that since h_k and f_k are selected from the sets of feature functions and CF methods respectively, we may have f_j = f_l or h_j = h_l for j \u2260 l. This is similar to boosting and other stagewise algorithms, where one feature or base learner may be chosen multiple times, effectively updating its associated feature functions and parameters. The total weight function associated with a particular f \u2208 F is \u2211_{k: f_k = f} \u03b2_k h_k(u, i).\nA simple way to fit \u03b2 = (\u03b2_1, . . . , \u03b2_K) is least squares:\n\n\u03b2\u2217 = arg min_{\u03b2 \u2208 C} \u2211_{u,i} (F^(K)(u, i) \u2212 R_{u,i})^2,    (6)\n\nwhere R_{u,i} denotes the rating of user u on item i in the training data and the summation ranges over all ratings in the training set. A variation of (6), where \u03b2 is constrained such that \u03b1_k(u, i) \u2265 0 and \u2211_{k=1}^{K} \u03b1_k(u, i) = 1, endows F with the following probabilistic interpretation:\n\nF(u, i) = E_p{f | u, i},    (7)\n\nwhere f represents a random draw from F, with probabilities p(f | u, i) proportional to \u2211_{k: f_k = f} \u03b2_k h_k(u, i). In contrast to standard combination models with fixed weights, (7) forms a conditional expectation, rather than an expectation.\n\n4 Inducing Local Features\n\nIn contrast to [26], which manually defined 25 features, we induce the features h_k from data. The features h_k(u, i) should emphasize users u and items i that are likely to lead to variations in the relative strength of f_1, . . . , f_K. We consider below two issues: (i) defining the set H of candidate features, and (ii) a strategy for selecting features from H to add to the combination F.\n\n4.1 Candidate Feature Families H\n\nWe denote the sets of users and items by U and I respectively, and the domain of f \u2208 F and h \u2208 H as \u2126 = U \u00d7 I. 
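The least-squares fit (6) becomes an ordinary least-squares problem once the per-rating products h_k(u, i) f_k(u, i) are precomputed. A minimal sketch follows; the matrix G, vector r, and the helper name fit_beta are our own illustration, not from the paper:

```python
import numpy as np

def fit_beta(G, r):
    # Least-squares fit of the combination weights beta, as in (6):
    # G[n, k] holds the precomputed product h_k(u, i) * f_k(u, i) for
    # the n-th observed rating, and r[n] is that rating R_{u,i}.
    beta, *_ = np.linalg.lstsq(G, r, rcond=None)
    return beta

# Toy example with K = 2 combined predictors and 3 observed ratings.
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.array([2.0, 3.0, 5.0])
beta = fit_beta(G, r)
```

For the unconstrained case C = R^K this matches (6) directly; the probabilistic variant (7) would additionally need non-negativity and sum-to-one constraints on the induced weights.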
The set R \u2282 \u2126 is the set of user-item pairs present in the training set, and the user-item pairs \u03c9 being predicted lie in \u2126 \\ R.\nWe consider the following three unimodal functions on \u2126, parameterized by a location parameter or mode \u03c9\u2217 = (u\u2217, i\u2217) \u2208 \u2126 and a bandwidth h > 0:\n\nK^(1)_{h,(u\u2217,i\u2217)}(u, i) \u221d (1 \u2212 d(u\u2217, u)/h) I(d(u\u2217, u) \u2264 h),\nK^(2)_{h,(u\u2217,i\u2217)}(u, i) \u221d (1 \u2212 d(i\u2217, i)/h) I(d(i\u2217, i) \u2264 h),\nK^(3)_{h,(u\u2217,i\u2217)}(u, i) \u221d (1 \u2212 d(u\u2217, u)/h) I(d(u\u2217, u) \u2264 h) \u00b7 (1 \u2212 d(i\u2217, i)/h) I(d(i\u2217, i) \u2264 h),    (8)\n\nwhere I(A) = 1 if A holds and 0 otherwise. The first function is unimodal in u, centered around u\u2217, and constant in i. The second function is unimodal in i, centered around i\u2217, and constant in u. The third is unimodal in u and i, and centered around (u\u2217, i\u2217).\nThere are several possible choices for the distance functions in (8) between users and between items. For simplicity, we use in our experiments the angular distance\n\nd(x, y) = arccos(\u27e8x, y\u27e9 / (\u2016x\u2016 \u00b7 \u2016y\u2016)),    (9)\n\nwhere the inner products above are computed based on the user-item rating matrix expressing the training set (ignoring entries not present in both arguments).\nThe functions (8) are the discrete analogs of the triangular kernel K_h(x) = h^{\u22121}(1 \u2212 |x \u2212 x\u2217|/h) I(|x \u2212 x\u2217| \u2264 h) used in non-parametric kernel smoothing [30]. Their values decay linearly with the distance from their mode (truncated at zero), and feature a bandwidth parameter h controlling the rate of decay. 
As h increases, the support size |{\u03c9 \u2208 \u2126 : K(\u03c9) > 0}| increases and max_{\u03c9 \u2208 \u2126} K(\u03c9) decreases.\nThe unimodal feature functions (8) capture locality in the \u2126 space by measuring proximity to a mode, representing a user u\u2217, an item i\u2217, or a user-item pair. We define the family of candidate features H as all possible additive mixtures or max-mixtures of the functions (8), parameterized by a set of multiple modes \u03c9\u2217 = {\u03c9\u2217_1, . . . , \u03c9\u2217_r}:\n\nK_{\u03c9\u2217}(u, i) \u221d \u2211_{j=1}^{r} K_{\u03c9\u2217_j}(u, i),    (10)\n\nK_{\u03c9\u2217}(u, i) \u221d max_{j=1,...,r} K_{\u03c9\u2217_j}(u, i).    (11)\n\nUsing this definition, feature functions h_k(u, i) \u2208 H are able to express a wide variety of locality information involving multiple potential modes.\nWe discuss next the strategy for identifying useful features from H and adding them to the model F in a stagewise manner.\n\n4.2 Feature Induction Strategy\n\nAdapting the stagewise learning approach to the model (5) we have\n\nF^(K)(u, i) = \u2211_{k=1}^{K} \u03b2_k h_k(u, i) f_k(u, i),\n\n(\u03b2_k, h_k, f_k) = arg min_{\u03b2_k \u2208 R, h_k \u2208 H, f_k \u2208 F} \u2211_{(u,i) \u2208 R} (F^(k\u22121)(u, i) + \u03b2_k h_k(u, i) f_k(u, i) \u2212 R_{u,i})^2.    (12)\n\nIt is a well-known fact that stagewise algorithms sometimes outperform non-greedy algorithms due to resistance to overfitting (see [22], for example). This explains the good generalization ability of boosting and stagewise linear regression.\nFrom a computational standpoint, (12) scales nicely with K and with the training set size. The one-dimensional quadratic optimization with respect to \u03b2 is solved in closed form, but the optimization over F and H has to be done by brute force or by some approximate method such as sampling. 
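The stagewise search in (12) amounts to a double loop over H and F, with a closed-form \u03b2 for each candidate pair. A simplified sketch under our own naming; regularization and validation-based stopping are omitted:

```python
def stagewise_fit(ratings, F, H, K):
    # Greedy stagewise fit in the spirit of (12): at each stage, pick
    # the (h, f) pair whose closed-form beta most reduces the squared
    # error on the current residuals; earlier stages are never revisited.
    # ratings: list of ((u, i), r) training triples.
    # F, H: lists of callables f(u, i) and h(u, i).
    residual = {ui: r for ui, r in ratings}
    model = []  # list of (beta, h, f) stages
    for _ in range(K):
        best = None
        for h in H:
            for f in F:
                g = {ui: h(*ui) * f(*ui) for ui, _ in ratings}
                gg = sum(v * v for v in g.values())
                if gg == 0.0:
                    continue
                # closed-form beta minimizing sum((residual - beta * g)^2)
                beta = sum(g[ui] * residual[ui] for ui, _ in ratings) / gg
                gain = beta * beta * gg  # reduction in squared error
                if best is None or gain > best[0]:
                    best = (gain, beta, h, f)
        if best is None:
            break
        _, beta, h, f = best
        model.append((beta, h, f))
        for ui, _ in ratings:
            residual[ui] -= beta * h(*ui) * f(*ui)
    return model
```

Each stage touches every training rating once per (h, f) candidate, which is where the per-iteration cost discussed next comes from.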
The computational complexity of each iteration is thus O(|H| \u00b7 |F| \u00b7 |R|), assuming no approximations are performed.\nSince we consider relatively small families F of CF methods, the optimization over F does not pose a substantial problem. The optimization over H is more problematic, since H is potentially infinite, or otherwise very large. We address this difficulty by restricting H to a finite collection of additive or max-mixture kernels with r modes, randomly sampled from the users or items present in the training data. Our experiments show that it is possible to find useful features from a surprisingly small number of randomly-chosen samples.\n\n\f5 Experiments\n\nWe describe below the experimental setting, followed by the experimental results and conclusions.\n\n5.1 Experimental Design\n\nWe used the recommendation algorithm toolkit PREA [15] for candidate algorithms, including three simple baselines (Constant model, User Average, and Item Average), five matrix-factorization methods (Regularized SVD, NMF [13], PMF [24], Bayesian PMF [23], and Non-Linear PMF [12]), and Slope-one [16]. This list includes traditional baselines as well as state-of-the-art CF methods that were proposed recently in the research literature. We evaluate the performance using the Root Mean Squared Error (RMSE), measured on the test set.\nTable 1 lists the 5 experimental settings. SINGLE runs each CF algorithm individually, and chooses the one with the best average performance. CONST combines all candidate algorithms with constant weights as in (1). FWLS combines all candidate algorithms with non-constant weights as in (3) [26]. For CONST and FWLS, the weights are estimated from data by solving a least-squares problem. STAGE combines CF algorithms in a stagewise manner. 
FEAT applies the feature induction techniques discussed in Section 4.\nTo evaluate whether the automatic feature induction in FEAT works better or worse than manually constructed features, we used in FWLS and STAGE manual features similar to the ones in [26] (excluding features requiring temporal data). Examples include the number of movies rated per user, the number of users rating each movie, the standard deviation of each user\u2019s ratings, and the standard deviation of each item\u2019s ratings.\nThe feature induction in FEAT used a feature space H with additive multi-mode smoothing kernels as described in Section 4 (for simplicity we avoided kernels unimodal in both u and i). The family H included 200 randomly sampled features (a new sample was taken for each of the iterations of the stagewise algorithm). The r in (11) was set to 5% of the user or item count, and we used bandwidth h values of 0.05 (an extreme case where most features have value either 0 or 1) and 0.8 (each user or item has moderate similarity values). The stagewise algorithm continues until either five consecutive trials fail to improve the RMSE on the validation set, or the iteration number reaches 100, which occurs only in a few cases. We used similar L2 regularization for all methods (both stagewise and non-stagewise), where the regularization parameter was selected among 5 different values based on a validation set.\nWe experimented with the two standard MovieLens datasets, 100K and 1M, and with the Netflix dataset. In the Netflix dataset experiments, we sub-sampled the data since (a) running state-of-the-art candidate algorithms on the full Netflix data takes too long (for example, Bayesian PMF was reported to take 188 hours [23]), and (b) it enables us to run extensive experiments measuring the performance of the CF algorithms as a function of the number of users, the number of items, and voting sparsity, and facilitates cross-validation and statistical tests. 
More specifically, we sub-sampled from the most active M users and the most often rated N items to obtain pre-specified data density levels |R|/|\u2126|. As shown in Table 2, we varied either the user or the item count over the set {1000, 1500, 2000, 2500, 3000}, holding the other variable fixed at 1000 and the density at 1%, which is comparable to the density of the original Netflix dataset. We also conducted an experiment where the data density varied over the set {1%, 1.5%, 2%, 2.5%}, with user and item counts fixed at 1000 each. We set aside a randomly chosen 20% as a test set, and used the remaining 80% both for training the individual recommenders and for learning the ensemble model. It is possible, and perhaps better motivated, to use two distinct training sets for the CF models and for the ensemble. However, in our case, we obtained high performance even when using the same training data in both stages.\n\nMethod   C  W  S  I  Explanation\nSINGLE   -  -  -  -  Best-performing single CF algorithm\nCONST    O  -  -  -  Mixture of CF without features\nFWLS     O  O  -  -  Mixture of CF with manually-designed features\nSTAGE    O  O  O  -  Stagewise mixture with manual features\nFEAT     O  O  O  O  Stagewise mixture with induced features\n\nTable 1: Experimental setting. 
(C: Combination of multiple algorithms, W: Weights varying with features, S: Stage-wise algorithm, I: Induced features)\n\n\fDataset                       Netflix                                                 MovieLens\nUser count    1000    2000    3000    1000    1000    1000    1000    1000     943    6039\nItem count    1000    1000    1000    2000    3000    1000    1000    1000    1682    3883\nDensity       1.0%    1.0%    1.0%    1.0%    1.0%    1.5%    2.0%    2.5%    6.3%    4.3%\n\nSingle CF:\nConstant    1.2188  1.2013  1.2072  1.1964  1.1888  1.2188  1.2235  1.2113  1.2408  1.2590\nUserAvg     1.0566  1.0513  1.0375  1.0359  1.0174  1.0566  1.0318  1.0252  1.0408  1.0352\nItemAvg     1.1260  1.0611  1.0445  1.1221  1.1444  1.1260  1.1029  1.0900  1.0183  0.9789\nSlope1      1.4490  1.4012  1.3321  1.4049  1.3196  1.4490  1.3505  1.0725  0.9371  0.9017\nRegSVD      1.0623  1.0155  1.0083  1.0354  1.0289  1.0343  1.0154  1.0020  0.9098  0.8671\nNMF         1.0784  1.0205  1.0069  1.0423  1.0298  1.0406  1.0151  1.0091  0.9601  0.9268\nPMF         1.6180  1.4824  1.4081  1.4953  1.4804  1.4903  1.3594  1.1818  0.9328  0.9623\nBPMF        1.3973  1.2951  1.2949  1.2566  1.2102  1.3160  1.2021  1.1514  0.9629  0.9000\nNLPMF       1.0561  1.0507  1.0382  1.0361  1.0471  1.0436  1.0382  1.0523  0.9560  0.9415\n\nCombined:\nSINGLE      1.0561  1.0155  1.0069  1.0354  1.0174  1.0343  1.0151  1.0020  0.9098  0.8671\nCONST       1.0429  1.0072  0.9963  1.0198  1.0102  1.0255  0.9968  0.9824  0.9073  0.8660\nFWLS        1.0288  1.0050  0.9946  1.0089  1.0016  1.0179  0.9935  0.9802  0.9010  0.8649\nSTAGE       1.0036  0.9784  0.9668  0.9967  0.9821  0.9935  0.9846  0.9769  0.8961  0.8623\nFEAT        0.9862  0.9607  0.9607  0.9740  0.9717  0.9703  0.9589  0.9492  0.8949  0.8569\np-Value     0.0028  0.0001  0.0003  0.0008  0.0014  0.0002  0.0019  0.0013  0.0014  0.0023\n\nTable 2: Test error in RMSE (lower values are better) for single CF algorithms used as candidates and for combined models. Data where M or N is 1500 or 2500 are omitted due to the lack of space, as shown in Figure 2. 
The best-performing method in each group is indicated in italics. The last row indicates the p-value for the statistical test of the hypothesis FEAT \u227b FWLS.\n\nFigure 2: Performance trend with varied user count (left), item count (middle), and density (right) on the Netflix dataset.\n\nFor stagewise methods, the 80% train set was divided into a 60% training set and a 20% validation set, used to determine when to stop the stagewise addition process. The non-stagewise methods used the entire 80% for training. 10% of the training set was used to select the regularization parameter for both stagewise and non-stagewise methods. The results were averaged over 10 random data samples.\n\n6 Result and Discussion\n\n6.1 Performance Analysis and Example\n\nTable 2 displays the performance in RMSE of each combination method, as well as the individual algorithms. Examining it, we observe the following partial order with respect to prediction accuracy: FEAT \u227b STAGE \u227b FWLS \u227b CONST \u227b SINGLE.\n\n\u2022 FWLS \u227b CONST \u227b SINGLE: Combining CF algorithms (even only with constant weights) produces better predictions than the best single CF method. Also, using non-constant weights improves performance further. This result is consistent with what has been known in the literature [7, 26].\n\u2022 STAGE \u227b FWLS: Figure 2 indicates that stagewise combinations where features are chosen with replacement are more accurate. 
The selection with replacement allows certain features to be selected more than once, correcting a previously inaccurate parameter setting.\n\u2022 FEAT \u227b STAGE: Making use of induced features improves prediction accuracy further over stagewise optimization with manually-designed features.\n\nOverall, our experiments indicate that the combination with non-constant weights and feature induction (FEAT) outperforms three baselines (the best single method, standard combinations with constant weights, and the FWLS method using manually constructed features [26]).\n\n\fFigure 3: Average weight values of each item (top) and user (bottom), sorted from high to low weight of the selected algorithm. Note that the sorting order is similar between algorithm 1 (User Average) and algorithm 2 (Item Average). In contrast, algorithm 8 (NLPMF) has the opposite order, and will be weighted higher in different parts of the data, compared to algorithms 1 and 2.\n\nWe tested the hypothesis RMSE_FEAT < RMSE_FWLS with a paired t-test. Based on the p-values (see the last row in Table 2), we can reject the null hypothesis at the 99% significance level. We conclude that our proposed combination outperforms state-of-the-art methods, and several previously proposed combination methods.\nTo see how feature induction works in detail, we illustrate an example with the case where the user count and item count equal 1000. Figure 3 shows the average weight distribution that each user or item receives under three CF methods: user average, item average, and NLPMF. We focused on these three methods since they are frequently selected by the stagewise algorithm. 
The x-axis variables in the three panels are sorted in order of decreasing weight of the selected algorithm. Note that in each figure, one curve is monotonically decaying, showing the weights of the CF method according to which the sorting was done. An interesting observation is that algorithm 1 (User Average) and algorithm 2 (Item Average) have a similar sorting-order pattern in Figure 3 (right column). In other words, these two algorithms are similar in nature, and are relatively stronger or weaker in similar regions of \u2126. Algorithm 8 (NLPMF), on the other hand, has a very different relative strength pattern.\n\n6.2 Trend Analysis\n\nFigure 2 graphs the RMSE of the different combination methods as a function of the user count, item count, and density. We make the following observations.\n\n\u2022 As expected, prediction accuracy for all combination methods and for the top single method improves with the user count, item count, and density.\n\u2022 The performance gap between the best single algorithm and the combinations tends to decrease with larger user and item counts. This is a manifestation of the law of diminishing returns, and of the fact that the size of a suitable family H capturing locality information increases with the user and item count. Thus, the stagewise procedure becomes more challenging computationally, and less accurate, since in our experiment we sampled the same number of compositions from H, rather than increasing it for larger data.\n\u2022 We note that all combination methods and the single best CF method improve performance as the density increases. 
The improvement seems to be most pronounced for the single best algorithm and for the FEAT method, indicating that FEAT scales up its performance aggressively with increasing density levels.\n\n\f\u2022 Comparing the left and middle panels of Figure 2 implies that having more users is more informative than having more items. In other words, if the total dataset size M \u00d7 N is equal, the performance tends to be better when M > N (left panel of Figure 2) than when M < N (middle panel of Figure 2).\n\n6.3 Scalability\n\nOur proposed stagewise algorithm is very efficient when compared to other feature selection algorithms such as step-wise or subset selection. Nevertheless, the large number of possible features may result in computational issues. In our experiments, we sampled from the space of candidate features a small subset of features that was considered for addition (the random subset is different in each iteration of the stagewise algorithm). In the limit K \u2192 \u221e, such a sampling scheme would recover the optimal ensemble, as each feature will be selected for consideration infinitely often. Our experiments show that this scheme works well in practice and results in significant improvement over the state-of-the-art even for a relatively small sample of feature candidates, such as 200. 
Viewed from another perspective, this implies that randomly selecting such a small subset of features each iteration ensures the selection of useful features. In fact, the features induced in this manner were found to be more useful than the manually crafted features in the FWLS algorithm [26].\n\n7 Summary\n\nWe started from the observation that the relative performance of different candidate recommendation systems f(u, i) depends on u and i, for example on the activity level of user u and the popularity of item i. This motivated the development of combinations of recommendation systems with non-constant weights that emphasize different candidates based on their relative strengths in the feature space. In contrast to the FWLS method, which focused on manual construction of features, we developed a feature induction algorithm that works in conjunction with stagewise least squares. We formulated a family of feature functions based on the discrete analog of triangular kernel smoothing. This family captures a wide variety of local information and is thus able to model the relative strengths of the different CF methods and how they change across \u2126.\nThe combination with induced features outperformed each of the base candidates as well as other combination methods in the literature. This includes the recently proposed FWLS method that uses manually constructed feature functions. As our candidates included many of the recently proposed state-of-the-art recommendation systems, our conclusions are significant for the engineering community as well as for recommendation system scientists.\n\nReferences\n\n[1] R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proc. of the ACM SIGKDD, 2007.\n[2] P. Bennett. Neighborhood-based local sensitivity. In Proc. of the European Conference on Machine Learning, 2007.\n[3] J. Breese, D. Heckerman, and C. Kadie. 
Empirical analysis of predictive algorithms for collaborative filtering. In Uncertainty in Artificial Intelligence, 1998.\n[4] D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proc. of the International Conference on Machine Learning, 2006.\n[5] D. Heckerman, D. Maxwell Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 2000.\n[6] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proc. of ACM SIGIR Conference, 1999.\n[7] M. Jahrer, A. T\u00f6scher, and R. Legenstein. Combining predictions for accurate recommender systems. In Proc. of the ACM SIGKDD, 2010.\n[8] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proc. of the ACM SIGKDD, 2008.\n[9] Y. Koren. Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data, 4(1):1\u201324, 2010.\n[10] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30\u201337, 2009.\n[11] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian matrix factorisation. In Proc. of the International Conference on Artificial Intelligence and Statistics, 2011.\n[12] N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In Proc. of the International Conference on Machine Learning, 2009.\n[13] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.\n[14] J. Lee, M. Sun, and G. Lebanon. A comparative study of collaborative filtering algorithms. ArXiv Report 1205.3193, 2012.\n[15] J. Lee, M. Sun, and G. Lebanon. 
Prea: Personalized recommendation algorithms toolkit. Journal of Machine Learning Research, 13:2699\u20132703, 2012.\n[16] D. Lemire and A. Maclachlan. Slope one predictors for online rating-based collaborative filtering. Society for Industrial Mathematics, 5:471\u2013480, 2005.\n[17] B. Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems, 2004.\n[18] C. J. Merz. Dynamical selection of learning algorithms. Lecture Notes in Statistics, pages 281\u2013290, 1996.\n[19] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In Uncertainty in Artificial Intelligence, 2000.\n[20] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. of the International Conference on Machine Learning, 2005.\n[21] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proc. of the Conference on CSCW, 1994.\n[22] L. Reyzin and R. E. Schapire. How boosting the margin can also boost classifier complexity. In Proc. of the International Conference on Machine Learning, 2006.\n[23] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proc. of the International Conference on Machine Learning, 2008.\n[24] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008.\n[25] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the International Conference on World Wide Web, 2001.\n[26] J. Sill, G. Takacs, L. Mackey, and D. Lin. Feature-weighted linear stacking. ArXiv Report 0911.0460, 2009.\n[27] X. Su, R. 
Greiner, T. M. Khoshgoftaar, and X. Zhu. Hybrid collaborative filtering algorithms using a mixture of experts. In Proc. of the IEEE/WIC/ACM International Conference on Web Intelligence, 2007.\n[28] M. Sun, G. Lebanon, and P. Kidwell. Estimating probabilities in recommendation systems. In Proc. of the International Conference on Artificial Intelligence and Statistics, 2011.\n[29] L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. In AAAI Workshop on Recommendation Systems, 1998.\n[30] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall/CRC, 1995.\n[31] K. Woods, W. P. Kegelmeyer Jr, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405\u2013410, 1997.\n[32] G. R. Xue, C. Lin, Q. Yang, W. S. Xi, H. J. Zeng, Y. Yu, and Z. Chen. Scalable collaborative filtering using cluster-based smoothing. In Proc. of ACM SIGIR Conference, 2005.\n[33] K. Yu, S. Zhu, J. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In Proc. of ACM SIGIR Conference, 2009.\n\f", "award": [], "sourceid": 178, "authors": [{"given_name": "Joonseok", "family_name": "Lee", "institution": null}, {"given_name": "Mingxuan", "family_name": "Sun", "institution": null}, {"given_name": "Seungyeon", "family_name": "Kim", "institution": ""}, {"given_name": "Guy", "family_name": "Lebanon", "institution": ""}]}