{"title": "Geometric Matrix Completion with Recurrent Multi-Graph Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3697, "page_last": 3707, "abstract": "Matrix completion models are among the most common formulations of recommender systems. Recent works have showed a boost of performance of these techniques when introducing the pairwise relationships between users/items in the form of graphs, and imposing smoothness priors on these graphs. However, such techniques do not fully exploit the local stationary structures on user/item graphs, and the number of parameters to learn is linear w.r.t. the number of users and items. We propose a novel approach to overcome these limitations by using geometric deep learning on graphs. Our matrix completion architecture combines a novel multi-graph convolutional neural network that can learn meaningful statistical graph-structured patterns from users and items, and a recurrent neural network that applies a learnable diffusion on the score matrix. Our neural network system is computationally attractive as it requires a constant number of parameters independent of the matrix size. We apply our method on several standard datasets, showing that it outperforms state-of-the-art matrix completion techniques.", "full_text": "Geometric Matrix Completion with Recurrent\n\nMulti-Graph Neural Networks\n\nFederico Monti\n\nUniversit\u00e0 della Svizzera italiana\n\nLugano, Switzerland\n\nfederico.monti@usi.ch\n\nMichael M. Bronstein\n\nUniversit\u00e0 della Svizzera italiana\n\nLugano, Switzerland\n\nmichael.bronstein@usi.ch\n\nXavier Bresson\n\nSchool of Computer Science and Engineering\n\nNTU, Singapore\n\nxbresson@ntu.edu.sg\n\nAbstract\n\nMatrix completion models are among the most common formulations of recom-\nmender systems. 
Recent works have shown a performance boost of these techniques when the pairwise relationships between users/items are introduced in the form of graphs, and smoothness priors are imposed on these graphs. However, such techniques do not fully exploit the local stationary structures on user/item graphs, and the number of parameters to learn is linear w.r.t. the number of users and items. We propose a novel approach to overcome these limitations by using geometric deep learning on graphs. Our matrix completion architecture combines a novel multi-graph convolutional neural network that can learn meaningful statistical graph-structured patterns from users and items, and a recurrent neural network that applies a learnable diffusion on the score matrix. Our neural network system is computationally attractive as it requires a constant number of parameters independent of the matrix size. We apply our method on several standard datasets, showing that it outperforms state-of-the-art matrix completion techniques.

1 Introduction

Recommender systems have become a central part of modern intelligent systems. Recommending movies on Netflix, friends on Facebook, furniture on Amazon, and jobs on LinkedIn are a few examples of the main purpose of these systems. Two major approaches to recommender systems are collaborative [5] and content [32] filtering techniques. Systems based on collaborative filtering use collected ratings of items by users and offer new recommendations by finding similar rating patterns. Systems based on content filtering make use of similarities between items and users to recommend new items. Hybrid systems combine collaborative and content techniques.

Matrix completion. Mathematically, a recommendation method can be posed as a matrix completion problem [9], where columns and rows represent users and items, respectively, and matrix values represent scores determining whether a user would like an item or not.
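To make this setting concrete, here is a minimal numpy sketch of the matrix completion data model described above (toy synthetic data, not from any of the datasets used later; the names `X_true`, `data_fit`, and the chosen sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-rank score matrix: m items x n users, rank 2.
m, n, r = 30, 40, 2
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Omega: indicator mask of the few observed entries (10% here).
mask = rng.random((m, n)) < 0.10
Y = np.where(mask, X_true, 0.0)

def data_fit(X):
    """The masked data-fit term used throughout the paper: ||Omega o (X - Y)||_F^2."""
    return np.linalg.norm(mask * (X - Y), "fro") ** 2

print(data_fit(X_true))  # 0.0 for the ground-truth matrix
```

Completing the matrix means finding an `X` that keeps `data_fit(X)` small while satisfying some structural prior (low rank or graph smoothness, introduced below).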
Given a small subset of known elements of the matrix, the goal is to fill in the rest. A famous example is the Netflix challenge [22], concluded in 2009, which carried a $1M prize for the algorithm that could best predict user ratings for movies based on previous ratings. The size of the Netflix matrix is 480k users × 18k movies (8.5B entries), with only 0.011% known entries.

Recently, there have been several attempts to incorporate geometric structure into matrix completion problems [27, 19, 33, 24], e.g. in the form of column and row graphs representing similarity of users and items, respectively. Such additional information defines e.g. the notion of smoothness of the matrix and was shown beneficial for the performance of recommender systems. These approaches can be generally related to the field of signal processing on graphs [37], extending classical harmonic analysis methods to non-Euclidean domains (graphs).

Geometric deep learning. Of key interest to the design of recommender systems are deep learning approaches. In recent years, deep neural networks and, in particular, convolutional neural networks (CNNs) [25] have been applied with great success to numerous applications. However, classical CNN models cannot be directly applied to the recommendation problem to extract meaningful patterns in users, items and ratings, because these data are not Euclidean-structured, i.e. they do not lie on regular lattices like images but rather on irregular domains like graphs. Recent works applying deep learning to recommender systems used networks with fully connected or auto-encoder architectures [44, 35, 14]. Such methods are unable to extract the important local stationary patterns from the data, which is one of the key properties of CNN architectures.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
New neural networks are necessary, and this has motivated the recent development of geometric deep learning techniques that can mathematically deal with graph-structured data, which arises in numerous applications, ranging from computer graphics and vision [28, 2, 4, 3, 30] to chemistry [12]. We recommend the review paper [6] to the reader not familiar with this line of work.

The earliest attempts to apply neural networks to graphs are due to Scarselli et al. [13, 34] (see the more recent formulations [26, 40]). Bruna et al. [7, 15] formulated CNN-like deep neural architectures on graphs in the spectral domain, employing the analogy between the classical Fourier transform and projections onto the eigenbasis of the graph Laplacian operator [37]. Defferrard et al. [10] proposed an efficient filtering scheme using recurrent Chebyshev polynomials, which reduces the complexity of CNNs on graphs to that of classical (Euclidean) CNNs. This model was later extended to deal with dynamic data [36]. Kipf and Welling [21] proposed a simplification of Chebyshev networks using simple filters operating on 1-hop neighborhoods of the graph. Monti et al. [30] introduced a spatial-domain generalization of CNNs to graphs using local patch operators represented as Gaussian mixture models, showing significantly better generalization across different graphs.

Contributions. We present two main contributions. First, we introduce a new multi-graph CNN architecture that generalizes [10] to multiple graphs. This new architecture is able to extract local stationary patterns from signals defined on multiple graphs simultaneously.
While in this work we apply multi-graph CNNs to the graphs of users and items in the context of recommender systems, our architecture is generic and can be used in other applications, such as neuroscience (autism detection with networks of people and brain connectivity [31, 23]), computer graphics (shape correspondence on product manifolds [41]), or social network analysis (abnormal spending behavior detection with graphs of customers and stores [39]). Second, we approach the matrix completion problem as learning on user and item graphs using the new deep multi-graph CNN framework. Our architecture is based on a cascade of a multi-graph CNN followed by a Long Short-Term Memory (LSTM) recurrent neural network [16], which together can be regarded as a learnable diffusion process that reconstructs the score matrix.

2 Background

2.1 Matrix Completion

Matrix completion problem. Recovering the missing values of a matrix given a small fraction of its entries is an ill-posed problem without additional mathematical constraints on the space of solutions. It is common to assume that the variables lie in a smaller subspace, i.e., that the matrix is of low rank,

min_X rank(X)   s.t.   x_ij = y_ij,  ∀ ij ∈ Ω,   (1)

where X denotes the matrix to recover, Ω is the set of the known entries and y_ij are their values. Unfortunately, rank minimization turns out to be an NP-hard combinatorial problem that is computationally intractable in practical cases.
The tightest possible convex relaxation of problem (1) is to replace the rank with the nuclear norm ‖·‖⋆, equal to the sum of the singular values [8],

min_X ‖X‖⋆ + (μ/2) ‖Ω ∘ (X − Y)‖²_F,   (2)

where the equality constraint is also replaced with a penalty to make the problem more robust to noise (here Ω is the indicator matrix of the known entries Ω and ∘ denotes the Hadamard pointwise product). Candès and Recht [8] proved that under some technical conditions the solutions of problems (2) and (1) coincide.

Geometric matrix completion. An alternative relaxation of the rank operator in (1) can be achieved by constraining the space of solutions to be smooth w.r.t. some geometric structure on the rows and columns of the matrix [27, 19, 33, 1]. The simplest model is a proximity structure represented as an undirected weighted column graph G_c = ({1, …, n}, E_c, W_c) with adjacency matrix W_c = (w^c_ij), where w^c_ij = w^c_ji, w^c_ij = 0 if (i, j) ∉ E_c and w^c_ij > 0 if (i, j) ∈ E_c. In our setting, the column graph could be thought of as a social network capturing relations between users and the similarity of their tastes. The row graph G_r = ({1, …, m}, E_r, W_r), representing the item similarities, is defined similarly.

On each of these graphs one can construct the normalized graph Laplacian, a symmetric positive-semidefinite matrix Δ = I − D^(−1/2) W D^(−1/2), where D = diag(Σ_{j≠i} w_ij) is the degree matrix. We denote the Laplacians associated with the row and column graphs by Δ_r and Δ_c, respectively. Considering the columns (respectively, rows) of the matrix X as vector-valued functions on the column graph G_c (respectively, row graph G_r), their smoothness can be expressed as the Dirichlet norm ‖X‖²_{G_r} = trace(X⊤ Δ_r X) (respectively, ‖X‖²_{G_c} = trace(X Δ_c X⊤)).
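These quantities are straightforward to compute; a minimal numpy sketch on toy two-node graphs (the helper names `normalized_laplacian` and `dirichlet_norm_sq` are ours):

```python
import numpy as np

def normalized_laplacian(W):
    """Delta = I - D^{-1/2} W D^{-1/2} for a symmetric adjacency matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def dirichlet_norm_sq(X, L_row, L_col):
    """||X||^2_{G_r} + ||X||^2_{G_c} = tr(X^T Delta_r X) + tr(X Delta_c X^T)."""
    return np.trace(X.T @ L_row @ X) + np.trace(X @ L_col @ X.T)

# Two-node path graphs for both rows and columns.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
L = normalized_laplacian(W)

X_smooth = np.ones((2, 2))                       # constant signal
X_rough = np.array([[1.0, -1.0], [-1.0, 1.0]])   # sign-alternating signal
print(dirichlet_norm_sq(X_smooth, L, L))  # 0: constants have no Dirichlet energy
print(dirichlet_norm_sq(X_rough, L, L))   # strictly positive
```

A matrix that varies slowly along both graphs has low Dirichlet energy, which is what the smoothness prior below rewards.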
The geometric matrix completion problem [19] thus boils down to minimizing

min_X ‖X‖²_{G_r} + ‖X‖²_{G_c} + (μ/2) ‖Ω ∘ (X − Y)‖²_F.   (3)

Factorized models. The matrix completion algorithms introduced in the previous section are well-posed as convex optimization problems, guaranteeing existence, uniqueness and robustness of solutions. Besides, fast algorithms have been developed for the minimization of the non-differentiable nuclear norm. However, the variables in this formulation are the full m × n matrix X, making it hard to scale up to large matrices such as the one of the Netflix challenge.

A solution is to use a factorized representation [38, 22, 27, 43, 33, 1] X = WH⊤, where W, H are m × r and n × r matrices, respectively, with r ≪ min(m, n). The use of the factors W, H reduces the number of degrees of freedom from O(mn) to O(m + n); this representation is also attractive because the original matrix is often assumed to be low-rank in matrix completion problems, and rank(WH⊤) ≤ r by construction.

The nuclear norm minimization problem (2) can be rewritten in a factorized form as [38]

min_{W,H} (1/2)‖W‖²_F + (1/2)‖H‖²_F + (μ/2) ‖Ω ∘ (WH⊤ − Y)‖²_F,   (4)

and the factorized formulation of the graph-based minimization problem (3) as

min_{W,H} (1/2)‖W‖²_{G_r} + (1/2)‖H‖²_{G_c} + (μ/2) ‖Ω ∘ (WH⊤ − Y)‖²_F.   (5)

The limitation of model (5) is that it decouples the regularization previously applied simultaneously to the rows and columns of X in (3), but the advantage is linear instead of quadratic complexity.

2.2 Deep learning on graphs

The
key concept underlying our work is geometric deep learning, an extension of CNNs to graphs. In particular, we focus here on graph CNNs formulated in the spectral domain. A graph Laplacian admits a spectral eigendecomposition of the form Δ = ΦΛΦ⊤, where Φ = (φ_1, …, φ_n) denotes the matrix of orthonormal eigenvectors and Λ = diag(λ_1, …, λ_n) is the diagonal matrix of the corresponding eigenvalues. The eigenvectors play the role of Fourier atoms in classical harmonic analysis, and the eigenvalues can be interpreted as frequencies. Given a function x = (x_1, …, x_n)⊤ on the vertices of the graph, its graph Fourier transform is given by x̂ = Φ⊤x. The spectral convolution of two functions x, y can be defined as the element-wise product of the respective Fourier transforms,

x ⋆ y = Φ (Φ⊤y) ∘ (Φ⊤x) = Φ diag(ŷ_1, …, ŷ_n) x̂,   (6)

by analogy with the Convolution Theorem in the Euclidean case.

Bruna et al. [7] used the spectral definition of convolution (6) to generalize CNNs to graphs. A spectral convolutional layer in this formulation has the form

x̃_l = ξ( Σ_{l′=1}^{q′} Φ Ŷ_{ll′} Φ⊤ x_{l′} ),   l = 1, …, q,   (7)

where q′, q denote the number of input and output channels, respectively, Ŷ_{ll′} = diag(ŷ_{ll′,1}, …, ŷ_{ll′,n}) is a diagonal matrix of spectral multipliers representing a learnable filter in the spectral domain, and ξ is a nonlinearity (e.g.
ReLU) applied on the vertex-wise function values. This approach has several drawbacks. First, unlike classical convolutions, which are carried out efficiently in the spectral domain using the FFT, the computations of the forward and inverse graph Fourier transforms incur expensive O(n²) multiplications by the matrices Φ, Φ⊤, as there are no FFT-like algorithms on general graphs. Second, the number of parameters representing the filters of each layer of a spectral CNN is O(n), as opposed to O(1) in classical CNNs. Third, there is no guarantee that the filters represented in the spectral domain are localized in the spatial domain, which is another important property of classical CNNs.

Henaff et al. [15] argued that spatial localization can be achieved by forcing the spectral multipliers to be smooth. The filter coefficients are represented as ŷ_k = τ(λ_k), where τ(λ) is a smooth transfer function of the frequency λ; its application to a signal x is expressed as τ(Δ)x = Φ diag(τ(λ_1), …, τ(λ_n)) Φ⊤x, where applying a function to a matrix is understood in the operator sense and boils down to applying the function to the matrix eigenvalues. In particular, the authors used parametric filters of the form

τ_θ(λ) = Σ_{j=1}^{p} θ_j β_j(λ),   (8)

where β_1(λ), …, β_p(λ) are some fixed interpolation kernels, and θ = (θ_1, …, θ_p) are p = O(1) interpolation coefficients acting as parameters of the spectral convolutional layer.

Defferrard et al.
[10] used polynomial filters of order p represented in the Chebyshev basis,

τ_θ(λ̃) = Σ_{j=0}^{p} θ_j T_j(λ̃),   (9)

where λ̃ is the frequency rescaled to [−1, 1], θ is the (p+1)-dimensional vector of polynomial coefficients parametrizing the filter, and T_j(λ) = 2λT_{j−1}(λ) − T_{j−2}(λ) denotes the Chebyshev polynomial of degree j, defined recursively with T_1(λ) = λ and T_0(λ) = 1. Here, Δ̃ = 2λ_n^{−1}Δ − I is the rescaled Laplacian with eigenvalues Λ̃ = 2λ_n^{−1}Λ − I in the interval [−1, 1].

This approach benefits from several advantages. First, it does not require an explicit computation of the Laplacian eigenvectors, as applying a Chebyshev filter to x amounts to τ_θ(Δ̃)x = Σ_{j=0}^{p} θ_j T_j(Δ̃)x; due to the recursive definition of the Chebyshev polynomials, this incurs applying the Laplacian p times. Multiplication by the Laplacian has a cost of O(|E|), and assuming the graph has |E| = O(n) edges (which is the case for k-nearest neighbor graphs and most real-world networks), the overall complexity is O(n) rather than O(n²) operations, similarly to classical CNNs. Moreover, since the Laplacian is a local operator affecting only the 1-hop neighbors of a vertex, and accordingly its pth power affects the p-hop neighborhood, the resulting filters are spatially localized.

3 Our approach

In this paper, we propose formulating matrix completion as a problem of deep learning on user and item graphs. We consider two architectures, summarized in Figures 1 and 2. The first architecture works on the full matrix model, producing better accuracy but requiring higher complexity.
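As an aside, the recursive Chebyshev filtering scheme of Section 2.2 can be sketched in a few lines of numpy. This is an illustration only (toy 3-node path graph, arbitrary coefficients θ; the helper name `cheb_filter` is ours):

```python
import numpy as np

def cheb_filter(L_scaled, x, theta):
    """Apply tau_theta(L~) x = sum_j theta_j T_j(L~) x using the recursion
    T_j = 2 L~ T_{j-1} - T_{j-2}; only matrix-vector products are needed."""
    T_prev, T_curr = x, L_scaled @ x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for theta_j in theta[2:]:
        T_prev, T_curr = T_curr, 2.0 * (L_scaled @ T_curr) - T_prev
        out = out + theta_j * T_curr
    return out

# Normalized Laplacian of a 3-node path graph; its largest eigenvalue is 2,
# so the rescaled Laplacian is simply L - I.
L = np.array([[1.0, -1 / np.sqrt(2), 0.0],
              [-1 / np.sqrt(2), 1.0, -1 / np.sqrt(2)],
              [0.0, -1 / np.sqrt(2), 1.0]])
L_scaled = L - np.eye(3)

x = np.array([1.0, 0.0, -1.0])
theta = np.array([0.5, 0.3, 0.2])
print(cheb_filter(L_scaled, x, theta))
```

Because only sparse Laplacian applications are used, the same code scales linearly in the number of edges, which is the point of the Chebyshev parametrization.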
The second architecture uses the factorized matrix model, offering better scalability at the expense of a slight reduction in accuracy. For both architectures, we consider a combination of a multi-graph CNN and an RNN, described in detail in the following sections. The multi-graph CNN is used to extract local stationary features from the score matrix using the row and column similarities encoded by the user and item graphs. These spatial features are then fed into an RNN that diffuses the score values progressively, reconstructing the matrix.

3.1 Multi-Graph CNNs

Multi-graph convolution. Our first goal is to extend the notion of the aforementioned graph Fourier transform to matrices whose rows and columns are defined on row- and column-graphs. We recall that the classical two-dimensional Fourier transform of an image (matrix) can be thought of as applying a one-dimensional Fourier transform to its rows and columns. In our setting, the analogue of the two-dimensional Fourier transform has the form

X̂ = Φ_r⊤ X Φ_c,   (10)

where Φ_c, Φ_r and Λ_c = diag(λ_{c,1}, …, λ_{c,n}), Λ_r = diag(λ_{r,1}, …, λ_{r,m}) denote the n × n and m × m eigenvector and eigenvalue matrices of the column- and row-graph Laplacians Δ_c, Δ_r, respectively. The multi-graph version of the spectral convolution (6) is given by

X ⋆ Y = Φ_r (X̂ ∘ Ŷ) Φ_c⊤,   (11)

which in the classical setting can be thought of as the analogue of filtering a 2D image in the spectral domain (the column- and row-graph eigenvalues λ_c and λ_r generalize the x- and y-frequencies of an image).

As in [7], representing multi-graph filters by their spectral multipliers Ŷ would yield O(mn) parameters, prohibitive in any practical application.
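Equations (10) and (11) can be illustrated directly with dense eigendecompositions (a sketch only, on tiny toy graphs; real graphs are far too large for this, and the spectral multiplier τ used here is an arbitrary smooth choice of ours):

```python
import numpy as np

def normalized_laplacian_eigs(W):
    """Eigendecomposition Delta = Phi Lambda Phi^T of a normalized Laplacian."""
    d = W.sum(axis=1)
    L = np.eye(len(W)) - (d ** -0.5)[:, None] * W * (d ** -0.5)[None, :]
    return np.linalg.eigh(L)  # ascending eigenvalues, orthonormal eigenvectors

# Toy row graph (3-node path) and column graph (4-node ring).
Wr = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Wc = np.array([[0, 1, 0, 1], [1, 0, 1, 0],
               [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
lam_r, Phi_r = normalized_laplacian_eigs(Wr)
lam_c, Phi_c = normalized_laplacian_eigs(Wc)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))

X_hat = Phi_r.T @ X @ Phi_c                        # eq. (10): 2D graph FT
tau = np.exp(-(lam_r[:, None] + lam_c[None, :]))   # a hypothetical smooth filter
X_filtered = Phi_r @ (X_hat * tau) @ Phi_c.T       # eq. (11) with Y_hat = tau
print(X_filtered)
```

Since Φ_r and Φ_c are orthogonal, the transform is exactly invertible, and the filter acts as a pointwise reweighting of the joint (row, column) frequencies.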
To overcome this limitation, we follow [15], assuming that the multi-graph filters are expressed in the spectral domain as a smooth function of both frequencies (the eigenvalues λ_c, λ_r of the column- and row-graph Laplacians) of the form Ŷ_{k,k′} = τ(λ_{c,k}, λ_{r,k′}). In particular, using Chebyshev polynomial filters of degree p,¹

τ_Θ(λ̃_c, λ̃_r) = Σ_{j,j′=0}^{p} θ_{jj′} T_j(λ̃_c) T_{j′}(λ̃_r),   (12)

where λ̃_c, λ̃_r are the frequencies rescaled to [−1, 1] (see Figure 4 for examples). Such filters are parametrized by a (p + 1) × (p + 1) matrix of coefficients Θ = (θ_{jj′}), which is O(1) in the input size, as in classical CNNs on images. The application of a multi-graph filter to the matrix X,

X̃ = Σ_{j,j′=0}^{p} θ_{jj′} T_j(Δ̃_r) X T_{j′}(Δ̃_c),   (13)

incurs an O(mn) computational complexity (here, as previously, Δ̃_c = 2λ_{c,n}^{−1}Δ_c − I and Δ̃_r = 2λ_{r,m}^{−1}Δ_r − I denote the rescaled Laplacians).

Similarly to (7), a multi-graph convolutional layer using the parametrization of filters according to (13) is applied to q′ input channels (m × n matrices X_1, …, X_{q′}, or a tensor of size m × n × q′),

X̃_l = ξ( Σ_{l′=1}^{q′} X_{l′} ⋆ Y_{ll′} ) = ξ( Σ_{l′=1}^{q′} Σ_{j,j′=0}^{p} θ_{jj′,ll′} T_j(Δ̃_r) X_{l′} T_{j′}(Δ̃_c) ),   l = 1, …, q,   (14)

producing q outputs (a tensor of size m × n × q). Several such layers can be stacked together. We call this architecture a Multi-Graph CNN (MGCNN).

Separable convolution.
A simplification of the multi-graph convolution is obtained by considering the factorized form of the matrix X = WH⊤ and applying a one-dimensional convolution on the respective graph to each factor. As in the previous case, we can express the filters in terms of Chebyshev polynomials,

w̃_l = Σ_{j=0}^{p} θ^r_j T_j(Δ̃_r) w_l,   h̃_l = Σ_{j′=0}^{p} θ^c_{j′} T_{j′}(Δ̃_c) h_l,   l = 1, …, r,   (15)

where w_l, h_l denote the lth columns of the factors W, H, and θ^r = (θ^r_0, …, θ^r_p) and θ^c = (θ^c_0, …, θ^c_p) are the parameters of the row and column filters, respectively (a total of 2(p + 1) = O(1)). The application of such filters to W and H incurs O(m + n) complexity. The convolutional layers (14) thus take the form

w̃_l = ξ( Σ_{l′=1}^{q′} Σ_{j=0}^{p} θ^r_{j,ll′} T_j(Δ̃_r) w_{l′} ),   h̃_l = ξ( Σ_{l′=1}^{q′} Σ_{j′=0}^{p} θ^c_{j′,ll′} T_{j′}(Δ̃_c) h_{l′} ).   (16)

We call such an architecture a separable MGCNN or sMGCNN.

3.2 Matrix diffusion with RNNs

The next step of our approach is to feed the spatial features extracted from the matrix by the MGCNN or sMGCNN to a recurrent neural network (RNN) implementing a diffusion process that progressively reconstructs the score matrix (see Figure 3). Modelling matrix completion as a diffusion process

¹For simplicity, we use the same degree p for the row and column frequencies.

[Figure 1 (diagram): X(t) is passed through the MGCNN (row+column filtering) to produce X̃(t); the RNN maps X̃(t) to an update dX(t), and the matrix is updated as X(t+1) = X(t) + dX(t).]

Figure 1: Recurrent MGCNN (RMGCNN) architecture using the full matrix completion model and operating simultaneously on the rows and columns of the matrix X.
Learning complexity is O(mn).

[Figure 2 (diagram): the factors W(t) and H(t) are each passed through a GCNN (row filtering and column filtering, respectively) and an RNN producing updates dW(t), dH(t); the factors are updated as W(t+1) = W(t) + dW(t) and H(t+1) = H(t) + dH(t).]

Figure 2: Separable Recurrent MGCNN (sRMGCNN) architecture using the factorized matrix completion model and operating separately on the rows and columns of the factors W, H⊤. Learning complexity is O(m + n).

Figure 3: Evolution of the matrix X(t) with our architecture using the full matrix completion model RMGCNN (top) and the factorized matrix completion model sRMGCNN (bottom). Numbers indicate the RMS error at t = 0, 1, …, 10: RMGCNN 2.26, 1.89, 1.60, 1.78, 1.31, 0.52, 0.48, 0.63, 0.38, 0.07, 0.01; sRMGCNN 1.15, 1.04, 0.94, 0.89, 0.84, 0.76, 0.69, 0.49, 0.27, 0.11, 0.01.

appears particularly suitable for realizing an architecture that is independent of the sparsity of the available information. In order to combine the few scores available in a sparse input matrix, a multilayer CNN would require very large filters or many layers to diffuse the score information across the matrix domains. On the contrary, our diffusion-based approach reconstructs the missing information simply by running an appropriate number of diffusion iterations. This makes it possible to deal with extremely sparse data without requiring excessive amounts of model parameters. See Table 3 for an experimental evaluation of this aspect.

We use the classical LSTM architecture [16], which has proven highly efficient at learning complex non-linear diffusion processes due to its ability to keep a long-term internal state (in particular, limiting the vanishing gradient issue).
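To illustrate the diffusion viewpoint in isolation, here is a deliberately simplified numpy sketch in which the learned MGCNN+LSTM update dX(t) is replaced by a plain gradient step on the data-fit term. Unlike the actual architecture, this stand-in only fits the observed entries (it has no graph features to propagate information to unobserved ones); all names and constants are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 25, 2
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.5
Y = mask * X_true

# Diffusion process X(t+1) = X(t) + dX(t); in the paper dX(t) is produced
# by the MGCNN+LSTM, here by a fixed gradient step on ||Omega o (X - Y)||^2.
X = np.zeros((m, n))
for t in range(100):
    dX = -0.5 * mask * (X - Y)   # stand-in for the learned update
    X = X + dX

# RMSE over the observed entries shrinks geometrically toward zero.
rmse_obs = np.sqrt(((mask * (X - X_true)) ** 2).sum() / mask.sum())
print(rmse_obs)
```

The point of the learned update is precisely what this toy version lacks: the MGCNN features couple entries through the row and column graphs, so the RNN can also fill in the unobserved part of the matrix.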
The input of the LSTM gate is given by the static features extracted from the MGCNN, which can be seen as a projection or dimensionality reduction of the original matrix onto the space of the most meaningful and representative information (the disentanglement effect). This representation, coupled with the LSTM, appears particularly well-suited to keeping a long-term internal state, which allows accurate small changes dX of the matrix X (or dW, dH of the factors W, H) to be predicted and propagated through the full sequence of temporal steps.

Figures 1 and 2 and Algorithms 1 and 2 summarize the proposed matrix completion architectures. We refer to the whole architecture combining the MGCNN and RNN in the full matrix completion setting as recurrent multi-graph CNN (RMGCNN). The factorized version with a separable MGCNN and RNN is referred to as separable RMGCNN (sRMGCNN). The complexity of Algorithm 1 scales quadratically as O(mn) due to the use of the MGCNN. For large matrices, Algorithm 2, which processes the rows and columns separately with standard GCNNs and scales linearly as O(m + n), is preferable. We will demonstrate in Section 4 that the proposed RMGCNN and sRMGCNN architectures perform very well on different settings of matrix completion problems. However, we should note that this is just one possible configuration, which we by no means claim to be optimal. For example, in all our experiments we used only one convolutional layer; it is likely that yet better performance could be achieved with more layers.

Algorithm 1 (RMGCNN)
input: m × n matrix X(0) containing initial values
1: for t = 0 : T do
2:   Apply the Multi-Graph CNN (13) on X(t), producing an m × n × q output X̃(t).
3:   for all elements (i, j) do
4:     Apply the RNN to the q-dimensional vector x̃(t)_ij = (x̃(t)_ij1, …, x̃(t)_ijq), producing the incremental update dx(t)_ij.
5:   end for
6:   Update X(t+1) = X(t) + dX(t)
7: end for

Algorithm 2 (sRMGCNN)
input: n × r factor H(0) and m × r factor W(0) representing the matrix X(0)
1: for t = 0 : T do
2:   Apply the Graph CNN on H(t), producing an n × q output H̃(t).
3:   for j = 1 : n do
4:     Apply the RNN to the q-dimensional vector h̃(t)_j = (h̃(t)_j1, …, h̃(t)_jq), producing the incremental update dh(t)_j.
5:   end for
6:   Update H(t+1) = H(t) + dH(t)
7:   Repeat steps 2–6 for W(t+1)
8: end for

3.3 Training

Training of the networks is performed by minimizing the loss

ℓ(Θ, σ) = ‖X(T)_{Θ,σ}‖²_{G_r} + ‖X(T)_{Θ,σ}‖²_{G_c} + (μ/2) ‖Ω ∘ (X(T)_{Θ,σ} − Y)‖²_F.   (17)

Here, T denotes the number of diffusion iterations (applications of the RNN), and we use the notation X(T)_{Θ,σ} to emphasize that the matrix depends on the parameters of the MGCNN (the Chebyshev polynomial coefficients Θ) and those of the LSTM (denoted by σ). In the factorized setting, we use the loss

ℓ(θ^r, θ^c, σ) = ‖W(T)_{θ^r,σ}‖²_{G_r} + ‖H(T)_{θ^c,σ}‖²_{G_c} + (μ/2) ‖Ω ∘ (W(T)_{θ^r,σ} (H(T)_{θ^c,σ})⊤ − Y)‖²_F,   (18)

where θ^c, θ^r are the parameters of the two GCNNs.

4 Results²

Experimental settings. We closely followed the experimental setup of [33], using five standard datasets: the Synthetic dataset from [19], MovieLens [29], Flixster [18], Douban [27], and YahooMusic [11]. We used disjoint training and test sets, and the presented results are reported on the test sets in all our experiments.
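For concreteness, the loss (17) is simple to write down as a function of the diffusion output; a minimal numpy sketch (toy two-node graphs; in the actual model X(T) is the output of the RMGCNN diffusion, and the helper names here are ours):

```python
import numpy as np

def normalized_laplacian(W):
    """Delta = I - D^{-1/2} W D^{-1/2} for a symmetric adjacency matrix W."""
    d = W.sum(axis=1)
    return np.eye(len(W)) - (d ** -0.5)[:, None] * W * (d ** -0.5)[None, :]

def loss_17(X_T, L_r, L_c, mask, Y, mu):
    """Loss (17): Dirichlet norms of X(T) on both graphs + masked data fit."""
    smooth = np.trace(X_T.T @ L_r @ X_T) + np.trace(X_T @ L_c @ X_T.T)
    fit = 0.5 * mu * np.linalg.norm(mask * (X_T - Y), "fro") ** 2
    return smooth + fit

# Toy setup: two-node path graphs, a constant fully observed target.
L_r = normalized_laplacian(np.array([[0.0, 1.0], [1.0, 0.0]]))
L_c = normalized_laplacian(np.array([[0.0, 1.0], [1.0, 0.0]]))
Y = np.ones((2, 2))
mask = np.ones((2, 2))

print(loss_17(Y, L_r, L_c, mask, Y, mu=1.0))  # constant matrix, perfect fit: 0.0
```

The gradients of this scalar with respect to the MGCNN and LSTM parameters are what an autodiff framework such as TensorFlow backpropagates through the T diffusion steps.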
As in [33], we evaluated MovieLens using only the first of the 5 provided data splits. For Flixster, Douban and YahooMusic, we evaluated on a reduced matrix of 3000 users and items, considering 90% of the given scores as the training set and the remaining ones as the test set. Classical Matrix Completion (MC) [9], Inductive Matrix Completion (IMC) [17, 42], Geometric Matrix Completion (GMC) [19], and Graph Regularized Alternating Least Squares (GRALS) [33] were used as baseline methods. In all the experiments, we used the following settings for our RMGCNNs: Chebyshev polynomials of order p = 4, outputting k = 32-dimensional features, LSTM cells with 32 features, and T = 10 diffusion steps (for both training and test). The number of diffusion steps T was estimated on the MovieLens validation set and used in all our experiments. A better estimate of T could be obtained by cross-validation, and could thus potentially only improve the final results. All the models were implemented in Google TensorFlow and trained using the Adam stochastic optimization algorithm [20] with learning rate 10⁻³. In the factorized models, ranks r = 15 and r = 10 were used for the synthetic and real datasets, respectively.

²Code: https://github.com/fmonti/mgcnn
For all methods, hyperparameters were chosen by cross-validation.

Figure 4: Absolute value |τ(λ̃_c, λ̃_r)| of the first ten spectral filters learnt by our MGCNN model. In each matrix, rows and columns represent the frequencies λ̃_r and λ̃_c of the row and column graphs, respectively.

Figure 5: Absolute values |τ(λ̃_c)| and |τ(λ̃_r)| of the first four column (solid) and row (dashed) spectral filters learned by our sMGCNN model (horizontal axis: frequency λ_r, λ_c in [0, 2]; vertical axis: filter response).

4.1 Synthetic data

We start the experimental evaluation by showing the performance of our approach on a small synthetic dataset, in which the user and item graphs have strong community structure. Though rather simple, such a dataset allows us to study the behavior of different algorithms in controlled settings.

The performance of different matrix completion methods is reported in Table 1, along with their theoretical complexity. Our RMGCNN and sRMGCNN models achieve better accuracy than the other methods with lower complexity. Different diffusion time steps of these two models are visualized in Figure 3. Figures 4 and 5 depict the spectral filters learnt by the MGCNN and by the row- and column-GCNNs.

We repeated the same experiment assuming only the column (users) graph to be given. In this setting, RMGCNN cannot be applied, while sRMGCNN has only one GCNN applied on the factor H (the other factor W is free). Table 2 summarizes the results of this experiment, again showing that our approach performs the best.

Table 3 compares our RMGCNN with more classical multilayer MGCNNs.
Our recurrent solution outperforms deeper and more complex architectures while requiring fewer parameters.

Table 1: Comparison of different matrix completion methods using users+items graphs, in terms of number of parameters (optimization variables) and computational complexity order (operations per iteration). Big-O notation is omitted for clarity. The rightmost column shows the RMS error on the synthetic dataset.

METHOD     PARAMS   NO. OP.   RMSE
GMC        mn       mn        0.3693
GRALS      m+n      m+n       0.0114
sRMGCNN    1        m+n       0.0106
RMGCNN     1        mn        0.0053

Table 2: Comparison of different matrix completion methods using the users graph only, in terms of number of parameters (optimization variables) and computational complexity order (operations per iteration). Big-O notation is omitted for clarity. The rightmost column shows the RMS error on the synthetic dataset.

METHOD     PARAMS   NO. OP.   RMSE
GRALS      m+n      m+n       0.0452
sRMGCNN    m        m+n       0.0362

Table 3: Reconstruction errors on the synthetic dataset for multi-layer convolutional architectures and the proposed recurrent architecture. Chebyshev polynomials of order 4 have been used for both users and movies graphs (q′MGCq denotes a multi-graph convolutional layer with q′ input features and q output features).

Method            Params   Architecture                   RMSE
MGCNN (3 layers)  9K       1MGC32, 32MGC10, 10MGC1        0.0116
MGCNN (4 layers)  53K      1MGC32, 32MGC32 × 2, 32MGC1    0.0073
MGCNN (5 layers)  78K      1MGC32, 32MGC32 × 3, 32MGC1    0.0074
MGCNN (6 layers)  104K     1MGC32, 32MGC32 × 4, 32MGC1    0.0064
RMGCNN            9K       1MGC32 + LSTM                  0.0053

4.2 Real data

Following [33], we evaluated the proposed approach on the MovieLens, Flixster, Douban and YahooMusic datasets. For the MovieLens dataset we constructed the user and item (movie) graphs as unweighted 10-nearest neighbor graphs in the space of user and movie features, respectively.
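The unweighted 10-nearest-neighbor construction can be sketched as follows; the union-style symmetrization is our assumption, as the paper does not specify how directed k-NN edges are symmetrized:

```python
import numpy as np

def knn_graph(F, k=10):
    """Unweighted k-NN adjacency from row-wise feature vectors F (n x d)."""
    D = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(D, np.inf)                              # exclude self-edges
    nn = np.argsort(D, axis=1)[:, :k]                        # k closest nodes per row
    A = np.zeros_like(D)
    A[np.repeat(np.arange(len(F)), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)                                # symmetrize (union of edges)
```

The resulting adjacency matrices serve as the row and column graphs whose rescaled Laplacians feed the spectral filters.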
For Flixster, the user and item graphs were constructed from the scores of the original matrix. On this dataset, we also performed an experiment using only the users graph. For the Douban dataset, we used only the user graph (provided in the form of a social network). For the YahooMusic dataset, we used only the item graph, constructed with unweighted 10-nearest neighbors in the space of item features (artists, albums, and genres). For the latter three datasets, we used a sub-matrix of 3000 × 3000 entries to evaluate the performance. Tables 4 and 5 summarize the performance of different methods. sRMGCNN outperforms the competitors in all the experiments.

Table 4: Performance (RMS error) of different matrix completion methods on the MovieLens dataset.

METHOD         RMSE
GLOBAL MEAN    1.154
USER MEAN      1.063
MOVIE MEAN     1.033
MC [9]         0.973
IMC [17, 42]   1.653
GMC [19]       0.996
GRALS [33]     0.945
sRMGCNN        0.929

Table 5: Performance (RMS error) on several datasets. For Douban and YahooMusic, a single graph (of users and items, respectively) was used. For Flixster, two settings are shown: users+items graphs / only users graph.

METHOD     FLIXSTER          DOUBAN   YAHOOMUSIC
GRALS      1.3126 / 1.2447   0.8326   38.0423
sRMGCNN    1.1788 / 0.9258   0.8012   22.4149

5 Conclusions
In this paper, we presented a new deep learning approach for matrix completion based on a multi-graph convolutional neural network architecture. Among the key advantages of our approach compared to traditional methods are its low computational complexity and its constant number of degrees of freedom, independent of the matrix size. We showed that the use of deep learning for matrix completion allows us to beat related state-of-the-art recommender system methods. To our knowledge, our work is the first application of deep learning on graphs to this class of problems.
We believe that it shows the potential of the nascent field of geometric deep learning on non-Euclidean domains, and will encourage future work in this direction.

Acknowledgments
FM and MB are supported in part by ERC Starting Grant No. 307047 (COMET), ERC Consolidator Grant No. 724228 (LEMAN), a Google Faculty Research Award, an Nvidia equipment grant, a Radcliffe Fellowship from the Harvard Institute for Advanced Study, and the TU Munich Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement No. 291763. XB is supported in part by NRF Fellowship NRFF2017-10.

References
[1] K. Benzi, V. Kalofolias, X. Bresson, and P. Vandergheynst. Song recommendation with non-negative matrix factorization and graph total variation. In Proc. ICASSP, 2016.
[2] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Computer Graphics Forum, 34(5):13–23, 2015.
[3] D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Proc. NIPS, 2016.
[4] D. Boscaini, J. Masci, E. Rodolà, M. M. Bronstein, and D. Cremers. Anisotropic diffusion descriptors. Computer Graphics Forum, 35(2):431–441, 2016.
[5] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. Uncertainty in Artificial Intelligence, 1998.
[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[7] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv:1312.6203, 2013.
[8] E. Candès and B. Recht.
Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[9] E. Candès and B. Recht. Exact matrix completion via convex optimization. Comm. ACM, 55(6):111–119, 2012.
[10] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. NIPS, 2016.
[11] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. In KDD Cup, 2012.
[12] D. K. Duvenaud et al. Convolutional networks on graphs for learning molecular fingerprints. In Proc. NIPS, 2015.
[13] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proc. IJCNN, 2005.
[14] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua. Neural collaborative filtering. In Proc. WWW, 2017.
[15] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv:1306.0626, 2013.
[18] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proc. Recommender Systems, 2010.
[19] V. Kalofolias, X. Bresson, M. M. Bronstein, and P. Vandergheynst. Matrix completion on graphs. arXiv:1408.1717, 2014.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proc. ICLR, 2017.
[22] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[23] S. I. Ktena, S. Parisot, E. Ferrante, M. Rajchl, M. Lee, B. Glocker, and D. Rueckert.
Distance metric learning using graph convolutional networks: Application to functional brain networks. In Proc. MICCAI, 2017.
[24] D. Kuang, Z. Shi, S. Osher, and A. L. Bertozzi. A harmonic extension approach for collaborative ranking. arXiv:1602.05127, 2016.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[26] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In Proc. ICLR, 2016.
[27] H. Ma, D. Zhou, C. Liu, M. Lyu, and I. King. Recommender systems with social regularization. In Proc. Web Search and Data Mining, 2011.
[28] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proc. 3DRR, 2015.
[29] B. N. Miller et al. MovieLens unplugged: experiences with an occasionally connected recommender system. In Proc. Intelligent User Interfaces, 2003.
[30] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, 2017.
[31] S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrerro Moreno, B. Glocker, and D. Rueckert. Spectral graph convolutions for population-based disease prediction. In Proc. MICCAI, 2017.
[32] M. Pazzani and D. Billsus. Content-based recommendation systems. The Adaptive Web, pages 325–341, 2007.
[33] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Proc. NIPS, 2015.
[34] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
[35] S. Sedhain, A. Menon, S. Sanner, and L. Xie. AutoRec: Autoencoders meet collaborative filtering. In Proc. WWW, 2015.
[36] Y.
Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling with graph convolutional recurrent networks. arXiv:1612.07659, 2016.
[37] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Sig. Proc. Magazine, 30(3):83–98, 2013.
[38] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Proc. NIPS, 2004.
[39] Y. Suhara, X. Dong, and A. S. Pentland. Deepshop: Understanding purchase patterns via deep learning. In Proc. International Conference on Computational Social Science, 2016.
[40] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. In Proc. NIPS, 2016.
[41] M. Vestner, R. Litman, E. Rodolà, A. Bronstein, and D. Cremers. Product manifold filter: Non-rigid shape correspondence via kernel density estimation in the product space. In Proc. CVPR, 2017.
[42] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Proc. NIPS, 2013.
[43] F. Yanez and F. Bach. Primal-dual algorithms for non-negative matrix factorization with the Kullback-Leibler divergence. In Proc. ICASSP, 2017.
[44] Y. Zheng, B. Tang, W. Ding, and H. Zhou. A neural autoregressive approach to collaborative filtering. In Proc.
ICML, 2016.